Part III: A/B Testing at HomeToGo — Running the whole A/B pipeline on Snowflake

Agustin Figueroa
HomeToGo Data & Engineering
12 min read · Dec 15, 2023


A/B testing is the bread and butter of experimentation at almost every digital company, and its de facto standard.

This article is the third and last chapter of a three-part series (find links to the first chapter here, and the second chapter here). Throughout this collection, we aim to share our experience at HomeToGo with user testing in terms of fundamentals, governance, and the very specific details of how we perform calculations on Snowflake.

If you are interested in user testing and work in the product/data space, stick around!

Running the whole A/B pipeline on Snowflake

In September 2022, we migrated our A/B testing raw tables from offline SQL scripts to dbt, which was a big step towards reducing complexity. However, the calculations were still performed by an R script that did not run in the data warehouse. This created significant overhead: We needed to generate the input for the statistics script, execute that script elsewhere, and then import the statistics back into the data warehouse for further analysis and reporting. To circumvent this, we opted to move the pipeline entirely into Snowflake. This enabled us to:

  • Reduce complexity (e.g. get rid of importing/exporting data to and from the statistics calculator)
  • Create more transparency in terms of lineage (e.g. who consumes what) and stats calculation (e.g. how we calculate p-values and confidence intervals)
  • Centralize statistical function definitions by leveraging Python UDFs, which can be reused across the company (a sketch of this follows below)
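
To illustrate the last point: a statistical helper can be defined once in Python and registered as a UDF that any SQL model can call. The sketch below uses the Snowpark API and scipy; the function, names, and connection details are illustrative rather than our actual production code.

# Minimal sketch of centralizing a statistical function as a Snowflake Python
# UDF via Snowpark; names, parameters and the connection config are
# illustrative, not production code.
from scipy.stats import ttest_ind_from_stats
from snowflake.snowpark import Session
from snowflake.snowpark.types import FloatType


def welch_p_value(mean_c: float, std_c: float, n_c: float,
                  mean_t: float, std_t: float, n_t: float) -> float:
    # Two-sided Welch t-test p-value from per-variation summary statistics
    return float(ttest_ind_from_stats(mean_c, std_c, n_c,
                                      mean_t, std_t, n_t,
                                      equal_var=False).pvalue)


# session = Session.builder.configs(connection_parameters).create()
# session.udf.register(
#     func=welch_p_value,
#     name="welch_p_value",
#     return_type=FloatType(),
#     input_types=[FloatType()] * 6,
#     packages=["scipy"],
#     is_permanent=True,
#     stage_location="@udf_stage",   # hypothetical stage name
# )
# Once registered, any dbt model can call it directly in SQL, e.g.:
#   select welch_p_value(avg_c, std_c, users_c, avg_t, std_t, users_t) from ...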

Calculating p-values and confidence intervals for different types of metrics

When we calculate test statistics, our base table is at a user level: We aggregate the data on a user grain, and build the rest upon that. As we mentioned before, users are our unit of randomization and are assumed to be iid (Independent and Identically Distributed).

We deal with three types of metrics when A/B testing at HomeToGo:

  • Binomial metrics: There are two possible outcomes per unit of randomization (in our case, per user) — true or false
  • Continuous metrics: Numeric calculations on the unit of randomization (in our case, we calculate the given metric for each user and apply a test to that)
  • Ratio metrics: Numeric calculations that only make sense when considering the whole group of users (calculating these on a user level would not make business sense)

After removing outliers, we proceed to calculate the metrics as well as associated p-values and confidence intervals. As explained below, each type of metric needs to be treated differently.

Binomial metrics

An example of this type of metric is users with a booking. These metrics are the most robust ones, and are purely binomial (i.e. yes or no on a user level). We use chi-squared tests to calculate p-values and to estimate the confidence interval of the uplift.
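
As a rough sketch of what this analytical path can look like (not our exact UDF; the confidence interval of the relative uplift uses the standard log-transform approximation for a ratio of proportions):

# Sketch: analytical p-value and uplift CI for a binomial metric such as
# "users with a booking". Illustrative, not the production implementation.
from math import exp, log, sqrt

from scipy.stats import chi2_contingency, norm


def binomial_metric_stats(conv_c, users_c, conv_t, users_t, alpha=0.05):
    # p-value: chi-squared test on the 2x2 contingency table
    table = [[conv_c, users_c - conv_c], [conv_t, users_t - conv_t]]
    p_value = chi2_contingency(table)[1]

    # CI of the relative uplift via the usual log-transform approximation
    rate_c, rate_t = conv_c / users_c, conv_t / users_t
    se_log_ratio = sqrt(1 / conv_t - 1 / users_t + 1 / conv_c - 1 / users_c)
    z = norm.ppf(1 - alpha / 2)
    log_ratio = log(rate_t / rate_c)
    ci = (exp(log_ratio - z * se_log_ratio) - 1,
          exp(log_ratio + z * se_log_ratio) - 1)
    return p_value, ci


# Example: 5,000 vs 5,200 booking users out of 100,000 users per variation
print(binomial_metric_stats(5000, 100000, 5200, 100000))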

Calculating p-values and confidence interval estimates of the uplift analytically is the cheapest and most efficient option, simply because we do not need bootstrapping at all. Bootstrapping is time-consuming, computationally expensive, and relatively complex to maintain.

In other words, analytical calculations and estimations of p-values and confidence intervals help us not only reduce costs, but also lower the marginal cost of adding new metrics to the pipeline (which will likely be needed as the company grows and we launch new products or features). If we weigh robustness, sensitivity to outliers, and the cost of calculating stats, binomial metrics win across the board.

Continuous metrics

These metrics are still on a user grain, but we are interested in the continuous number (e.g. revenue per user). Here we use t-tests to calculate the p-values and bootstrapping for the confidence intervals. Bear in mind that you are allowed to use t-tests even when the variable itself is not normally distributed, as long as the sampling distribution of its mean is approximately normal (see also this paper).
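
For reference, the p-value side of this is a one-liner with scipy. The sketch below uses hypothetical per-user revenue arrays; the confidence interval would be bootstrapped as described for ratio metrics below.

# Sketch: t-test p-value for a continuous, user-level metric such as revenue
# per user. revenue_c / revenue_t are hypothetical per-user arrays.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
revenue_c = rng.exponential(scale=20.0, size=50_000)   # control users
revenue_t = rng.exponential(scale=20.5, size=50_000)   # treatment users

# Welch's t-test: valid because the *mean* is approximately normal (CLT),
# even though per-user revenue itself is heavily skewed.
t_stat, p_value = ttest_ind(revenue_t, revenue_c, equal_var=False)
print(p_value)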

Ratio metrics

An example of a ratio metric is revenue per impression. We can only calculate this type of metric once per variation across all users. How can we derive a p-value and confidence intervals from this?

The answer for us is bootstrapping (resampling with replacement). Bootstrapping allows you to calculate p-values and confidence intervals for almost anything, but you need to consider that it’s the most expensive calculation path. It’s important to ensure that the benefits outweigh the costs.

In a nutshell, you draw a sample from the pool of users N times and calculate the uplift of the ratio metric each iteration. You can then calculate descriptive statistics based on the estimated distribution of the uplift that bootstrapping generated for you. See this article for a walkthrough of how to implement bootstrapping on Snowflake.

With a sufficiently large number of samples (in our case, N above 900), you can assume normality and use a t-test or z-test for the p-values. It’s worth noting the difference between a distribution tending to normality and being normal: If we can’t ensure that it is normal, using the t-test is the right option, even though the z-test will likely yield extremely similar results.

For the 95% confidence intervals, this becomes a bit easier compared to the p-value calculations: We generate the distribution of the difference between v0 and v1, and then simply calculate the 2.5th and 97.5th percentiles.
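
Putting the last few paragraphs together, a minimal sketch of this procedure could look as follows (the per-user arrays, numbers, and N are illustrative):

# Sketch: bootstrapped uplift distribution for a ratio metric such as
# revenue per impression. Arrays are hypothetical per-user aggregates.
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(0)
n_users, n_boot = 100_000, 1_000          # N above 900, as described above

# Hypothetical per-user totals for control (v0) and treatment (v1)
rev_v0, imp_v0 = rng.exponential(2.0, n_users), rng.poisson(10, n_users) + 1
rev_v1, imp_v1 = rng.exponential(2.1, n_users), rng.poisson(10, n_users) + 1

uplifts = np.empty(n_boot)
for i in range(n_boot):
    # Resample users with replacement, independently per variation
    s0 = rng.integers(0, n_users, n_users)
    s1 = rng.integers(0, n_users, n_users)
    ratio_v0 = rev_v0[s0].sum() / imp_v0[s0].sum()
    ratio_v1 = rev_v1[s1].sum() / imp_v1[s1].sum()
    uplifts[i] = ratio_v1 / ratio_v0 - 1

# 95% CI: 2.5th and 97.5th percentiles of the bootstrapped uplift distribution
ci = np.percentile(uplifts, [2.5, 97.5])

# p-value: treat the bootstrapped uplift as approximately normal and test
# whether it is centred away from 0, with the t-test as the conservative choice
t_stat = uplifts.mean() / uplifts.std(ddof=1)
p_value = 2 * t.sf(abs(t_stat), df=n_boot - 1)
print(ci, p_value)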

Validity checks implementation

Penalizing p-values to reduce peeking & false positives

What is peeking?

“Peeking”, or sometimes “unaccounted peeking” and more precisely “unaccounted peeking with intent to stop” is a term used to describe the poor practice of checking the results of an ongoing A/B test with the intent to stop it and make a decision or inference based on the observed outcome: value of the variable of interest, reaching a given significance threshold, etc. Peeking is a significant threat to the validity of any online controlled experiment as it can uncontrollably increase the type I error rate and render any significance calculations or confidence intervals meaningless.

It is a poor practice since it causes a discrepancy between the nominal p-value you calculate and the actual p-value which reflects the probability of observing the result you observed under the actual circumstances. With just a few looks at the data the actual significance can be orders of magnitude larger than the nominal.

Source: Analytics-toolkit

What to do about peeking?

We definitely recommend taking peeking into account when calculating and showing p-values. We do this by correcting the nominal p-values so that they get as close to the actual p-values as possible.

Let’s take a step back and think of experimentation in a digital product. When we are running tests we basically want two things:

  1. Their results to be rigorous
  2. To spot early on variations that break the user experience

As you may have realized, (2) looks somewhat at odds with (1). If you run a test for a month, until you reach the minimum sample size, your results will be rigorous (1), but what if a bug in the checkout was dragging conversion rate down by 10%? Then 50% of the traffic * 10% decrease = 5% total loss. On the other hand, if you always look at results and p-values, you would spot (2), but you are also peeking and the results will not be rigorous. It would also be hard to separate the signal from the noise. The answer must be something in between.

What has worked quite well for us, as well as for other companies in the industry, is penalizing p-values. The reasoning behind this is factoring in uncertainty: A p-value of 0.1 reflects a different underlying level of uncertainty if it is the result of a calculation with 1,000 users or with one million users. P-values do not reflect this: By design, a p-value is meant to be calculated only once, not repeatedly over the course of a controlled trial.

The idea is that the corrected p-value converges to the classical p-value if the sample is large enough. In any other case, the corrected p-value will always be higher than the classical p-value (one can interpret this as “less significant”). How much you penalize your p-value in the beginning, and when you make it converge with the classical p-value, depends largely on your business, your data, and how complex you want this calculation to be.
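
The exact functional form is up to you. Purely to illustrate the convergence property described above (this is not our production formula), a penalty that decays with the number of observed rare events could look like this:

# Illustrative only: one possible shape for a penalized p-value that is close
# to 1 for small samples and converges to the classical p-value as the number
# of observed rare events (e.g. booking users) grows. Not HomeToGo's formula.
import numpy as np


def penalized_p_value(p_value: float, rare_events: int, scale: float = 500.0) -> float:
    # penalty -> 1 when rare_events is small, -> 0 when rare_events >> scale
    penalty = np.exp(-rare_events / scale)
    return float(p_value + (1.0 - p_value) * penalty)


# With few booking users the same nominal p-value is reported as far less
# significant; with many it is left (almost) untouched.
print(penalized_p_value(0.03, rare_events=100))    # ~0.82
print(penalized_p_value(0.03, rare_events=5000))   # ~0.03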

We recommend building this off your rarest event (in our case, users who book) and adding a few parameters you can use to fine-tune sensitivity. Bear in mind that if a change across variations is very extreme (e.g. a drop of 80%), you still want to see it even if you only have a few users! What comes next is understanding how the p-value penalization affects power and significance in your data, which you can do through simulation. We simulated both A/A tests (to check the false-positive rate) and uplifts (to see power).

Simulation will help you iterate and find the hyperparameters that work for you. Running sanity checks on this and readjusting hyperparameters, particularly if the frequency of your rarest event changes, is highly recommended.
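
A minimal version of such a simulation, reusing the illustrative penalized_p_value sketched above, could look like this (traffic numbers and parameters are made up):

# Sketch: simulate A/A tests to estimate the false-positive rate under a given
# penalization, and simulated uplifts to estimate power. Traffic numbers and
# penalization parameters are made up for illustration.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(7)
users_per_variation, base_rate, n_sims, alpha = 50_000, 0.05, 2_000, 0.05


def penalized_p_value(p, rare_events, scale=500.0):   # from the sketch above
    return p + (1 - p) * np.exp(-rare_events / scale)


def simulate(uplift: float) -> float:
    rejections = 0
    for _ in range(n_sims):
        conv_c = rng.binomial(users_per_variation, base_rate)
        conv_t = rng.binomial(users_per_variation, base_rate * (1 + uplift))
        table = [[conv_c, users_per_variation - conv_c],
                 [conv_t, users_per_variation - conv_t]]
        p = chi2_contingency(table)[1]
        p = penalized_p_value(p, rare_events=min(conv_c, conv_t))
        rejections += p < alpha
    return rejections / n_sims


print("false positive rate (A/A):", simulate(uplift=0.0))
print("power at +3% uplift:      ", simulate(uplift=0.03))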

In case you want to read more, we recommend checking Etsy’s example and approach on the matter.

Ensuring power of a given metric

The idea of power is intrinsic to testing (you can check what power in testing is here). In this post we want to focus on two situations we often encounter where the power concept comes in handy:

  1. Most of my tests come back neutral, why?
  2. I ran this test for 3 days and it looks very significant, are we safe to release this?

The situation in (1) suggests that you are either testing impactless material, or that you do not let your tests run long enough. Estimating the sample size can be painful, especially if it shows you that, given your traffic and expected impact, you won’t be able to draw any conclusions in a reasonable time. But this is fine! It’s better to know this in advance than to burn resources only to find it out later. Estimating the sample size in advance is necessary and useful; however, the expected effect sometimes differs from what you see in the test. Recalculating power is not perfect, but it gives you a rough idea of where you stand in terms of the traffic you need.

On the other hand, the situation in (2) is one where you can profit from your scepticism the most: If it looks too good to be true, it likely is. A metric can be significant at a given point in time yet underpowered. When we encounter these cases, we think of them as “it looks like it’s going in a positive direction, yet we need to run this a bit longer to make a call on significance.”

To help end consumers understand where their tests stand in terms of power, we show whether a metric is powered next to the significance value. As the calculations are much simpler for binomial metrics than for continuous metrics, we only do it for the former (link to the code here). If you know that a binomial and a non-binomial metric are heavily linked (e.g. users with a booking and revenue per user in our business), you can check that the binomial one is powered. From there, you can infer that the revenue one would be too, or check how many users you would need. It’s not perfect, but it’s good enough and easy to implement. In our case, this is enough.
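
For the binomial case, the achieved power of a two-proportion test can be approximated analytically with the standard normal approximation; the sketch below (not the code linked above) uses illustrative numbers:

# Sketch: approximate achieved power for a binomial metric (two-proportion
# z-test, normal approximation). Numbers are illustrative.
from math import sqrt

from scipy.stats import norm


def achieved_power(rate_c, users_c, rate_t, users_t, alpha=0.05):
    # Standard error of the difference in conversion rates
    se = sqrt(rate_c * (1 - rate_c) / users_c + rate_t * (1 - rate_t) / users_t)
    z_crit = norm.ppf(1 - alpha / 2)
    effect = abs(rate_t - rate_c) / se
    # Probability of rejecting H0 given the observed effect size
    return norm.cdf(effect - z_crit) + norm.cdf(-effect - z_crit)


# e.g. 5.0% vs 5.2% booking rate with 40,000 users per variation
print(achieved_power(0.050, 40_000, 0.052, 40_000))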

Binomial tests to check proper user assignment

With the exception of canary tests, which are by design different (e.g. one may assign 99% traffic to control and 1% to treatment), we expect users to have equal chances to enter either variation. This is a fundamental assumption we make when calculating stats. When this does not hold, we don’t display any results, except for totals, to prevent misinterpretations.

To check whether the assignment follows a binomial distribution, we run a binomial test with p=0.5. In our case, the trials are users landing in control or treatment, and our goal is to detect cases where users did not have equal chances of falling into either group. The test returns a p-value that we informally read as the probability of seeing a difference in user assignment at least as large as the observed one purely by chance; when it falls below a certain threshold, we flag the test.
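
For reference, the exact version of this check is a one-liner with scipy (with made-up counts); our production implementation uses the normal approximation shown in the SQL below.

# Sketch: exact binomial test on user assignment, with made-up counts.
from scipy.stats import binomtest

users_control, users_treatment = 50_800, 49_200
result = binomtest(users_control, n=users_control + users_treatment, p=0.5)
print(result.pvalue)   # low p-value -> assignment unlikely to be a fair 50/50 split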

-- Implementing a test to check if the sampling between v0 and v1 follows a
-- p=0.5 binomial distribution: https://en.wikipedia.org/wiki/Binomial_test
-- metric_value      = users in control
-- metric_change     = relative change in users between variations (in %)
-- metric_change_abs = absolute difference in users between variations
(case
    when abs(metric_change) >= 0.5   -- user change of at least 0.5%
        and abs(metric_change) <= 70 -- heuristic to filter out canary tests
        and metric_value >= 5000     -- at least 5,000 users in control
    -- z-score of the normal approximation to the binomial test:
    -- |users_in_control - n/2| / sqrt(n/4), with n = total users in the test
    then abs(metric_value - 0.5 * (metric_value + metric_value + metric_change_abs))
        / sqrt(0.5 * 0.5 * (metric_value + metric_value + metric_change_abs))
    else 0
end >= 5)::int as bad_assignment     -- flag only extreme deviations (z-score >= 5)

When we first implemented this, we used to interpret the binomial test’s Z-value > 1.96 as “95% chance that users did not have the same likelihood of falling into control or treatment”. However, when doing this, we became victims of peeking and consequently experienced a flood of false positives. The adjustments we’ve made in the current implementation catch the most extreme cases and keep noise quite low. While these conditions suit our needs, make sure you adjust them to cater to yours.

The conditions that all need to be met for the flag to go off are:

  • Z-Value > 5
  • User change > 0.5%
  • User change < 70% (as a heuristic to filter out canary tests)
  • At least 5000 users in control

A few notes on user assignment

The iid assumption we make when testing

When we A/B test, we rely on statistical tools that are meaningless if the right assumptions don’t hold (as your old Stats 101 book still points out).

We expect the unit of randomization to be iid because we randomize on a user level. This has a few implications when we run the binomial test to check proper user assignment (section above):

  • We want to check that the assignment to control or treatment is completely randomized. Except for specific cases, such as canary tests, the likelihood of falling into treatment or control is theoretically the same (with the most common traffic assignment being 50/50).
  • Our ultimate assumption is that what one user does is completely independent from what any other user does.

Considering only affected users (a.k.a. “Long tails”)

We often come across situations where only a small portion of our website visitors are exposed to a change during a test. For example, think about a change in the payment selection process. Only a small percentage of our users actually reach this stage. Therefore, we only want to include a user in the test when their experience has been influenced by the changes we’re testing.

This concept of including only affected users in a test is what we refer to as “long-tails.” In other words, we consider a user and their interactions as part of a test only after their experience has been impacted by the changes we’re testing. For example, if we modify the way payment methods are displayed, we will only assign users to the test who have actually seen the payment methods.

However, having the ability to select which users participate in a test also comes with responsibility. We must ensure that our testing process remains rigorous and scientific. For this reason, we systematically check for adequate user assignment, as it helps us maintain fairness in how users are assigned to variation or control.

In our experience, one common reason for variations in user counts across different versions is the activation of “long-tails” at different stages. For example, in the case of changing payment method sorting, if one version merges previous checkout steps, it may appear that more users in that version are interacting with payment methods. In reality, this happens because the point at which users enter the test differs between versions.

Last but not least, even when randomization is working as expected and the long-tail activation point is the same, we sometimes still observe significant differences in user counts between the control and treatment groups. What we have often seen is that the test may be affecting page load times, causing users who might have bounced in one version to stay in the test in another.

Currently, we haven’t found a solution to address this issue completely: Filtering the analysis down to users who didn’t bounce is an option, but it’s not a comprehensive one. Instead, we focus on looking at the overall impact on core metrics. One needs to bear in mind that all metrics that rely on users or sessions (e.g. impressions per user or impressions per session) are directly affected by this phenomenon. However, total metrics (e.g. total impressions) should not be as impacted because users who enter one version (and not the other) are typically low-intent users who should not significantly affect the total results.

Thanks Athene Cook, Stephan Claus, Hiep Minh Pham, Domenico Boris Salvati and Breanna Shepard for your feedback and reading drafts of the whole series.
