We run a lot of A/B tests at Thumbtack, and at that scale we want to make sure we run them correctly. One issue we've run into is that a difference can exist between the test and control groups purely by chance, even when we randomize. This creates uncertainty in our online A/B tests. In these cases, the question we need to answer is: if we observe a difference, is it because of the test feature we just introduced, or because the difference was pre-existing?
We propose that for online designed experiments, a proper randomization procedure enables us to attribute an observed difference to the test feature, rather than to a pre-existing difference between the test groups. Implementing this approach has given us greater confidence in our A/B test results than we had previously. In developing it, we drew on the PhD thesis of Kari Frazer Lock.
To illustrate the problem, let’s consider an example. Suppose our engineering team has decided to play a friendly game of tug-of-war after eating their favorite superfoods. To test out which kind of food helps people win, we randomly assign them into two groups. Team A channels their inner Popeye and eats spinach salad, and team B decides to chow down on their superfood of choice: kale. Team A wins and claims that spinach induces superior strength. Is that so?
We all differ in height, weight, inherent strength, and so on. Suppose that by chance alone, team B's members average 15 pounds lighter than team A's, and all the folks over 6 feet tall ended up on team A. Plus, team A got 18 engineers while team B got only 17. Adding up all these differences, team A ended up with a disproportionate advantage!
Spinach, that wasn’t fair!
Variation in Test Units
In online A/B experiments, the test unit is usually a user or visitor. Users differ greatly from one another: some visit very often, some have faster internet connections, some use mobile exclusively, and so on. Because each user interacts repeatedly with our site, we can proactively seek to balance the groups based on historical data.
Randomization alone is not enough
Even if we randomize the initial assignment of users between test and control groups, a difference could still exist between the groups that is due not to the test feature but to chance alone. This causes uncertainty in our online A/B tests. If we observe a difference, is it because of the test feature we just introduced, or because the difference was pre-existing?
If we have baseline characteristics for test subjects, we can try to balance the test and control groups on these characteristics before we run the experiment so any observed difference can be attributed to the test feature.
Obviously, this is only feasible when we have some information on the test subjects. When Thumbtack was relatively new and most observed interaction between users and the product came from new users, this step could not be done. Now that Thumbtack gets a lot of repeat visits, we can strive to balance out experiments in a way we weren’t able to do previously and thus get more accurate measures which give us more confidence in our conclusions.
The chance that at least one of the test groups has different baseline characteristics rises as variation among test units increases and as the number of samples in each group declines (e.g. when we test multiple variants in a single experiment).
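To see why checking more baseline metrics makes a chance imbalance more likely, consider k independent metrics each checked at significance level alpha = 0.05: the probability that at least one shows a spurious difference is 1 - (1 - alpha)^k. A quick sketch (the function name is ours, purely for illustration):

```python
def prob_any_false_positive(k: int, alpha: float = 0.05) -> float:
    """Probability that at least one of k independent baseline metrics
    shows a 'significant' pre-existing difference purely by chance."""
    return 1.0 - (1.0 - alpha) ** k

# With one metric the chance is 5%; with ten it is already about 40%.
for k in (1, 3, 10):
    print(k, round(prob_any_false_positive(k), 3))
```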
The Solution V1: A/A Tests
One natural solution is to run an A/A test on the test groups before the test feature is introduced. For example, if we will roll out a feature next month, we can assign users into two groups according to some rule and measure their metrics on last month's data. In the month before our test, the test and control groups should have no pre-existing difference, hence "A/A" as opposed to "A/B" tests. In our tug-of-war example, an "A/A" test would be a game with both teams on the same diet.
Historically, this is how randomized experiments have been done in the biomedical field. In any published paper on such studies, the very first section establishes that there is no existing imbalance (the famous "Table 1"). If any imbalances do exist, they can still be accounted for in downstream statistical analysis.
For any web-facing company running online A/B tests, A/A testing is a relatively cheap step in the right direction. When an A/A test shows a pre-existing imbalance, i.e. the test "fails", we should take caution in interpreting the A/B test result. Depending on the severity of the imbalance, we can choose to ignore it, statistically adjust for it, or re-run the experiment.
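As a concrete sketch, an A/A test can be as simple as a two-sample test on a pre-period metric for the two proposed groups. This is a minimal illustration, not Thumbtack's production code; the `aa_test` name, the simulated metric values, and the large-sample z approximation are our assumptions:

```python
import numpy as np
from math import erf, sqrt

def aa_test(control: np.ndarray, treatment: np.ndarray, alpha: float = 0.05):
    """Two-sample z-test on a pre-period metric.
    Returns (p_value, passed): 'passed' means no significant
    pre-existing difference was detected."""
    diff = treatment.mean() - control.mean()
    se = sqrt(control.var(ddof=1) / len(control)
              + treatment.var(ddof=1) / len(treatment))
    z = diff / se
    p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided, normal approx.
    return p, p >= alpha

# Example: last month's metric for the two proposed groups.
rng = np.random.default_rng(0)
group_a = rng.normal(10, 2, 25_000)
group_b = rng.normal(10, 2, 25_000)
p_value, balanced = aa_test(group_a, group_b)
```

With online sample sizes in the tens of thousands, the normal approximation is typically adequate; smaller samples would call for a t-test instead.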
But isn't there a better way than waiting to see whether an "A/A" test fails?
The Solution V2: Repeated Re-randomization
In online A/B tests, test units are usually assigned according to their id, and a random seed ensures each experiment uses a different randomization. Typically a hash function, say SHA1, takes the seed and user id and turns them into an integer; these integers are then bucketed into test groups.
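A minimal sketch of such hash-based assignment (the function name and the `seed:user_id` concatenation scheme are illustrative assumptions, not Thumbtack's actual implementation):

```python
import hashlib

def assign_group(user_id: str, seed: str, n_groups: int = 2) -> int:
    """Hash the experiment seed together with the user id (SHA1 here)
    and bucket the resulting integer into one of n_groups."""
    digest = hashlib.sha1(f"{seed}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % n_groups

# The same (seed, user) pair always maps to the same group;
# choosing a new seed re-shuffles every user.
group = assign_group("user_42", seed="experiment_7")
```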
We can repeatedly compute A/A tests results until we have found a split where all A/A tests are flat. This step can greatly reduce the chance of a failed A/A test run on the pre-experiment period.
It turns out to be quite simple in theory!
1. Randomly select a seed and randomize test subjects by this seed.
2. Run the A/A test on all metrics of interest.
3. If the A/A test fails on any dimension, discard the seed and go back to Step 1.
This way we end up with a seed that balances the test subjects. The procedure can go through anywhere from tens to thousands of seeds before finding a balanced one; how many depends on the number of baseline characteristics we want to balance on and on the amount of variation among subjects.
In theory it is possible for our historical metrics to be correlated in a way that makes the repeated procedure take unreasonably long to find a proper assignment. In practice, we keep an upper bound M on the number of trials, and we track every seed along with its minimum p-value across all the baseline characteristics. If the procedure fails to find a seed that balances the treatment variants to the pre-specified thresholds within M steps, we return the best result for a human to judge. This guarantees the procedure has a stopping point.
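The bounded search can be sketched as follows. This is our illustrative reconstruction: the function names, the z-test balance check, the p-value floor of 0.2, and the use of a numpy permutation in place of real hash-based id splitting are all assumptions:

```python
import numpy as np
from math import erf, sqrt

def two_sided_p(a: np.ndarray, b: np.ndarray) -> float:
    """Large-sample two-sided z-test p-value for a difference in means."""
    se = sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    z = (a.mean() - b.mean()) / se
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

def find_balanced_seed(metrics: np.ndarray, max_trials: int = 1000,
                       p_floor: float = 0.2, seed0: int = 0):
    """Search up to max_trials seeds for a 50/50 split that balances
    every column of `metrics` (rows = users, columns = baseline metrics).
    If no seed clears p_floor, return the best one found for a human
    to judge."""
    n = len(metrics)
    best_seed, best_p = None, -1.0
    for trial in range(max_trials):
        rng = np.random.default_rng(seed0 + trial)
        mask = rng.permutation(n) < n // 2          # candidate split
        min_p = min(two_sided_p(metrics[mask, j], metrics[~mask, j])
                    for j in range(metrics.shape[1]))
        if min_p > best_p:                          # track the best seed seen
            best_seed, best_p = seed0 + trial, min_p
        if min_p >= p_floor:                        # all metrics balanced: stop
            break
    return best_seed, best_p
```

In production, each candidate split would come from hashing user ids with the candidate seed, as described earlier, rather than from a numpy permutation.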
What does the human judge do? Based on domain knowledge and business priorities, this human (a data scientist at Thumbtack) can choose to re-run the procedure, decide that the best of the M results is good enough, or further scrutinize the metric computation and selection via offline analysis.

Waterproof solution?
But does this procedure guarantee perfect balance in our test and control group every time?
No. It only minimizes imbalance as best we can. Potential reasons include:
- Randomness: observations from random variables are inherently noisy, for both endogenous and exogenous reasons.
- Existing users can change their behavior, independent of our test feature.
- Within-company changes, e.g. an ad campaign could start and affect regions in only one of the test groups.
- Another team could start an experiment in the next month that inadvertently and partially overlaps with our experiment.
- A subset of users shows strong seasonal difference, e.g. snow plowing and yard work.
- Externalities, e.g. competitor that targets a certain segment could show up and affect the whole marketplace.
- New users may sign up during the test period. We cannot balance new visitors on historical data; for them we can rely only on randomization, so imbalance can occur.
The solution we have developed is not waterproof. However, implementing it has given us greater confidence in our A/B test results than we had previously.
Empirical Results via Simulation
To illustrate, we simulate three metrics X, Y, and Z, measured over N users, from a multivariate normal distribution with pre-specified mean and variance-covariance structure. We then randomly split the users into two groups and test for a difference, i.e. perform "A/A" tests. In the following examples, we assume a total of 50,000 users and two equal-sized variants; we simulate 100 rounds and count the number of false positives for each metric.
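A sketch of this setup for the independent-normal case, using a large-sample z-test for each metric; the helper names are ours, and the exact counts will vary with the random seed:

```python
import numpy as np
from math import erf, sqrt

def two_sided_p(a, b):
    """Large-sample two-sided z-test p-value for a difference in means."""
    se = sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    z = (a.mean() - b.mean()) / se
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

rng = np.random.default_rng(1)
N, K, alpha = 50_000, 100, 0.05
false_positives = np.zeros(3, dtype=int)      # counts for X, Y, Z
for _ in range(K):
    data = rng.normal(size=(N, 3))            # three independent normal metrics
    mask = rng.permutation(N) < N // 2        # random 50/50 split
    for j in range(3):
        if two_sided_p(data[mask, j], data[~mask, j]) < alpha:
            false_positives[j] += 1
print(false_positives)   # each count should land near K * alpha = 5
```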
Case 1: independent normal
When X, Y, and Z are independent, we expect p-values from the "A/A" test to follow a uniform (0,1) distribution. It is then trivial to compute the expected number of false positives when we repeat the simulation K=100 times: roughly 5 significant in X, 5 in Y, and 5 in Z. Indeed, we observe 2 in X, 6 in Y, and 4 in Z.
Delta and its 95% confidence interval clearly show that, after re-randomization, the "A/A" test yields much better balanced groups.
Case 2: independent metrics, one log normal
Of course, we are rarely so lucky as to have all normally distributed metrics. So now let's change things up by making Y follow a log-normal distribution. In terms of false positives before re-randomization, there were 5 in X, 6 in Y, and 6 in Z. After re-randomization, everything is well balanced.
Case 3: independent discrete values
What if our metric is discrete? Let's check by making Z a discrete variable. There were 6 false positives in X, 3 in Y, and 6 in Z.
Case 4: Correlated metrics
Finally, we investigate correlation between metrics X, Y, and Z. It is trivial to derive the expected number of false positives; we leave that as an exercise for the reader, along with why it is OK to use a z-test in all of the above situations. Here, as an arbitrary choice, X and Y are moderately positively correlated (correlation coefficient 0.5), X and Z are mildly positively correlated (0.2), and Z is mildly negatively correlated with Y (-0.1). With X and Y positively correlated, they had 11 and 6 false positives respectively, while Z had 6.
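The stated correlation structure can be encoded as a covariance matrix and simulated directly; the zero means, sample size, and seed here are arbitrary choices of ours:

```python
import numpy as np

# Correlation structure from the text: corr(X, Y) = 0.5,
# corr(X, Z) = 0.2, corr(Y, Z) = -0.1. With unit variances,
# the correlation matrix doubles as the covariance matrix.
corr = np.array([[1.0,  0.5,  0.2],
                 [0.5,  1.0, -0.1],
                 [0.2, -0.1,  1.0]])
rng = np.random.default_rng(2)
data = rng.multivariate_normal(mean=np.zeros(3), cov=corr, size=10_000)
empirical = np.corrcoef(data.T)   # should be close to `corr`
```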
In all of the cases above, it is clear that such a procedure improves balance, and can help us draw better inference in subsequent A/B tests.