
ABBA

A/B Test (Split Test) Calculator


Frequently Asked Questions

What is Abba?

Abba helps you interpret the results of binomial experiments. In this kind of experiment, you run a number of trials, each of which ends in either a "successful" outcome or a "failure" outcome. Trials are divided into two or more groups, and the goal of the experiment is to draw conclusions about how the chance of success differs between those groups. This usually boils down to determining whether the success rate is higher for one group than for another.

In the world of consumer web, some common examples would be:

- testing two versions of a signup form to see which one gets more users to sign up
- testing several versions of an ad or button to see which one gets the highest click-through rate
- testing variations of a checkout flow to see which one leads to the most purchases

In all of these examples, the groups are the different variations, and the "successful" outcome is the one where the user completes the desirable action: signing up, clicking through, or purchasing.

Abba handles binomial experiments with any number of groups. One group is always designated as the baseline and all other groups are compared against it. Abba computes a few useful results:

- the observed success rate for each group, along with a confidence interval
- for each group other than the baseline, a p-value indicating how confident we can be that its success rate truly differs from the baseline's
- for each group other than the baseline, a confidence interval on its relative improvement over the baseline

As you may have noticed, Abba updates the URL fragment for each report you generate, so you can easily share reports by copying the URL and sending it to your friends, coworkers, Twitter followers, blog readers, or maybe just Mom.

Abba was motivated by the experiments we run daily to make data-informed product decisions here at Thumbtack (shameless plug: we're hiring!). Abba's interface was inspired by Google's excellent Website Optimizer tool.

How does the code work?

abba/stats.js implements all of the statistics (detailed below). It relies on the jStat library for approximations to normal and binomial distribution functions.
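
For example, typical calls look something like this (an illustrative sketch rather than the exact calls stats.js makes; it assumes the jStat global is loaded on the page):

// Standard normal CDF: P(Z <= 1.96), roughly 0.975.
var normalProbability = jStat.normal.cdf(1.96, 0, 1);
// Binomial CDF: probability of at most 20 successes in 100 trials at rate 0.25.
var binomialProbability = jStat.binomial.cdf(20, 100, 0.25);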

abba/render.js displays the test results given trial and success counts for each group. It can be reused in your own applications — the Abba class provides a friendly interface, e.g.,


// Baseline group: label, number of successes, number of trials.
var abba = new Abba.Abba('Existing version', 20, 100);
// Each additional variation is compared against the baseline.
abba.addVariation('Red button', 25, 100);
abba.addVariation('Green button', 30, 100);
// Render the results into a jQuery-selected container element.
abba.renderTo($('#results'));

render.js isolates the DOM manipulation code in view classes so that the rest of the logic (in ResultsPresenter) can be easily tested. We use Protovis to render the visual confidence intervals.

The rest of the demo app is implemented in demo/app.js as both a useful tool and an example of how you might use Abba as a library yourself. It uses js-hash for history handling.

All three modules are unit tested using Jasmine, and of course we rely on the wonderful jQuery.

You can find all of the source in our GitHub repo.

What's a p-value?

Suppose we run an experiment with two versions of a signup form, in which 100 (randomly chosen) users see an experimental form B and the remaining users see the existing form A. Suppose more users sign up with form B than with form A. We'd like to conclude that we've found a real improvement: form B actually has a higher "true" signup rate. Unfortunately, there's another possible explanation: that both forms have the same "true" signup rate, but form B "got lucky" and had more signups just by random chance.

The p-value helps quantify our confidence that form B really is better and didn't just "get lucky". If we assume the two forms actually do have the same signup rate, the p-value is the probability of observing a discrepancy at least as large as the one we actually saw. A p-value of 0.01 means that if the two forms really are the same, then we just happened to get a 1 in 100 outcome. That's pretty unlikely, and in most cases we'd be willing to accept that form B really is better. With a p-value of 0.25, on the other hand, the observed discrepancy isn't all that unlikely even if the two forms have exactly the same signup rate. We can't conclude that they actually do have the same signup rate — all we know is that we can't conclude they're different (and that we need more data to have any confidence in such a conclusion).
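
To make that definition concrete, here's a standalone brute-force sketch (not how Abba actually computes p-values; the exact method is described under "What are the underlying statistics?" below). It assumes both forms share one "true" signup rate and asks how often chance alone produces a gap at least as large as an observed 20 vs. 30 signups out of 100 users each:

// Estimate a p-value by simulation: assume both forms convert at the pooled
// rate and count how often the observed gap (or a larger one) shows up anyway.
function simulateSignups(users, signupRate) {
    var signups = 0;
    for (var i = 0; i < users; i++) {
        if (Math.random() < signupRate) signups++;
    }
    return signups;
}

var usersPerForm = 100;
var observedGap = Math.abs(30 / 100 - 20 / 100);  // form B: 30/100, form A: 20/100
var pooledRate = (20 + 30) / (100 + 100);         // rate if the forms are really the same
var atLeastAsExtreme = 0, numSimulations = 100000;
for (var i = 0; i < numSimulations; i++) {
    var rateA = simulateSignups(usersPerForm, pooledRate) / usersPerForm;
    var rateB = simulateSignups(usersPerForm, pooledRate) / usersPerForm;
    if (Math.abs(rateB - rateA) >= observedGap) atLeastAsExtreme++;
}
// The fraction printed here is (approximately) the two-sided p-value.
console.log(atLeastAsExtreme / numSimulations);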

It turns out there's another interpretation for the p-value. Suppose we accept any result with a p-value below 0.05. Then if we run many such tests where the two forms actually are the same, we'd expect about 5% of cases to generate a discrepancy large enough for us to accept, just by chance. So the p-value can be thought of as the chance of accepting a false positive over many such experiments.

As noted above, the p-value says nothing about the magnitude of the difference between the two forms. The p-value only distinguishes between "they're the same" and "they're different", but "they're different" may mean "they're different by 0.001%". If you have a billion users try each form, you'll probably get a tiny p-value, but that doesn't mean the difference is meaningful to your business. The relative improvement confidence interval tends to be much more useful for making practical decisions.

That said, the usual question about p-values is: how low does it have to be? The answer, of course, is: it depends. If you're a startup testing small copy changes on a page, maybe a p-value of 0.1 is fine — better to move quickly and be wrong 10% of the time. On the other hand, if you're testing a prototype of a page that will take months to fully implement, maybe you want to wait for a p-value of 0.01 or 0.001 before investing serious resources in the implementation. Unfortunately, the only universal rule is that you have to use that squishy thing between your ears.

What exactly do the confidence intervals mean?

The goal of any experiment is to estimate some "true" information about the real world given a limited amount of observed data. In the case of Abba, we want to estimate the success rate of each group and the improvement of each group over the baseline. We obviously don't expect these estimates to be exactly correct, but we do expect them to get more accurate when we have more data.

A confidence interval is an attempt to quantify this effect. Roughly, it can be thought of as a range of values that we can be reasonably confident includes the "true" value. These are useful because they give more practical information and generally offer more insight than a p-value alone.

For example, a low p-value may tell us that a variation is confidently better than the baseline, but the confidence interval on relative improvement could tell us that it's confidently at least 10% better than the baseline, or perhaps no more than 10% better, in which case we may choose to ignore the results anyway. On the flip side, a high p-value may show that we need more data to conclude a variation is better than the baseline, but the relative improvement confidence interval may tell us that we can confidently expect no more than a 20% improvement. If that's too small an improvement, we may choose to abandon the test and explore other ideas rather than wait for more data to achieve a lower p-value. We can never conclude that two variations have exactly the same success rate, but the relative improvement confidence interval can give us bounds on how close they (probably) are. (Drawing such conclusions does tend to require large sample sizes.)

More precisely, a 95% confidence interval means that if we were to run many such experiments under the same conditions, about 95% of the time the confidence interval we got would include the "true" value. If this sounds reminiscent of p-values, that's because the two are closely related. When the 95% confidence interval on relative improvement has one of its limits at 0%, the p-value will equal 0.05. (This isn't exactly true in Abba because the p-value and the confidence interval are computed in slightly different ways, but it should be nearly true in most cases.)
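
To make the "many experiments" framing concrete, here's a standalone simulation sketch. For brevity it uses a plain Wald interval rather than Abba's Agresti-Coull interval (described below), but the idea is the same; the printed coverage should come out close to 0.95:

// Measure how often a 95% interval actually covers the true success rate.
function waldInterval(successes, trials, zCriticalValue) {
    var rate = successes / trials;
    var halfWidth = zCriticalValue * Math.sqrt(rate * (1 - rate) / trials);
    return {lower: rate - halfWidth, upper: rate + halfWidth};
}

var trueRate = 0.2, trials = 500, numExperiments = 10000, covered = 0;
for (var i = 0; i < numExperiments; i++) {
    var successes = 0;
    for (var j = 0; j < trials; j++) {
        if (Math.random() < trueRate) successes++;
    }
    var interval = waldInterval(successes, trials, 1.96);
    if (interval.lower <= trueRate && trueRate <= interval.upper) covered++;
}
console.log(covered / numExperiments);  // should print something close to 0.95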

The confidence intervals in Abba can generally be taken as having the confidence level given in "Interval confidence level", 95% by default. This is a safe interpretation but isn't exactly true because the confidence level is adjusted to account for multiple testing (unless you uncheck "Use multiple testing correction"). You may notice that sometimes the success rate confidence intervals clearly overlap, but the confidence interval on improvement does not include zero. This naturally happens because some of the error "disappears" when we take the difference of the two success rates. You can find more details below.

As noted in What's a p-value?, you may wish to use confidence levels higher or lower than 95% depending on your use case.

Why do all of the results change when I add a group?

The primary reason to use Abba is to avoid drawing conclusions based on observed anomalies that were actually due to chance. In an experiment where we test many groups against a single baseline, we're effectively running many experiments at once. If we rely on the usual p-values and confidence intervals, we're fooling ourselves into a false sense of security. The usual values limit the risk of a false positive result for each individual group, but the risk of at least one false positive from any group is larger. The more groups we test against the baseline, the larger this risk grows.

To avoid this problem, the p-values and confidence intervals are automatically adjusted by default based on the number of groups being compared to the baseline. The adjustment is such that the confidence level of any conclusions drawn applies across all of the groups present. More groups means higher p-values and wider confidence intervals. You can find more precise details below.
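
For a rough sense of scale (assuming, for simplicity, independent tests at a 0.05 threshold with no correction): with 5 groups compared against the baseline, the chance of at least one false positive is about \( 1 - (1 - 0.05)^5 \approx 0.23 \) rather than 0.05.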

You can uncheck "Use multiple testing correction" to disable this correction for both confidence intervals and p-values. This is not generally recommended.

What are the underlying statistics?

The confidence interval on success rate is computed using the Agresti-Coull interval (also called the adjusted Wald interval). The confidence interval on relative improvement is computed by treating each success rate as a normal random variable corresponding to the Agresti-Coull interval. The absolute improvement is then the difference of two normal random variables, so that if the two success rates are \( S_1 \sim N(p_1, \sigma_1) \) and \( S_2 \sim N(p_2, \sigma_2) \), then \[ S_2 - S_1 \sim N \left( p_2 - p_1, \sqrt{\sigma_1^2 + \sigma_2^2} \right) \] The relative improvement confidence interval simply divides the endpoints of the absolute improvement confidence interval by the baseline success rate's "mean" value.
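
Here's a minimal sketch of that computation (function and variable names are illustrative rather than the ones used in abba/stats.js, and jStat is assumed to be loaded for the normal quantile):

// Agresti-Coull ("adjusted Wald") interval: add z^2/2 successes and z^2
// trials, then build a Wald interval around the adjusted rate.
function agrestiCoull(successes, trials, zCriticalValue) {
    var adjustedTrials = trials + zCriticalValue * zCriticalValue;
    var adjustedRate = (successes + zCriticalValue * zCriticalValue / 2) / adjustedTrials;
    var standardDeviation = Math.sqrt(adjustedRate * (1 - adjustedRate) / adjustedTrials);
    return {
        mean: adjustedRate,
        standardDeviation: standardDeviation,
        lower: adjustedRate - zCriticalValue * standardDeviation,
        upper: adjustedRate + zCriticalValue * standardDeviation
    };
}

// Difference of two (approximately normal) success rates, divided by the
// baseline rate to make the improvement relative.
function relativeImprovementInterval(baseline, variation, zCriticalValue) {
    var meanDifference = variation.mean - baseline.mean;
    var standardDeviation = Math.sqrt(
        baseline.standardDeviation * baseline.standardDeviation +
        variation.standardDeviation * variation.standardDeviation);
    return {
        lower: (meanDifference - zCriticalValue * standardDeviation) / baseline.mean,
        upper: (meanDifference + zCriticalValue * standardDeviation) / baseline.mean
    };
}

// 95% intervals: the critical value is the standard normal quantile at 1 - 0.05/2.
var z = jStat.normal.inv(0.975, 0, 1);
var baseline = agrestiCoull(20, 100, z);
var variation = agrestiCoull(30, 100, z);
var improvement = relativeImprovementInterval(baseline, variation, z);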

The confidence level for all intervals starts with a base confidence level \( 1 - \alpha_\text{base} \) which is set by the "Interval confidence level" option. This is adjusted for multiple testing using the simple Bonferroni correction: \[ \alpha = \frac{\alpha_\text{base}}{N} \] for \( N \) tests.
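
In code this is just a division before looking up the critical value (again illustrative; the names are not the ones in stats.js):

// Bonferroni correction: split the base alpha across the N comparisons.
var baseAlpha = 0.05;       // an "Interval confidence level" of 95%
var numComparisons = 3;     // N groups tested against the baseline
var adjustedAlpha = baseAlpha / numComparisons;
// Two-sided critical value at the adjusted confidence level.
var zCriticalValue = jStat.normal.inv(1 - adjustedAlpha / 2, 0, 1);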

To compute the p-value, we use the difference of observed proportions \( p_2 - p_1 \) as the test statistic and test the null hypothesis \( H_0: p_1 = p_2 \) against the alternative \( H_1: p_1 \neq p_2 \). Note that we're using the observed proportions directly here, not the confidence intervals described above. We always compute a two-sided p-value, and we rely on the pooled proportion as in a pooled Z-test.
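
A sketch of that test for a single variation against the baseline (illustrative only; it leaves out the multiple testing handling described next, and assumes jStat for the normal CDF):

// Two-sided z-test on the difference of proportions, with the standard error
// computed from the pooled rate under the null hypothesis p1 = p2.
function pooledZTestPValue(baselineSuccesses, baselineTrials,
                           variationSuccesses, variationTrials) {
    var pooledRate = (baselineSuccesses + variationSuccesses) /
                     (baselineTrials + variationTrials);
    var standardError = Math.sqrt(
        pooledRate * (1 - pooledRate) * (1 / baselineTrials + 1 / variationTrials));
    if (standardError === 0) return 1;
    var observedDifference = variationSuccesses / variationTrials -
                             baselineSuccesses / baselineTrials;
    var z = Math.abs(observedDifference) / standardError;
    // Probability of a difference at least this extreme in either direction.
    return 2 * (1 - jStat.normal.cdf(z, 0, 1));
}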

To account for multiple testing on an individual basis, if we have \( N \) groups to test against the baseline, we pretend all \( N \) groups have the same data and compute the probability that any group's test statistic is at least as extreme as the observed value. This isn't entirely straightforward to compute. We could rely on Boole's inequality and use \( Np \) (as in the Bonferroni correction), but this is far too conservative to be practical. If we treat the tests as independent we could use \[ p_\text{multiple tests} = 1 - \left( 1 - p_\text{single test} \right)^N, \] (as in the Šidák correction), but because all of the tests share a single baseline group, they're not independent; they're positively correlated. Dunnett's correction is designed for this case, but it would rely on the normal approximation here, which can be poor for small sample sizes.

To get around that problem, we condition on the baseline success count \( B \) to compute partial p-values. In that situation (and treating total trial counts as nonrandom for all groups), the tests are independent and we can rely on the above formula to get an exact conditional value. We can then use the law of total probability to find the total p-value, \[ p_\text{multiple tests} = \sum_{i=0}^{n_b} \Bigl( 1 - \bigl( 1 - p_{\text{single test} | B=i} \bigr)^N \Bigr) \mathbb{P}(B=i), \] where \( n_b \) is the number of trials in the baseline group.

This scales like \( O(n_b) \) and can get very slow, so in practice we only iterate over a \( 1 - \alpha_p \) confidence interval for \( B \) and then add \( \alpha_p \) to the final result to get a conservative estimate (since \( \alpha_p \) is the total excluded baseline probability mass, the excluded values can contribute no more than that to the final p-value). This scales like \( O(\sqrt{n_b}) \) and seems to be plenty quick in practice, and Abba currently uses \( \alpha_p = 10^{-5} \) so the precision is more than good enough.

To compute the conditional p-values, we find the upper and lower success counts for the variation group that would produce just as large a difference in success rate as the one observed, given the (conditional) baseline count. We then compute the tail probabilities beyond those counts using the binomial distribution of the variation group's success count (again relying on the pooled proportion). For small sample sizes (up to 100 trials) we compute the binomial CDF directly; for larger samples we use the normal approximation.
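
Putting the last few paragraphs together, here's an illustrative sketch of the corrected p-value (the names and structure are mine rather than those of abba/stats.js, which among other things switches to the normal approximation for larger samples; jStat is assumed for the distribution functions):

// Conditional p-value for one variation, given a fixed baseline success count:
// tail probabilities of the variation's (binomial, pooled-rate) success count
// beyond the counts that match the observed difference in success rates.
function conditionalPValue(baselineCount, baselineTrials,
                           variationTrials, pooledRate, observedDifference) {
    var baselineRate = baselineCount / baselineTrials;
    var upperCount = Math.ceil(variationTrials * (baselineRate + observedDifference));
    var lowerCount = Math.floor(variationTrials * (baselineRate - observedDifference));
    var upperTail = (upperCount > variationTrials) ? 0 :
        (upperCount <= 0) ? 1 :
        1 - jStat.binomial.cdf(upperCount - 1, variationTrials, pooledRate);
    var lowerTail = (lowerCount < 0) ? 0 :
        jStat.binomial.cdf(lowerCount, variationTrials, pooledRate);
    return Math.min(1, upperTail + lowerTail);
}

// Corrected p-value: iterate over a high-confidence range of baseline counts,
// apply the independent-tests correction to each conditional p-value, and
// combine with the law of total probability, adding the excluded mass alpha_p.
function correctedPValue(baselineSuccesses, baselineTrials,
                         variationSuccesses, variationTrials, numVariations) {
    var pooledRate = (baselineSuccesses + variationSuccesses) /
                     (baselineTrials + variationTrials);
    var observedDifference = Math.abs(variationSuccesses / variationTrials -
                                      baselineSuccesses / baselineTrials);
    var alphaP = 1e-5;
    var z = jStat.normal.inv(1 - alphaP / 2, 0, 1);
    var center = baselineTrials * pooledRate;
    var halfWidth = z * Math.sqrt(baselineTrials * pooledRate * (1 - pooledRate));
    var lowest = Math.max(0, Math.floor(center - halfWidth));
    var highest = Math.min(baselineTrials, Math.ceil(center + halfWidth));
    var total = alphaP;
    for (var count = lowest; count <= highest; count++) {
        var single = conditionalPValue(count, baselineTrials, variationTrials,
                                       pooledRate, observedDifference);
        var corrected = 1 - Math.pow(1 - single, numVariations);
        total += corrected * jStat.binomial.pdf(count, baselineTrials, pooledRate);
    }
    return Math.min(1, total);
}

// e.g. correctedPValue(20, 100, 30, 100, 2)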
