At Wealthfront, we’re constantly experimenting with our products to make them better. Like other internet companies, we use a metrics-driven approach to test product changes. Each change is validated with an A/B test (hypothesis test), where we evaluate the change on a subset of clients by measuring its impact on some business metric, like the percentage of users who fund an account with us.
We use Bayesian methods for A/B testing, rather than more traditional frequentist hypothesis tests. The difference between the two approaches is this: in frequentist testing, you start with a model’s hypothesized parameters and ask how well the data fits a model with these parameters; in Bayesian testing, you start with data and ask which model parameters best fit the data. To be slightly more precise, in frequentist testing, you construct a null hypothesis, i.e. no difference between experiment and control, and an alternative hypothesis, that there is a difference. You then ask if the data provides strong enough evidence to reject the null hypothesis in favor of the alternative. In Bayesian inference, you start with a common prior model for experiment and control, and let the data inform posterior models for experiment and control separately. You then use these models to ask whether the experiment is better than control with some required level of certainty.
This blog post is not intended to explain Bayesian inference in depth. Instead, it explains what properties of Bayesian inference make it better suited than frequentist methods for our use cases. We also give a quick example of how to apply it to A/B tests with R code. When you’re ready to dive deeper, you can read Doing Bayesian Data Analysis, Bayesian Data Analysis, or The Bayesian Choice. There’s also a good Quora thread on general advantages of Bayesian methods when working with web data.
Problems with Frequentist Testing Solved by Bayesian Testing
One of the big issues with frequentist testing is that it can require a large quantity of data to reach a conclusion. For example, suppose you want to compare experiment vs. control for a website change you think will boost the percentage of users who create an account with your website. Suppose you’d be perfectly happy with a 1% lift in this rate. You’d like 95% confidence in rejecting the null hypothesis, and an 80% chance of detecting the lift if it is at least 1%. Let’s assume your baseline conversion rate is 10%. You use a calculator like Evan Miller’s, and determine that this requires 1.5 million samples. If 1,000 new people visit your site every day, it will take nearly four years to evaluate this experiment.
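To make this concrete, here is a rough version of that power calculation, using the standard normal-approximation formula for a two-proportion test (a sketch in Python; Evan Miller’s calculator makes slightly different assumptions, so its exact figure will differ a little, and `sample_size_per_group` is just our own helper name):

```python
from statistics import NormalDist
from math import sqrt

def sample_size_per_group(p_base, rel_lift, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired power
    p1, p2 = p_base, p_base * (1 + rel_lift)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return numerator / (p1 - p2) ** 2

# 10% baseline conversion, 1% relative lift, 95% confidence, 80% power
n = sample_size_per_group(0.10, 0.01)
print(round(n))  # on the order of 1.4 million per group
```

The formula makes plain why small minimum detectable lifts are so expensive: the required sample size grows with the inverse square of the effect size.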
If you raise the minimum detectable lift to 10%, you only need 14 days to evaluate the experiment. If your actual lift were 10%, but you decided upfront you’re willing to hold out for a 1% lift, you’d be stuck waiting years to resolve your experiment. At root, the issue is that “peeking” at a frequentist experiment is not allowed once the sample size is determined. If you could peek and see that the lift was higher, you could resolve the experiment much faster. Waiting for the experiment to resolve can carry significant opportunity cost for your business, especially when the difference between experiment and control is large.
Bayesian testing has no constraint on peeking. You can update your model with each new sample. This means you can resolve experiments with large differences quickly; if your experimental variation has -50% lift, then you don’t need to wait until it scares away almost 40,000 visitors to end the experiment. When the difference is small, Bayesian testing still takes a long time, but the cost of running the experiment longer is also small.
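To illustrate, here is a simulated experiment (a Python sketch with made-up traffic numbers and conversion rates) where the variation is much worse than control. Because the beta distribution is conjugate to the binomial, updating each arm’s posterior is just adding the day’s successes and failures to its parameters, and we can peek at P(variation > control) after every update:

```python
import numpy as np

rng = np.random.default_rng(42)

# Uniform Beta(1, 1) priors for both arms; the conjugate update just
# adds successes and failures to the beta parameters.
a_ctl, b_ctl = 1, 1
a_var, b_var = 1, 1

true_ctl, true_var = 0.10, 0.05  # the variation is truly much worse
stopped_day = None

for day in range(1, 31):
    # 500 visitors per arm per day (a hypothetical traffic split)
    c = rng.binomial(500, true_ctl)
    v = rng.binomial(500, true_var)
    a_ctl, b_ctl = a_ctl + c, b_ctl + 500 - c
    a_var, b_var = a_var + v, b_var + 500 - v
    # Peek: Monte Carlo estimate of P(variation > control)
    p_better = (rng.beta(a_var, b_var, 20_000)
                > rng.beta(a_ctl, b_ctl, 20_000)).mean()
    if p_better < 0.01 or p_better > 0.99:
        stopped_day = day
        break

print(f"stopped on day {stopped_day}: P(variation > control) = {p_better:.4f}")
```

With a difference this large, the posterior separates within the first few days, and we stop the experiment long before a fixed-sample-size frequentist test would have allowed us to look at the data.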
In practice, users of frequentist methods often do peek. The problem is that, without realizing it, they are violating the statistical validity of the experiment. Practicing A/B testing this way is only modestly better than relying on intuition. There is a frequentist method for peeking at fixed intervals during an experiment: raise the significance level required to resolve the experiment at each interim look. Because of that stricter significance requirement, peeking this way does not save as much as you might hope. This method is described in Group Sequential Methods in the Design and Analysis of Clinical Trials.
Finally, frequentist methods become hard to manage when we make multiple comparisons. Instead of testing a single experiment against control, we may wish to have multiple variations of an experiment, which we test in parallel. Imagine we want to test a hundred different color variations on a button to figure out which color converts the best. Imagine, for argument’s sake, that all of our users are completely color blind, and can’t see the difference. By pure chance, some of the colors will appear to perform much better than others. This is known as the multiple comparisons problem. As the number of comparisons grows, it becomes ever more likely that we reject the null hypothesis in favor of an alternative purely by chance at the common 5% significance level. A common way to correct for this in frequentist testing is the Bonferroni correction.
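The arithmetic behind the problem is simple: if each of k independent comparisons has a 5% false positive rate, the chance that at least one of them fires by chance is 1 − 0.95^k. A quick illustration in Python:

```python
alpha = 0.05
for k in (1, 10, 99):
    fwer = 1 - (1 - alpha) ** k  # family-wise error rate
    print(f"{k:3d} comparisons: P(at least one false positive) = {fwer:.3f}")
```

With 99 color-vs-control comparisons, the chance of declaring at least one spurious “winner” is over 99%.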
The problem with the Bonferroni correction is that by requiring a lower significance level, the number of samples required goes way up. Let’s reconsider the 10% lift experiment from before that would take 14 days to resolve. If we want to test 10 variations, instead of just 2, we might expect the number of samples required would go up by a factor of five, making the experiment take 70 days to resolve. Instead, it actually takes 140 days, since we now need 99.5% confidence, instead of 95%, to correct for the multiple comparisons problem.
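A quick way to see this inflation is the standard normal-approximation power formula for a two-proportion test (a Python sketch; exact figures from a calculator like Evan Miller’s will differ somewhat, and `n_per_group` is our own helper name):

```python
from statistics import NormalDist
from math import sqrt

def n_per_group(p1, p2, alpha, power=0.80):
    """Normal-approximation per-group sample size for a two-proportion z-test."""
    za = NormalDist().inv_cdf(1 - alpha / 2)
    zb = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    return ((za * sqrt(2 * p_bar * (1 - p_bar))
             + zb * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
            / (p1 - p2) ** 2)

# 10% baseline with a 10% relative lift, i.e. 10% vs. 11%
n_plain = n_per_group(0.10, 0.11, alpha=0.05)   # uncorrected test
n_bonf = n_per_group(0.10, 0.11, alpha=0.005)   # Bonferroni-style threshold
print(round(n_plain), round(n_bonf))
```

Under this approximation, each group needs roughly 70% more samples at the corrected threshold, and a 10-variation experiment has five times as many groups as a two-arm test, so the total duration balloons.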
The Bonferroni correction is conservative, and becomes punishing as the number of variations increases. In Bayesian testing, rather than slowing down the experiment, hierarchical modeling is used. While a full explanation of hierarchical modeling is beyond the scope of this blog post, the basic idea is to assume the parameters of each experiment variation are drawn from a shared underlying distribution, whose parameters are estimated as part of the inference. This has the effect of “shrinking” variations closer to the mean without requiring as many samples as the Bonferroni correction would require. You can think of shrinkage like an intelligent correction: avoiding unnecessary correction when the experimental variations have very different results. This is described well in Why We (Usually) Don’t Have to Worry About Multiple Comparisons.
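Here is a toy sketch of the shrinkage idea using empirical Bayes (in Python, with made-up data; a real hierarchical model would infer the shared prior jointly with the variation parameters via MCMC rather than fitting it by method of moments):

```python
import numpy as np

# Hypothetical conversions and trials for five button-color variations
conversions = np.array([12, 9, 15, 10, 30])
trials = np.array([200, 200, 200, 200, 200])
rates = conversions / trials

# Fit a shared Beta(a, b) prior across variations by method of moments
m, v = rates.mean(), rates.var(ddof=1)
common = m * (1 - m) / v - 1
a, b = m * common, (1 - m) * common

# Each variation's posterior mean is pulled toward the pooled mean;
# extreme outliers (like the 15% arm) are shrunk the most.
shrunk = (conversions + a) / (trials + a + b)
print(np.round(rates, 3))   # raw observed rates
print(np.round(shrunk, 3))  # shrunken estimates
```

Note how the correction is data-driven: the amount of shrinkage depends on how spread out the variations are, rather than on a blanket penalty that grows with the number of arms.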
Doing Bayesian A/B Tests
Bayesian A/B testing is easy in R thanks to a package we use called Bayesian First Aid, a thin wrapper around rjags, the R interface to JAGS (Just Another Gibbs Sampler). Gibbs sampling is a method for generating samples from a distribution, and it can be used to generate sequences of observations from Bayesian models. Doing Bayesian Data Analysis provides a good explanation of Gibbs sampling and its usage. Unless you’re doing multiple comparisons testing, you usually don’t need to know how Gibbs sampling and rjags work, just which method of Bayesian First Aid to apply.
One of the most common types of A/B tests is to measure differences in conversion rates. Conversion rates are modeled with beta distributions. The function bayes.prop.test lets you quickly compare a variation against control. Say you’ve collected 100 samples from control, and another 100 from your variation. You had 4 conversions on control, and 5 on your variation. You want to know if your variation is a significant winner. The bayes.prop.test method tells us the answer is no:
> bayes.prop.test(c(5, 4), c(100, 100))
Bayesian First Aid proportion test
data: c(5, 4) out of c(100, 100)
number of successes: 5, 4
number of trials: 100, 100
Estimated relative frequency of success [95% credible interval]:
Group 1: 0.056 [0.016, 0.10]
Group 2: 0.046 [0.013, 0.091]
Estimated group difference (Group 1 - Group 2):
0.01 [-0.054, 0.072]
The relative frequency of success is larger for Group 1 by a probability
of 0.626 and larger for Group 2 by a probability of 0.374 .
In particular, you’d want to wait for the (Group 1 – Group 2) credible interval to exclude zero before concluding your experiment. This interval shows where 95% of the posterior probability density of the difference between the two rates lies. The difference between two beta distributions has no simple closed form, so it is estimated numerically through Gibbs sampling.
What’s interesting is that we can use this distribution to answer multiple questions about the effect. For instance, we could easily evaluate the probability that Group 1 is better than Group 2 by more than 1%. For an experiment with a small effect, this might be useful for breaking a tie.
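For example, we can reproduce that kind of computation by drawing from the two beta posteriors ourselves (a sketch in Python, assuming uniform Beta(1, 1) priors, which is what Bayesian First Aid uses by default for proportion tests):

```python
import numpy as np

rng = np.random.default_rng(0)

# Posterior draws under uniform Beta(1, 1) priors:
# Group 1 saw 5/100 conversions, Group 2 saw 4/100.
g1 = rng.beta(1 + 5, 1 + 95, 100_000)
g2 = rng.beta(1 + 4, 1 + 96, 100_000)

p_better = (g1 > g2).mean()          # should land near the 0.626 reported above
p_by_1pp = (g1 - g2 > 0.01).mean()   # P(Group 1 wins by more than 1 point)
print(p_better, p_by_1pp)
```

Once you have posterior samples, any question about the effect, a threshold, an expected loss, a probability of being best, is just another function of the same draws.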
Bayesian inference allows us to resolve experiments faster than frequentist methods, by detecting meaningful differences in less time. We can test multiple comparisons without using the Bonferroni correction. Finally, because we are estimating model parameters, we can answer multiple questions about the same data without needing to run additional experiments.