From the July 28, 2014 SoMa Tech Talk series.
Abstract: As anyone with A/B testing experience can tell you, the humble A/B test is loaded with complexity and pitfalls. Seemingly basic questions of experimental design and analysis are surprisingly difficult to get a handle on, even for those with a background in statistics. How long should I run my test? Which calculator should I use? What confidence level is appropriate for me? In this talk, I’ll discuss my attempts to use Monte Carlo simulation to put these questions into a very practical context: how do various choices affect your ability to achieve a higher conversion rate when all is said and done? I’ll sprinkle in some interesting statistics and engineering tips along the way.
00:11 Thank you all for coming out tonight. I’m going to talk about understanding A/B testing through Monte Carlo simulation, two of my favorite things. I’m going to do everything in the context of conversion optimization. So, we’re gonna imagine we have some signup page, like the page on Thumbtack that new service providers come to and then sign up on to become service providers, maybe. Or maybe they leave and say, “I don’t trust this random website I found on Google.” And so, we have a page that is getting a steady stream of visitors, and each visitor converts, by signing up, or doesn’t convert. And we want to look at the proportion of visitors that sign up, and that’s our conversion rate, and we would like to maximize that over time. And we’re going to do so by running a series of experiments, where we come up with a new version of the signup page, or some variation on it, test that against the existing one we have, see which one’s better, then try another variation, and so on.
01:00 Hopefully, this is all old news. So, we’ve got two groups, a treatment and a baseline; I’m just establishing some terminology. And I like to borrow terminology from biostatistics, because I think they have the best terminology of any subfield of statistics. So, we’re gonna call it the treatment, we’re treating our visitors to a new version, and the baseline. Each visitor is going to come, and we’re going to randomly select them into one of these two groups. One sees the baseline page, one sees the treatment page. We’re going to run both of these groups, randomizing over the same time period, and only look at the data gathered during that time period where you’re running both. And then at some point, we have to say, “Okay, stop the experiment, take the data we’ve got, make a decision: which one should we keep?” Alright.
01:43 Now, when people start talking about A/B testing, they immediately start talking about distributions and P values, and all that stuff is important to understand how the mechanics work, but I like to take a step back, and think of it as a decision problem. Really, what we have to do is we’ve got this experiment we’re running and we have to decide, “When do I stop this experiment? And when I do stop, which page do I keep? Assuming I’m only going to keep one page.” This is the ultimate question we have to answer, all the other stuff about P values and distributions and null hypothesis feeds into this. Ultimately it’s a decision problem, we’re making decisions.
02:17 Now, first of all, our goal here is to maximize the final conversion rate. Start with a page, run your series of experiments, and we end up with some new page that has some new conversion rate, and we want that to be as large as possible. Now, in other contexts, maybe we care about, “Well, I don’t wanna run this experiment that could be a really bad variation over a lot of visitors, because I’m losing revenue, or I’m losing signups as long as that page is running.” So, there’s sort of this intermediate loss; we’re not gonna worry about any of that. In a biostatistics context, well, you can’t give a really questionable treatment to a bunch of people and kill them, but in website conversion optimization you can, so we’re not gonna worry about any of that. This is our simple goal: the final rate we end up with after all the experiments have run.
03:00 And here are some of the issues we need to watch out for. The first two are bias and confounding, and I’m not going to discuss those here. But if you are running any kind of experiment, you have to think about these a lot. This has more to do with how you set up your test, how you do your sampling, how you do your randomization, the blocking, etcetera. Your sample size, balance, all these things. I’m not gonna discuss those, but they’re really important. But assuming we’ve got all that taken care of, the third kind of error you can make is an error of chance. For example, you might have a bad treatment, one that’s worse than your baseline, but you choose it because, in your little experiment, it happened to do better, by chance. Or, you might have a good treatment, one that actually had a higher true conversion rate than the baseline, but in our little experiment it happened to do worse than the baseline, so we chose to keep the baseline and pass on this good treatment.
03:51 So, those are a few kinds of erroneous decisions you can make. They correspond, sort of, to Type I and Type II error in hypothesis testing, but those are formally defined concepts that aren’t exactly the same as this, so I’m not using those terms. Now, we can trade these things off against one another, generally, just like Type I and Type II error. So, for example, if our decision process was, “Always take the treatment, don’t even bother running a test,” well, we’d never have this problem of turning down a good treatment, ’cause we never turn down any treatments, but we’d potentially have a lot of the other problem. Conversely, our decision process could be, “Just reject every treatment, don’t even run it, just keep the baseline forever.” And then you never have this problem of choosing a bad treatment, but you’d have the problem of passing on good treatments a lot.
04:32 So, you can kind of trade them off against one another, but there is a way, among many ways, to make both of these problems go away, and that is to run longer and longer tests. That, of course, has its own problem, which is that there’s opportunity cost to running very long tests. Assuming we actually have some large wellspring of ideas for tests to run, every day we spend running one test is a day we don’t get to try another test, to try new ideas and look for that next big winner. So, there’s opportunity cost, and that’s a very, very real thing, in my experience at least. Very real.
05:04 So, the three of these all kind of play off against one another, and you need to balance them carefully. And that’s what we’re gonna talk about. I put this point in here just because you might say, “Well, while we’re running this one test, why can’t we go start running some other test, and just run multiple ones at once?” And you sort of can, but there are a lot of traps involved there; you have to be careful, and I’m not gonna get into that, but it’s not simple, so be careful if you want to do that. Okay, now I’m going to talk about the two different general testing frameworks that I’m going to look at. And the first is frequentist hypothesis testing, which Evan already talked about a bit. We’re going to talk about it more in the context of the whole decision-making, experimental-design process.
05:47 And the general way it works, without getting into the nitty-gritty, is we pick a significance level, and that’s basically saying, if the baseline and treatment are identical, the two pages have the same conversion rate, then we won’t be fooled into detecting a difference this percent of the time. So we’ll say, if we don’t detect a difference, we’re gonna keep the baseline, and we’ll do that this percentage of the time. Now, I think, from what I remember of Evan’s talk, the significance level is actually one minus that X there, and if it is, then we’re just talking about Steve significance, which is a new type of significance concept that I just invented. But this is what it is; not really straightforward. That’s our significance level for the next 35 minutes.
06:37 Then we need to choose a power level, which says, if the treatment actually is better, then we’ll correctly detect that difference and choose it Y percent of the time. And the problem is, in frequentist statistics, you can’t just say, “If the treatment is better.” That just doesn’t work. So instead we have to say, “If the treatment is better by this much.” If it’s 10% better, then we’ll correctly detect that and pick the treatment Y percent of the time. So that’s our effect size. So power and effect size, they go hand in hand. One cannot live without the other.
07:10 That’s not how the line goes.
07:12 Alright, so we’ve got these parameters that we choose, and then you can use those to compute a sample size, and that says, “Okay, you need this many visitors per bucket.” And then we’re gonna go and wait for that many visitors. Once we’ve got them, we take all the data we observed, plug it into this hypothesis test, which is some black box we’re not gonna talk about or peer inside, it’s gonna spit out this P value thing, and we use that to make a decision: we take the treatment or we keep the baseline. So, quick example. We picked 95% significance, according to the definition I just gave, and we want 80% power to detect a 10% relative lift; this is our effect size.
07:46 So if we say there’s a 15% baseline conversion rate based on past observations, then we’ll say, okay, we wanna detect a change of at least a 1.5% absolute increase in our treatment, with 80% power. Plug that all into some website you found, and you get 9,257 visitors in each group. You go and get that many visitors. You don’t do anything until you get that many visitors in each group. Maybe we had this many conversions on the baseline and more conversions in the treatment, so it did better. Plug it into this black box called the Chi-squared test; that’s the only one we’ll look at today. There are many other tests that you can use, and I’ll speak to that a bit at the end. We get a P value. It’s too large, so we have insufficient evidence to conclude the treatment is better, and we decide to keep the baseline. Alright.
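As a sanity check, the 9,257 figure is what the classic two-proportion z-test sample-size formula gives for these inputs. Here is a stdlib-only Python sketch; the function name and defaults are mine, not from the talk, and the website he used may compute it slightly differently:

```python
import math
from statistics import NormalDist

def sample_size_per_group(p_base, p_treat, alpha=0.05, power=0.80):
    """Classic two-proportion z-test sample size (two-sided):

    n = (z_{a/2} * sqrt(2*pbar*qbar) + z_pow * sqrt(p1*q1 + p2*q2))^2
        / (p1 - p2)^2   per group
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for 95% significance
    z_power = NormalDist().inv_cdf(power)          # ~0.84 for 80% power
    p_bar = (p_base + p_treat) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_power * math.sqrt(p_base * (1 - p_base)
                                       + p_treat * (1 - p_treat))) ** 2
    return math.ceil(numerator / (p_base - p_treat) ** 2)

# 15% baseline, detect a 1.5 point absolute (10% relative) lift at 80% power
print(sample_size_per_group(0.15, 0.165))  # → 9257, matching the talk
```

Asking for more power (or a smaller detectable lift) drives the required sample size up, which is exactly the overpowered-test tradeoff discussed later in the talk.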
08:33 Now there are different ways to do this with Bayesian statistics as an alternative method, and one way was published by this guy, Chris Stucchio, on his blog earlier this year. Very recent development, cutting edge stuff, leading edge stuff. And in Bayesian statistics, we can actually directly compute this probability that the treatment is better than the baseline. When people first come to statistics, that’s always the question they have: “What is the probability that the treatment is better than the baseline?” And in frequentist statistics, you have to spend many hours hammering into people’s heads the lesson that you can’t talk about the probability that the treatment is better than the baseline, because the treatment conversion rate and the baseline conversion rate are just unknown quantities, and maybe the treatment’s better or maybe it’s worse. We don’t know, but it either is or isn’t. It’s not random, and there’s no probability.
09:22 But in Bayesian statistics, we do this trick where we redefine what a probability is, so it’s not about the proportion of occurrences of some event over many tries, in the limit as the tries go to infinity; it’s about our belief in some unknown quantity. And that is easier to kind of think about, so we can actually compute the probability that the treatment is better. It’s a posterior probability, because it’s after we’ve observed the data: this is what I think the treatment might be, this is what I think the baseline might be, and this is how much chance I think there is that the treatment is better.
09:53 So, Chris, he derived a decision rule… Remember, the probability isn’t what we really care about. What we really care about is making decisions. He derived a decision rule, and the decision rule goes like this: We choose a threshold of caring, which says, if the conversion rates of the two groups differ by no more than this amount, like no more than 1% or something, then I don’t even care which one I choose. Just go ahead and make a mistake, I don’t care. So we have to pick that. It’s a parameter, just like the significance and the power and those other parameters. It’s kinda nice that there’s only one parameter here. That’s cool.
10:26 And then at any point, we can go ahead and take the data we’ve observed so far and compute the expected loss. So we’ve got these posterior distributions; this is actually the “Eye of Sauron” plot that you saw in Evan’s presentation, taken from Chris’ blog post, I believe, where he derived and presented this rule. So we have this distribution over the two conversion rates, and say the treatment is doing better right now, so we kinda wanna pick it. We look at all the area where the treatment is actually worse than the baseline, and use the probability density over how much worse it would be to compute an expected loss, if choosing the treatment is a mistake.
11:02 We get this expected loss, and if it’s less than our threshold of caring, just stop the test, take the treatment. And this is nice because we can do it at any point in time. No matter how many visitors we’ve observed, we can do this. If it says we can stop, we stop, if it says we can’t stop, we can run for another second or minute or day and then do this again. We can do it constantly.
11:23 And the one and only Evan Miller, who we just had the pleasure of hearing from, derived a closed-form solution to this posterior probability for the A/B testing case, which Chris used to derive a closed-form solution to this decision rule, which is great. ’Cause now we don’t have to do weird numerical integration, and it’s much easier to compute.
11:43 So, a quick example to make sure this is all clear and we’re on the same page. Our threshold of caring is 1% relative lift, so at a 10% baseline conversion rate, the threshold is a 0.1% absolute difference in conversion rate; we don’t care if it’s less than that. And then maybe we’ve got 10 out of 100 baseline conversions at some point, and 12 out of 100 treatment conversions, so the empirical conversion rate is 12% for treatment and 10% for baseline. But the expected loss, if we pick the treatment now and we’re wrong, is 1%. Too high. So we keep running the test. Later, we get up to a thousand visitors in each bucket and it’s 100 to 120. Same empirical conversion rates, but now the expected loss is much smaller, because with more data we have much more certainty; that’s what the Bayesian posterior captures. And so we go ahead and choose the treatment there.
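Chris Stucchio gives a closed form for this expected loss, but the quantity itself is easy to approximate by brute force: draw from the two Beta posteriors and average the shortfall. A rough stdlib-only sketch of the scenario above, assuming uniform Beta(1, 1) priors (the talk doesn’t say which prior was actually used):

```python
import random

def expected_loss_pick_treatment(conv_base, n_base, conv_treat, n_treat,
                                 draws=50_000, seed=0):
    """Monte Carlo estimate of E[max(p_base - p_treat, 0)] under
    independent Beta(1 + conversions, 1 + non-conversions) posteriors:
    the expected loss if we stop now and pick the treatment, in the
    worlds where the baseline was actually better."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        p_base = rng.betavariate(1 + conv_base, 1 + n_base - conv_base)
        p_treat = rng.betavariate(1 + conv_treat, 1 + n_treat - conv_treat)
        total += max(p_base - p_treat, 0.0)
    return total / draws

# 10/100 baseline vs 12/100 treatment: expected loss ≈ 0.01, too high
print(expected_loss_pick_treatment(10, 100, 12, 100))
# 100/1000 vs 120/1000: same empirical rates, far more certainty
print(expected_loss_pick_treatment(100, 1000, 120, 1000))
```

With a threshold of caring of 0.001 (0.1 points absolute), the 100-visitor case fails the check and the test keeps running, while the 1,000-visitor case comes in well under the threshold and we stop and take the treatment.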
12:34 Alternatively, maybe we get to the point where the two are just neck and neck the whole time, one never diverging… I mean, that’s basically impossible. But maybe they’re neck and neck the whole time, and then you get to 100,000 visitors in each bucket, and they both got a 10% empirical conversion rate. Well, there’s so much certainty at this point that, even though we have no reason to believe one or the other, whichever one we pick, if we’re wrong, the expected loss is small. It’s about the same value as that last example we talked about. So, we can go ahead and choose either one. So, the point of this is just that this Bayesian test will always end eventually. Even if one’s not doing better, it will just decide it’s not worth thinking about anymore.
13:11 Okay. Those are our two decision strategies, and people like to talk about different theoretical properties of these things, like power and Type I error rate and things like that. I wanted to put it in a real-world context: what are the actual consequences of choosing one rule or the other, or of setting this parameter too high or too low, in a real-world context where we’re actually running a bunch of experiments and we see where we end up? How will this actually affect where my page, where my business, ends up in a year or a month, or however many visitors down the road? So, to test that, I use Monte Carlo simulation.
13:48 So, we imagine that we have a sequence of experiments we’re gonna run, just like we’ve been talking about, on a single page and we imagine a million total visitors. That’s how many total visitors our imaginary experimenter is going to look at in his imaginary world.
14:08 Alright, our imaginary experimenter starts with an imaginary baseline page, which has a conversion rate of 10%. He comes up with a variation page, and in our simulation we randomly draw the true treatment rate, the true conversion rate of that treatment page, somewhere around the baseline rate, and that’s unobservable to our imaginary experimenter. He can’t see any of these true conversion rates. All he can see is visitors and whether or not they convert. So we start consuming imaginary visitors, and those imaginary visitors randomly convert based on the true conversion rate of the page that they saw, the group that they’re in. So, these are the two levels of random generation in the simulation: one is randomly generating the sequence of true rates of these pages that we’re trying, and two is randomly generating the visitors’ behavior. So, our imaginary experimenter runs this experiment. He gets a bunch of visitors, some of them are converting. He can observe that data, that randomly generated data, and he gets to follow one of these decision procedures that we’ve talked about, in order to decide when to stop the experiment and make a decision to keep one page.
15:11 Now, he’s choosing that page again, based just on the data he’s observed, which is randomly generated not on the true rates. So, he could make a mistake. But whatever he decides, that page’s conversion rate becomes the new baseline conversion rate. And he picks a new treatment rate. So we go back to step two, draw a new rate of conversion rate for the treatment, run another experiment, and so on, until we get to a million total visitors. Whatever we end up at, that’s our final conversion rate, that’s what we care about, the final conversion rate.
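The whole setup can be sketched as a toy simulation. This is not the speaker’s actual code (that’s on his GitHub); the jitter parameters are guesses, and the fixed-size pick-the-winner rule here is just a stand-in for the real decision procedures compared in the talk:

```python
import math
import random

def run_simulation(total_visitors=1_000_000, baseline=0.10,
                   test_size=5_000, seed=1):
    """Toy version of the talk's simulation: repeatedly draw a hidden
    treatment rate near the current baseline, run a fixed-size
    experiment on simulated visitors, and keep whichever page looked
    better empirically. Returns (final_rate, experiments_run)."""
    rng = random.Random(seed)
    rate = baseline
    visitors_left = total_visitors
    experiments = 0
    while visitors_left >= 2 * test_size:
        # Hidden true treatment rate: normal jitter on the log-odds
        # scale (mean/sd here are guesses, not the talk's values)
        log_odds = math.log(rate / (1 - rate)) + rng.gauss(-0.25, 0.3)
        treatment_rate = 1 / (1 + math.exp(-log_odds))
        # Simulate both buckets; the experimenter sees only these counts
        conv_base = sum(rng.random() < rate for _ in range(test_size))
        conv_treat = sum(rng.random() < treatment_rate for _ in range(test_size))
        if conv_treat > conv_base:   # decision from observed data only,
            rate = treatment_rate    # so it may be a mistake
        visitors_left -= 2 * test_size
        experiments += 1
    return rate, experiments

final_rate, n_experiments = run_simulation()
```

Swapping the pick-the-winner line for the Chi-squared or Bayesian expected-loss rule (and varying their parameters) is what produces the comparison charts later in the talk.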
15:33 Okay. I had this point in step two about randomly choosing a treatment rate around the baseline, and this is kind of what that looks like. If the baseline happens to be 10%, then we’ll draw somewhere between 5% and 15%, kind of based on this curve. You can see it’s centered below the baseline rate, so most treatments are worse than baseline, which kind of lines up with my general experience with experimentation, and hopefully with yours, too. I mean, if not with yours, then that’s great. You’re awesome.
16:07 And it happens to be that about 20% of the time, the treatment is better than the baseline; about 80% of the time, it’s worse. And then it’s always centered on the baseline, so if the baseline is 50%, then we draw somewhere between like 30% and 60%, and it looks kind of like this. Now, to be clear, this curve does not reflect any insight into the real world. This is a curve I just picked to run my simulation, and it’s simulating the real world, but who knows how the real world works, and it’s different for different people. So, this is just part of the way I parameterized the simulation.
16:42 If you’re interested, the way I did it was a normal distribution on log odds ratios. That is, what I pick is the log odds ratio between the treatment and the baseline, which is nice because log odds ratios are differences on this minus-infinity-to-plus-infinity scale of log odds, so you can just slap a normal distribution on it. You don’t have to worry about boundary effects or anything. So, it’s more of a mathematical convenience than anything. You could use like a beta distribution or something, but normal distributions are really nice, ’cause you just pick a mean and a spread and it’s all really easy to work with. It doesn’t have any grand theoretical significance.
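A sketch of that draw, with the caveat that the mean and spread below are my guesses, tuned so that roughly 20% of treatments beat the baseline as the talk describes; the actual parameter values aren’t given:

```python
import math
import random

def draw_treatment_rate(baseline, shift_mean=-0.25, shift_sd=0.3,
                        rng=random):
    """Draw a 'true' treatment conversion rate near the baseline by
    adding normal noise on the log-odds scale, then mapping back
    through the logistic function. shift_mean/shift_sd are assumed
    values, not the talk's."""
    log_odds = math.log(baseline / (1 - baseline)) + rng.gauss(shift_mean, shift_sd)
    return 1 / (1 + math.exp(-log_odds))  # logistic keeps the rate in (0, 1)

rng = random.Random(7)
draws = [draw_treatment_rate(0.10, rng=rng) for _ in range(100_000)]
frac_better = sum(d > 0.10 for d in draws) / len(draws)  # roughly 0.2
```

Working on the log-odds scale is what makes this safe: the normal draw can land anywhere on the real line, and the logistic map guarantees a valid rate with no boundary effects, just as described above.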
17:17 Okay. There is one more point, which is, when I first ran simulations as I just described, there was this problem where it would favor strategies that just ran really fast and loose and ran like hundreds and hundreds of experiments, ’cause the more experiments you could run, the better; just look for the big winners. And that felt really unrealistic to me, because there is a cost to running hundreds of experiments. You have to implement them and come up with ideas and design them. So, I added this fixed cost: every time we come up with a new treatment, we just lose 5,000 visitors. That represents the time it takes to implement the new experiment. And that made the results jibe with my intuition a lot more.
17:52 Alright. Now you know how the simulation was set up. Before we get to the exciting results, I’ve got some technical advice that I came across in implementing this. First of all, any time you’re implementing some kind of Monte Carlo simulation, think about your random seeds. You should always set them. Every time you run it, set it to something, and log it every time you run. If you’re setting it to different things, always log, with your results, the seed you used. And that is really good because you want your results to be reproducible. So when you’re going through your results and you find this weird, anomalous case that could not have been right, there must have been some corner-case bug, you can just set that seed, run again with the debugger on, and catch it. Otherwise, you’re in for a world of pain. Be careful when forking. If you’re doing this in Python, like I did, you’re gonna want to fork so you can use multiple cores to run your simulations four, eight times faster, whatever. And if you just naively do that, your forks keep the same random seed, and all your cores will be running exactly the same simulation, which is pretty useless. So, you know, think about forking. Reset your seeds.
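The pitfall in miniature, using plain `random.Random` objects (the same idea applies after `os.fork()` or in a multiprocessing pool):

```python
import random

# Reproducibility: the same seed always replays the same stream, so an
# anomalous run can be replayed under a debugger.
rng = random.Random(42)
run_a = [rng.random() for _ in range(5)]
rng = random.Random(42)
run_b = [rng.random() for _ in range(5)]
assert run_a == run_b

# The forking pitfall: workers that inherit one seed all produce the
# identical "independent" simulation. Give each worker its own derived
# seed instead (the naive master_seed + worker_id derivation here is
# just for illustration).
master_seed = 42
workers = [random.Random(master_seed + worker_id) for worker_id in range(4)]
streams = [[w.random() for _ in range(5)] for w in workers]
assert len({tuple(s) for s in streams}) == 4  # four distinct streams
```

For anything serious, NumPy’s `SeedSequence.spawn` is a more principled way to derive independent per-worker seeds than adding an offset.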
18:55 And finally, I won’t get into this point too much, but if you have a really large program doing really large, complex simulations, you might want to inject separate seeded random number generator objects into the different parts of your program. That way, if you kind of reorder the way these independent parts of the program work, they won’t affect each other. Otherwise, when you switch them around, now this thing is consuming the random numbers from the generator before that one is. And so you didn’t actually change how your program worked, you just refactored it, but your results for a particular seed have changed, and you don’t really want that. So you can inject separate random number generator objects into the different parts of your program. And second, here’s some Python code, which might have bugs, but it’s mostly right, to compute the closed-form posterior probability that Evan published on his blog. It’s a direct translation of his Julia code and pretty simple.
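The slide code isn’t reproduced in the transcript, but Evan Miller’s closed form for Pr(p_B > p_A) under independent Beta posteriors looks roughly like this stdlib-only version (a reconstruction, not necessarily the speaker’s exact code):

```python
import math

def log_beta(a, b):
    """log of the Beta function, via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def prob_b_beats_a(alpha_a, beta_a, alpha_b, beta_b):
    """Pr(p_B > p_A) for independent Beta(alpha_a, beta_a) and
    Beta(alpha_b, beta_b) posteriors, via Evan Miller's closed-form
    sum. Requires integer alpha_b; done in log space for stability."""
    total = 0.0
    for i in range(alpha_b):
        total += math.exp(log_beta(alpha_a + i, beta_b + beta_a)
                          - math.log(beta_b + i)
                          - log_beta(1 + i, beta_b)
                          - log_beta(alpha_a, beta_a))
    return total

# Identical posteriors: either page is equally likely to be better
print(prob_b_beats_a(1, 1, 1, 1))  # → 0.5
```

With the 10/100 vs 12/100 data from the earlier example (uniform priors, so Beta(11, 91) vs Beta(13, 89)), this gives roughly a two-thirds chance the treatment is better.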
19:50 We go over this loop, and in every iteration of this loop we do some log-betas and exponentials and add it all to a sum, and that’s fine. And Julia is a really fast language, which is great; if you’re writing in C or Go, or maybe Java, you’re okay. In Python, you’re not. Python is slow. So what we do is we vectorize it, using something like NumPy, which has vectorized operations. R is the same way. So, this is very similar code if you compare them. The logic looks very similar, but there’s no loop. That’s the key. Instead, all the i values that we were looping over, we put into an array, and then we just operate on that array. And so we sort of call each operation once, and the loop happens implicitly in the underlying C implementation in NumPy. So hopefully you’ve seen this trick before. It’s really good to do in a language like R or Python that is inherently slow.
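The vectorized version presumably looked something like the following, using `scipy.special.betaln` as a vectorized log-Beta (again a reconstruction, not the slide code; the loop index i becomes an array):

```python
import numpy as np
from scipy.special import betaln  # vectorized log of the Beta function

def prob_b_beats_a_vec(alpha_a, beta_a, alpha_b, beta_b):
    """Same closed-form sum as the loop version, but every i value goes
    into one array, so each operation is called once and the iteration
    happens inside NumPy's C implementation."""
    i = np.arange(alpha_b)
    log_terms = (betaln(alpha_a + i, beta_b + beta_a)
                 - np.log(beta_b + i)
                 - betaln(1 + i, beta_b)
                 - betaln(alpha_a, beta_a))
    return float(np.exp(log_terms).sum())
```

Note that nothing about the math changed: the array of log terms is exponentiated and summed exactly as in the Python loop, which is why the two versions agree to floating-point precision.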
20:41 Here’s a little benchmark, and as the loop gets larger and starts to dominate computation time, the vectorized version gets over 30X faster. That makes a big difference. And you probably want to profile your code and find the hot loops and then vectorize those, versus vectorizing your whole program from top to bottom, because that is likely to lead to very complicated, confusing code. Loops are nice ’cause they’re easy to understand in many cases, but they’re slow, so vectorize your hot loops.
21:08 Alright, now the big charts. Over on the left side; I’m gonna walk you through it, don’t worry. On the left side are all the different decision procedures. The top four are the Bayesian decision procedure, with different values of that threshold of caring, all expressed in relative lift: 3% relative lift, 1%, 0.2%, 0.05%. Very small threshold of caring there. Down here on the x-axis is the final rate after a million visitors, and it’s the mean final rate over a thousand simulations. So you run the simulation, there’s a million visitors, and we end up with some final rate. Then we run another simulation of a million visitors, end up with some final rate. And we do a thousand of these simulations. And this is the mean final rate that we ended up at, using that particular decision procedure.
21:53 The green bars are a confidence interval on that mean, which is just to show you, yes, I ran lots of simulations. The purple line here is that baseline rate, 10%, where we started. So, if we ended up below the purple line, we were really bad experimenters, because our conversion rate went down over time, not up. And you can see, this first one, the Bayesian test with the fairly high 3% relative threshold of caring, did badly. It ended up way worse than we started. And that’s because it was way too fast and loose. It was accepting results too quickly, and this, of course, is also a result of the fact that treatments were usually worse than baseline. I picked it that way: 80% of treatments were worse than baseline. So, if you’re too fast and loose, you’re gonna go down over time. And that’s what happens here.
22:44 By the way, all these slides are online. I’ll put the link at the end so you can come back and peruse them to your heart’s content. In the 1% case, we have a little bit tighter thresholds here, and now we’re doing a little better. Then we drop it to 0.2%, and now actually, we do quite well. We end up out around like 35% or so conversion rate. So, we made pretty good improvements over that million visitors. Once we get too low, we’ll start to see it go back down, and this is the opposite problem now, where we’re running our test for too long. We’re being a little too paranoid about making mistakes and we run our test for too long, and then we don’t get time to run enough tests.
23:25 So, we don’t have enough time to find those little improvements and keep building on them. Okay. So those were the Bayesian tests for four different values of that threshold. And then, here, I’ve got a bunch of different Chi-squared tests. And the first group is at 10% significance, so way lower significance than we would normally use. Then 25%, 50%, 75%, and then 90%, corresponding to an alpha of 10%, for those of you who are into Greek letters. And that is a pretty realistic value that a lot of people would use in conversion rate optimization, if not 95%, which is your Ronald Fisher-approved value.
24:00 And then the four lines within each of these significance groups are the relative lifts that we used to calculate our power. So here we’ve got 80% power to detect a 50% relative lift, 80% power to detect a 20% relative lift, 80% power to detect a 10% relative lift. I could have alternatively parameterized these by different amounts of power to detect the same relative lift; you can pick either one. So when we only have 80% power to detect a 50% relative lift, it’s a relatively underpowered test, because you only have power to detect a huge 50% jump. Down here, it’s a fairly highly powered test: we have 80% power to detect a quite small change. So: less power, more power.
24:46 And… But we can see… Let’s start down here, 90% significance. This one is still a little too underpowered, at a 50% relative lift for our power calculation. And when we go down to detecting a 20% change, we’re doing pretty well there. And then as we go further, we’re getting too overpowered. So again, experiments are being run for too long. They’re not making mistakes very often, but we just don’t run enough experiments; we don’t have enough time to find all those big winners. And then as we go down in our significance, we’re getting a little bit looser with our significance test. The shape is still generally similar, but as we go up, we find that we want more power, to kind of make up for that drop in significance. If you have more power, you can still have good results.
25:34 And as you can see, the best results happened to be here, at a 25% significance test, which is way lower than most people would use. But with appropriate power, it balances out. And it balances out because we’re gonna be able to run more tests; these tests are quicker. We’re gonna run more tests, and we accept a few mistakes along the way in order to keep running more experiments and find those winners. So, I’ll show many more plots to illustrate what I’m talking about. I see some confused faces; I’ll see what I can do about that.
26:05 These are the same 24 different decision procedures you just saw, but now I have the full histogram of final conversion rate results from the thousand simulations. So you can see a little bit of the spread in actual results. Whereas before we were looking at confidence intervals on the mean, which are small, here you see the actual spread in simulation results that we ended up with. So if you were one of these imaginary experimenters, you live in one of these imaginary worlds, and after a million visitors, you are just one of these points in this histogram.
26:36 And you can generally see the same thing in that… Overpowered here, running tests too long; underpowered here did very poorly; and then right here in the middle, there was kind of a sweet spot. You can see, the overpowered one almost never ended up below the baseline, ’cause it didn’t make a lot of mistakes; it just didn’t get to go that far out, ’cause it didn’t run enough tests. Whereas this one that was underpowered did end up below the baseline, which is bad. And it’s kind of the same thing on these Chi-squared tests. And we’ll pick a few to focus in on: this Bayesian one, which did well; this Bayesian one, which was a little too fast and loose and did not do well; this Chi-squared one, which had the best mean results; and this Chi-squared one with more realistic parameters of 90% significance and 80% power to detect a 10% relative change. And we’ll focus in on those.
27:33 Oh, I’m sorry, I totally forgot to explain those colored lines that I’m now taking for granted. The purple line is the 10% baseline that we started at; the green line is the mean that we saw in the last slide. You can come back and forth to this chart, or regenerate it yourself using my code on GitHub. Okay, here are those four I pointed out: the Bayesian one that did well, the Bayesian one that was a little too fast and loose (whoops, that’s the 1%), the Chi-squared one that did well, and the Chi-squared one that was a little overpowered, too slow and conservative. These are the same histograms you just saw. These two did well; these two did not do as well. This one tends to actually fall below the baseline, and can extend down toward 0%. I mean, not really at zero, but closer to it. This one doesn’t fall below 10%.
28:18 Now let's look at something you haven't seen yet: number of experiments run. This is the total number of experiments we managed to run in our million visitors in each of the simulations. And you can see that the Bayesian approach runs around this many, and the Bayesian approach that was more fast and loose, with a higher threshold for making a decision, ran more experiments. So it did run more experiments, which was nice, but it made too many mistakes, as we saw, which is why its results didn't come out well. And the Chi-squared test with fairly low significance ran more experiments than the Bayesian one, and that's where it got its edge. It made more mistakes but it ran more experiments; that's why it still tends to do well.
28:58 Speaker 3: What’s the Y axis counting for us?
29:01 It's counting the number of simulations that ended up with that many experiments run. It's a histogram over the simulations that we ran. So in 15 simulations, we ended up running 150 experiments for this one. And in 10 simulations, we ran 140 experiments with a million visitors, and so on. I realize these charts are hard to read, and you should come back and look at them for an hour on my website or on GitHub.
29:37 It's rewarding, trust me. And then this Chi-squared test that was a little too conservative ran way fewer experiments. In most simulations it ran around 34 to 50 experiments, as opposed to over 100 for the other methods. So it didn't run enough experiments to find enough big winners.
30:00 And one more graph to look at, which I'll go through quickly because it's not that critical. In the simulation, every time we made a mistake (the experimenter doesn't know that, but every time we made a mistake), we take the difference between the proportion that was better and the proportion that was worse, and that's the loss we incurred. And this is the total loss over all the errors that we made. And this is interesting because that Bayesian approach gets pretty low loss. It still managed to run almost as many experiments as that Chi-squared one in blue, but the Bayesian one in red had quite little loss. So it did pretty well, which is cool.
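The loss metric just described can be sketched in a few lines. The function name and the example data here are my own illustration, not the talk's actual code:

```python
def decision_loss(p_baseline, p_treatment, chose_treatment):
    """Regret of one decision: zero if we kept the truly better
    variant, otherwise the gap between the better and worse rates."""
    chose_better = chose_treatment == (p_treatment > p_baseline)
    return 0.0 if chose_better else abs(p_treatment - p_baseline)

# Total loss over a made-up sequence of (baseline, treatment, kept-treatment?):
decisions = [(0.10, 0.12, True),   # correct pick: no loss
             (0.10, 0.08, True),   # kept a worse treatment: lose 0.02
             (0.10, 0.11, False)]  # discarded a winner: lose 0.01
total_loss = sum(decision_loss(b, t, c) for b, t, c in decisions)
```

Summing this over every experiment in a simulation gives the total-loss histograms on the slide.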
30:42 Conversely, the Chi-squared approach that was very conservative had very little loss; it just didn't run enough experiments. Okay. And that Bayesian approach, our fast and loose friend: too much loss. All right. Let's look at something maybe a little bit easier to lay your eyes upon. This is from one of those thousand simulations, and it's showing the path of conversion rates that we end up with. So let's just start with that Bayesian approach in red. The coloring is consistent through all these slides, so that red Bayesian approach did pretty well. They all start... Yup, the Y axis could be better. But they all start near 10%. And then we ran an experiment, we decided to keep the treatment, and that was a good experiment, so we moved up to this conversion rate because that treatment was better.
31:40 And each of these dots, which I know are hard to see, is essentially an experiment that we ran, and this is the total number of visitors we've seen so far. So we're running and we got a win, and a win, and a win. Then we're keeping the baseline, so we're flat for a while, and then we made a mistake and picked a treatment which was actually worse, so our conversion rate went down, and then we got another winner, and so on. Hopefully that makes some sense. So this is the path of the conversion rate over the course of this simulation. We can see that the red line, that Bayesian approach that did well, makes pretty steady climbs, does pretty well, and ends up pretty high. The dots are pretty close to each other, so it's running fairly quick experiments. Although there are some exceptions, like right here: there's a dot there, and then the next one's way over there.
32:25 So this whole line here is one experiment that was a hundred thousand visitors, one really long one. This is a feature of the Bayesian approach: you run really quick experiments when there are decisive results, and really long experiments when the two are neck and neck. It's a feature of the method. The blue line here is the Chi-squared test that did pretty well, and it performs similarly. It consistently runs experiments quickly throughout. No experiment ever runs for a long time, because we just do the power calculation, get a sample size, and run for that long. And it does similarly well, although you can see it makes more mistakes. It goes down in quite a few places, but it ends up doing well because it runs lots of experiments and gets lots of winners.
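That fixed-horizon power calculation can be sketched with the standard normal-approximation sample-size formula for comparing two proportions. This is a sketch of the standard formula, not the talk's code, and the function name is mine:

```python
import math
from statistics import NormalDist

def sample_size_per_group(p_base, rel_lift, alpha=0.10, power=0.80):
    """Per-group sample size to detect a relative lift over p_base
    with a two-sided two-proportion z-test (normal approximation)."""
    p1, p2 = p_base, p_base * (1 + rel_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance quantile
    z_beta = NormalDist().inv_cdf(power)           # power quantile
    p_bar = (p1 + p2) / 2
    n = ((z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
          + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
         / (p2 - p1) ** 2)
    return math.ceil(n)

# 90% significance, 80% power, 10% relative lift on a 10% baseline:
n = sample_size_per_group(0.10, 0.10, alpha=0.10, power=0.80)
```

You run that many visitors through each arm, then decide; once a treatment wins, its empirical rate becomes the new `p_base` for the next calculation, as described later in the talk.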
33:05 This green line here is the Bayesian approach that was too fast and loose, and you can see it goes down a lot. It ends up okay, but it's just making a lot of mistakes, going down a lot, erroneously adopting a lot of treatments that were actually worse. And then this purple line on the other end of the spectrum is the Chi-squared approach that was too conservative, and you can see it never makes a mistake; it's just climbing steadily like the tortoise here. But the dots, if you can maybe see them, or go to the website afterwards and look at this for an hour, are pretty far apart. So it doesn't run that many experiments.
33:39 All right. Here is that same graph we just looked at, at a smaller size, alongside eight more simulations. So you can see the patterns: the red one, the good Bayesian approach, always climbs pretty steadily, pretty well. The blue one, that Chi-squared approach that did well, also climbs pretty steadily but makes more mistakes. That green one that was too fast and loose often ends up below the baseline; it often goes down because it's making a lot of mistakes. And the purple one, again, climbs steadily but very slowly, too slow. Yes?
34:12 Speaker 4: So I assume that the shape of everything changes if you change the baseline conversion rate from 10%?
34:18 Yeah, I didn't mess around with that much. But changing the baseline conversion rate probably would change some things, although I think the relative performance wouldn't change too dramatically in this particular model. And the last plot is showing the paths from all thousand simulations for each decision type. It's kind of a blurry cloud thing that makes you feel like you need to put on your glasses. So the Bayesian approach tends to end up around here. There's not a whole lot of new insight in this graph, but it's cool. And you can see the Bayesian fast and loose one often goes down. Sometimes it gets really lucky, even though it's mostly making quick decisions, and does really well, but usually not; it's pretty risky.
35:03 Okay. Before I sum this up, I need to go over the caveats, because there are many and they are important. First of all, all these results depend on the simulation parameters, like: how do you draw your treatment rate? Why is it distributed around the baseline rate? Maybe as the baseline rate gets better, it gets harder and harder to find a winner. That usually happens, and maybe that would be a better way to set up the simulation. What about the shape of that distribution? Maybe it should be wider, or further away from the baseline, or closer to it. Maybe that implementation cost of 5,000 visitors is wrong; where did I get that? Sometimes there are other costs: adopting your treatment involves a risk. You're changing things for your users, and there's more engineering cost to productionize it.
35:47 And so maybe you want some extra bias toward keeping the baseline. All of these things could and should be explored, and I did not explore them. I basically did one setup for the simulation and then looked at the results from that. A really interesting future project, I think, would be to try varying a lot of these things, run the same analysis, and see: are some of the decision procedures really robust to changes in the setup of the simulation? Do they seem to be successful regardless of the underlying real-world setup? Or is no decision procedure really robust, so you just have to make guesses about the real world you're dealing with and pick the right test? And if so, which tests are appropriate for which situations? That would be cool.
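To make that setup concrete, here is a minimal sketch of what one simulation run might look like, with a deliberately naive fixed-horizon decision rule. All names and parameter choices here are my own illustration of the structure, not the code from the talk's repo:

```python
import random

def toy_decide(rng, p_base, p_treat, n=5_000):
    """Toy fixed-horizon rule: send n visitors to each arm, keep the
    arm with more observed conversions. Returns (visitors used, keep?)."""
    conv_base = sum(rng.random() < p_base for _ in range(n))
    conv_treat = sum(rng.random() < p_treat for _ in range(n))
    return 2 * n, conv_treat > conv_base

def simulate(decide, n_visitors=1_000_000, baseline=0.10,
             sd=0.01, switch_cost=5_000, seed=0):
    """One Monte Carlo run: draw each candidate treatment's true rate
    from a normal around the current rate, pay a fixed visitor cost
    per experiment, and adopt whatever the decision procedure picks."""
    rng = random.Random(seed)
    rate, used_total = baseline, 0
    while used_total < n_visitors:
        treatment = min(1.0, max(0.0, rng.gauss(rate, sd)))
        used, keep = decide(rng, rate, treatment)
        used_total += used + switch_cost
        if keep:
            rate = treatment
    return rate

final_rate = simulate(toy_decide)
```

Repeating `simulate` a thousand times with different seeds, and swapping in different `decide` functions, gives histograms like the ones on the slides; the caveats above are all choices baked into a loop like this one.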
36:27 And the other big caveat is that there are other approaches I didn't test. On the frequentist side, there's this thing called sequential testing where you can stop the test short, even in a frequentist setting. There are different ways you can set up the Bayesian test: this one happened to use an uninformative prior, for those of you who are into that stuff, and you can have a different loss function or a precision rule. And there are bandit methods, which are cool but have slightly different goals. In any case, I didn't look at them, so one could look at other things. If this really tickles your fancy, then you should totally fork my GitHub repo and do this, and then send me a pull request, and then I will be your friend. Also, I will be your friend if you don't do that.
37:08 Alright, to sum up, some cool takeaways I enjoyed taking away from this project. One, the significance levels that we often test with, like 95% or alpha 0.05, are probably too conservative if you're optimizing something like a website conversion rate. If you're saving people's lives or testing drugs, then they might be right, and that's where a lot of this stuff came from. Or maybe it came from Ronald Fisher, and I don't even know what he was doing; I didn't have a chance to read the book where he recommended 95%. Anyone read it?
37:49 I'll have to read it myself now. So, think about lowering that, maybe a lot, I don't know, depending on how radical you want to be, but think about lowering it. And then there's this Bayesian approach, which is fairly new. It scared me at first when I read about it, because I was like, "Well, what about the Type I error rate? What about the Type II error rate? I want to control those things." And it doesn't control those things directly like frequentist testing does, and that was the frame of mind in which I thought about things. But it has this really nice property that it minimizes your loss with shorter tests than a really conservative frequentist hypothesis test. So if you really don't like the idea of having a treatment that's good and throwing it away, or adopting a bad one, if that just really eats away at you at night, then this is really nice. And that kind of does eat away at me at night, because when I have ideas, the idea of having this great idea I came up with and then throwing it away because of a bad experiment, I hate that.
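The expected-loss idea behind that Bayesian procedure can be sketched with Beta posteriors and plain Monte Carlo. This is my own illustration of the general technique, not the talk's implementation, and the counts below are made up:

```python
import random

def expected_loss_of_treatment(conv_c, n_c, conv_t, n_t,
                               n_draws=20_000, seed=1):
    """Expected loss of shipping the treatment: E[max(p_c - p_t, 0)]
    under independent Beta(1 + conversions, 1 + misses) posteriors
    (uniform priors), estimated by Monte Carlo sampling."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        p_c = rng.betavariate(1 + conv_c, 1 + n_c - conv_c)
        p_t = rng.betavariate(1 + conv_t, 1 + n_t - conv_t)
        total += max(p_c - p_t, 0.0)
    return total / n_draws

# 100/1000 conversions on control vs 130/1000 on treatment: shipping
# the treatment risks very little expected conversion rate, so a rule
# with a "threshold of caring" of, say, 0.001 would stop and ship here.
loss = expected_loss_of_treatment(100, 1000, 130, 1000)
```

The stopping rule is then: keep collecting data until the expected loss of one choice drops below a threshold you genuinely don't care about, which is why decisive tests finish fast and neck-and-neck tests run long.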
38:38 And then lastly, these different methods, with parameter values that all seem reasonable, had wildly different outcomes after a million visitors. It made a huge difference: you could end up at like a 15% or a 35% conversion rate after a million visitors. So you should think about this stuff if you want your company, or foundation, or personal project, or whatever you're doing, to be successful. And that is everything. You can go to that GitHub and find all the code that I used to generate all of this, and there's a link to the slides, which you can look at for now.
39:18 Yeah, so the question is: in the frequentist method, you pick your significance and your power and your lift for the power calculation, so shouldn't the distance between dots always be the same? You're pointing out that it's a little wider down here for the blue line, say, and a little closer together up there?
39:34 S?: Yeah.
39:37 That's because of a property of the binomial distribution: as the proportion gets closer to 0.5, the variance relative to the proportion you're at gets smaller, so it gets easier to run experiments and detect changes. So you can run shorter tests with all the other parameters, the significance and the power and everything, being the same.
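A two-line way to see that property; `relative_se` is just my name for the standard error of a sample proportion divided by the proportion itself:

```python
import math

def relative_se(p, n):
    """Standard error of a sample proportion, relative to p:
    sqrt(p(1-p)/n) / p. Smaller means relative changes in the
    conversion rate are easier to detect at the same sample size."""
    return math.sqrt(p * (1 - p) / n) / p

# Same 1,000 visitors, very different relative precision:
near_half = relative_se(0.50, 1000)  # roughly 3% relative error
near_zero = relative_se(0.01, 1000)  # roughly 31% relative error
```

This is the same effect Evan Miller's post (mentioned below) describes for rates near zero.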
40:01 S?: So, you ____ all achieved that conversion rate?
40:04 What's that? Yes. So, of course we have to do a power calculation knowing the proportion, and in my simulation the experimenter can't know the true proportion, so you look at the empirical proportion from the last experiment.
40:17 S?: Ah, okay.
40:17 So when a particular version won, we took the empirical proportion from that test and used it in our power calculations henceforth, which is what you'd have to do in the real world. The corollary to what I just said is about when you're very close to a 0% or 100% conversion rate. If you're close to 100% conversion rate, you're good, but if you're very close to 0% conversion rate, it gets very, very hard to detect changes, and this guy I know, Evan Miller, has a blog post about that, and it's great.
40:52 So the question is, "How well does the simulation jibe with what I have observed in my experience running tests, like wanting to cut off a test early, and what about bandit methods?" I think bandit methods are cool, but I've never taken the time to actually learn how they work, so I've never taken the time to consider actually using them here. And I think this Bayesian approach is really cool, too, and I'd certainly want to think about it more, maybe integrate it into the little proprietary experimentation framework that we have.
41:24 Wanting to stop tests early definitely happens. It's irresistible a lot of the time, and you can tell people not to peek, but they do, and that's just not going to change; like the sun rises every day, people are gonna peek at their test results too early. So you need some kind of system for dealing with that. It could be sequential testing in a frequentist context, it could be this Bayesian thing, but you really have to have some system for it, because you're not going to be able to stop people. And it is really important to have a system up front and then be disciplined about it, because the number of times I have seen a test get off to a big lead and then fall apart is so large.
42:03 It's kind of what Evan alluded to at the beginning of his talk: humans want to see patterns. You crafted this variation and you designed this website and it's so good. It's definitely a winner, I know it in my heart. And then it gets off to an early lead and you're like, "Great! Let's call it. I want to put this thing out there for 100% of visitors." And the number of times I've seen that fall apart by keeping the test running is just enough that I see the value of this. So a way of dealing with stopping early, I think, would be really good. And maybe this idea of running a test longer, like if the two are really close but you think it's really important and you want to run the test longer, having a way of dealing with that would be nice, too. So I think this Bayesian approach is pretty cool. Does that answer the question?
42:48 S?: Yeah, yeah. Yes, sir.
42:51 So, the question, I think, is… You’re developing a system to deal with average transaction value…
42:57 S?: Exactly.
42:58 And how would you apply these Bayesian methods? So in this context, A/B testing on conversion rates is nice because the rates are so simple. It's just two numbers, and it's a binomial distribution and all that stuff. Whereas when you're dealing with something like transaction values, you have real-valued numbers coming in, and they could be normal or they could be some other distribution. And what are they? Well, a transaction value would be non-negative. So I don't know, [43:27] ____ maybe.
43:33 And I don't know, because I really don't understand Bayesian statistics; I read about this on a blog. I understand it probably just enough to be dangerous in a bad way. So I would go spend some time learning about it if you're into it, and it will pay dividends. Or ask that guy, maybe; corner him after the talk. Yeah. This is why it pays to... People always say, "Well, look, if there are these statistical methods published and there are tools that do it, why do you have to understand the statistics? You just have to know how to use the tools." And the truth is that the tools apply until they don't, and then you get some case where the tools don't apply and you have to actually understand the theory so you can adapt it to your situation. That's why you should all go study statistics.
44:20 Yeah. So the question is, have we thought about something like simulated annealing, where you start with maybe a wider acceptance range and then narrow it over time?
44:28 S?: Yeah. Gradually…
44:30 Now I'm going to give my impression of things that I don't actually understand, which is a terrible thing to do at a microphone in front of people. But I believe that's similar to how sequential frequentist testing works. You can start by saying, basically, if there are really decisive early results, I'll accept them, but it has to be by a very wide margin; so I'm assuming very low power at that point. And then I'll run the test longer and narrow my bounds. You don't get to narrow it to the point where you end up doing the same test you would've originally done, because that's cheating and there's no way around it. But there's another blog post I read by some guy, and he had this way of having bounds at every step (it was a frequentist context, I believe) that changed over time, so you end up with the same error rates.
45:20 But your decision bounds were narrowing over time, I believe. So there are ways of doing that, and I don't fully understand them. And the question is: each of these dots represents a new baseline you've adopted, and there are a lot of dots. Does that mean you're running all these tests when you can only change colors so many times and tweak the copy on your CTAs so many times? What are they actually changing in all these experiments? Great question. I should first stress, just in case anyone's confused, that this is not real data from real experiments. It's just a simulation, so I didn't actually change any CTA or copy or anything for this graph. It's a random simulation modeling the real world, and I abstracted away all those changes of copy and everything by just randomly drawing a treatment rate from this distribution.
46:08 And then the question is, what was the real effect of all these changes you made? And we should assume that our imaginary experimenter has an infinite source of ideas to test, with the caveat that implementing each change costs 5,000 visitors. Now, running 150 experiments on a single page is a lot, and maybe too many to be realistic. I mean, I believe you can get very creative if you really care, and if you're doing something like copy testing in display ads, you can get really creative and test a zillion things. But it is maybe unrealistic to run hundreds of experiments, and maybe you want to constrain the simulation so you just have 10 or 20, if that's realistic for you, and then see which decision procedures perform better. So that's again a case where you could vary the parameters of the simulation to be what you believe is more realistic for your context and then see what happens. You should go fork the repo on GitHub and do that.
47:02 S?: Is that typically the way that you do [47:03] ____ 10s to 20s on each…
47:06 Oh, for pages at Thumbtack? We've had hundreds of experiments run on them over the years. You really can't run too many experiments. In a case like this, all the stuff about picking the right decision procedure is small compared to just continually coming up with ideas, running experiments, and testing things. Now, there's also this point: if I change this button to blue and then I change this button to red, as opposed to green, these small changes are going to be limited in how much impact they can have. So if you test 10,000 different RGB combinations on your button, you're not going to find some perfect shade of blue out there; they're gonna be correlated with each other.
47:58 So another assumption here is that all those treatments are independent, and they never are, because people are always running experiments that are related to each other, and that introduces correlation. But these are all valid criticisms of the simulation, which is a very simple idealization of the real world. Did you have a [48:14] ____ related to that? What about running simultaneous tests that relate to each other? I think if you're running tests that are just one of five choices, like one of five colors or one of five versions of copy on this one button, that's fine. Just make sure you correct for multiple testing in your results. Go read about what that means if you don't know, or go to thumbtack.com/labs/popup for a tool that corrects for it and explains it a little bit. There might be other caveats of multiple testing, but that's beyond the scope of this talk.
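The simplest such correction is Bonferroni: divide your significance level by the number of simultaneous comparisons. A minimal sketch of that idea (the Thumbtack tool mentioned above may well use a different method):

```python
def bonferroni(p_values, alpha=0.05):
    """Bonferroni correction: with m simultaneous tests, reject only
    p-values below alpha / m, which keeps the family-wise error rate
    at or below alpha."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

# Five variants each tested against one control:
significant = bonferroni([0.004, 0.030, 0.200, 0.012, 0.600])
```

Note that 0.030 and 0.012 would each clear a naive 0.05 cutoff on their own, but not the corrected 0.01 threshold.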
48:48 The question was, do you have any recommendations for learning [48:52] ____ at a fundamental level? When I joined Thumbtack four years ago, I really knew very little, and I could see that we would need someone who knew this stuff. And I got this book called "A Modern Introduction to Probability and Statistics" by Dekking (D-E-K-K-I-N-G) et al. That is a very basic intro: here are events and distributions and what a hypothesis test is. But it was a really good foundation, I thought, and very accessible for self-study. So you could start there. And then "Statistics for Experimenters" by Box and Hunter, I think, is a really cool book about experimental design. It gets into more advanced stuff you don't really need for A/B testing, but I liked it. And what do you think, Evan?
49:37 Speaker 5: I have some of my old college textbooks, the groups [49:42] ____.
49:43 Okay. He has old college textbooks.
49:45 S5: I mean, I don’t have any strong recommendations here.
49:47 He doesn't have any strong recommendations here. He didn't like them. His blog has great posts; the ones on A/B testing, you should definitely read those. Seminal, seminal work.
50:07 No, they are really good.
50:10 S?: You could stare at them for an hour.
50:12 Yeah. No, you could. I mean, the language is beautiful. Any other questions? One more? Yeah. The question was about those experiments where they increased the temperature and found that improved productivity, then lowered the temperature and that also improved productivity, then they changed the lighting, and eventually they figured out that just making changes improved productivity for people; it wasn't about finding the actual parameter values. And have I observed that, and would it be worth running a meta-experiment to see if it's true?
50:48 That is a good point, and it gets into all these underlying issues with A/B testing, where you're operating on assumptions that may not be true. A lot of the experiments I've run have been on marketing, really, like AdWords and stuff, where you're getting a lot of exposure to new people all the time, so you can kind of ignore that, which is really nice. And then there are experiments we've run over our service providers, which... It sounds bad when I say we've run experiments over our service providers, but I mean on pages that our service providers use, and that's more of a fixed population. And I can certainly say I've seen this effect where, when you make a fairly dramatic change to a page's design or something, there's this initial dip and then it kind of comes back to some maybe better level, because when you have a community of users who are accustomed to an interface and you just change it, they're like, "Oh, what do I do? I don't know where the button I used to use went." And then they kind of get used to it.
51:40 And so if you're testing over a population of long-term repeat users, you certainly have to take that effect into account if you think it may be happening. It's very context-specific. I haven't done any meta-experiments like changing things back and forth to see if the change itself has an effect. That would be cool. I don't usually have the luxury of time to do those things, I guess. It would be cool. Okay. Thank you so much to everyone for coming out tonight.