We run a lot of A/B tests at Thumbtack, and at our scale we want to make sure we run them correctly. One issue we’ve run into is that a difference can exist between the test and control groups by chance alone, even when we randomize. This causes uncertainty in our online A/B tests. In these cases, the question we need to answer is: if we observe a difference, is it because of the test feature we just introduced, or because the difference is pre-existing?
We propose that for online designed experiments, a proper randomization procedure enables us to attribute an observed difference to the test feature, instead of to a pre-existing difference in the test groups. Implementing this approach has given us greater confidence in our A/B test results than we had previously. We have drawn from the PhD thesis of Kari Frazer Lock in developing this approach to our A/B tests.
To illustrate the problem, let’s consider an example. Suppose our engineering team has decided to play a friendly game of tug-of-war after eating their favorite superfoods. To test out which kind of food helps people win, we randomly assign them into two groups. Team A channels their inner Popeye and eats spinach salad, and team B decides to chow down on their superfood of choice: kale. Team A wins and claims that spinach induces superior strength. Is that so?
We all have different heights, weights, inherent strength, etc. Suppose that by chance alone, team B members have an average weight that is 15 pounds lighter than team A, and all the folks over 6 feet tall ended up in team A. Plus, team A got 18 engineers while team B got 17. Adding up all the differences, team A ended up with a disproportionate advantage!
Spinach, that wasn't fair!
Variation in Test Units
In online A/B experiments, the test unit is usually a user or visitor. Users differ greatly from one another: some visit very often, some have faster internet, some use mobile exclusively, etc. When each user interacts repeatedly with our site, we can proactively seek to balance out users based on historical data.
Randomization alone is not enough
Even if we randomize the initial assignment of users between test and control groups, a difference could still exist between test and control groups that is due not to the test feature but to chance alone. This causes uncertainty in our online A/B tests. If we observe a difference, is it because of the test feature we just introduced, or because the difference is pre-existing?
If we have baseline characteristics for test subjects, we can try to balance the test and control groups on these characteristics before we run the experiment so any observed difference can be attributed to the test feature.
Obviously, this is only feasible when we have some information on the test subjects. When Thumbtack was relatively new and most observed interaction between users and the product came from new users, this step could not be done. Now that Thumbtack gets a lot of repeat visits, we can strive to balance out experiments in a way we weren’t able to do previously and thus get more accurate measures which give us more confidence in our conclusions.
The chance that at least one of the test groups has different baseline characteristics rises as variation increases among test units and as the number of samples in each group declines (e.g. when we test multiple variants in a single experiment).
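This effect also compounds across metrics: if we check several independent baseline metrics at significance level 0.05, the chance that at least one shows a spurious imbalance grows quickly. A minimal sketch of that arithmetic, assuming the metrics are independent (real metrics are often correlated, which changes the numbers):

```python
def family_wise_error(k, alpha=0.05):
    """Probability that at least one of k independent baseline metrics
    spuriously "fails" an A/A check at significance level alpha."""
    return 1 - (1 - alpha) ** k

for k in (1, 3, 10):
    print(k, round(family_wise_error(k), 3))  # 1: 0.05, 3: 0.143, 10: 0.401
```

With ten baseline metrics, roughly two out of five experiments would show at least one pre-existing "difference" before any feature is introduced.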
The Solution V1: A/A Tests
One natural solution is to run an A/A test on the test groups before the test feature is introduced. For example, if we will roll out a feature next month, we can assign the users according to some rule into two groups and measure their metrics on last month's data. In the month before our test, the test and control groups should have no pre-existing difference, so the comparison is an "A/A" rather than an "A/B" test. Using our tug-of-war example, an "A/A" test would be a game with both teams on the same diet.
Historically this is how randomized experiments are done in the biomedical field. In any published paper on such studies, the very first section establishes that there is no existing imbalance (the famous "table 1"). If any imbalances do exist, they can still be accounted for in downstream statistical analysis.
For any web-facing company running online A/B tests, A/A tests are a relatively cheap step in the right direction. When an A/A test shows a pre-existing imbalance, i.e. the test "fails", we should take caution in interpreting the A/B test result. Depending on the severity of the imbalance, we can choose to ignore it, statistically adjust for it, or re-run the experiment.
But isn’t there a better way than waiting to see whether an “A/A” test fails?
The Solution V2: Repeated Re-randomization
In online A/B tests, test units are usually assigned according to their id, and a random seed ensures each experiment uses a different randomization. Typically a hash function, say SHA1, takes the seed and user id and turns them into an integer; these integers are then split into test groups.
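A minimal sketch of that assignment scheme (the function name and seed format here are illustrative, not Thumbtack's actual code):

```python
import hashlib

def assign_bucket(user_id: str, seed: str, num_variants: int = 2) -> int:
    """Deterministically map (seed, user_id) to a test group.

    SHA-1 of the seed concatenated with the user id yields a large
    integer; taking it modulo the number of variants splits users
    into groups. A new seed produces a fresh randomization.
    """
    digest = hashlib.sha1(f"{seed}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % num_variants

# Same seed and user always land in the same bucket.
assert assign_bucket("user42", "exp-1") == assign_bucket("user42", "exp-1")
```

Because the mapping is deterministic given the seed, we can re-evaluate the same split on historical data as many times as we like.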
We can repeatedly compute A/A tests results until we have found a split where all A/A tests are flat. This step can greatly reduce the chance of a failed A/A test run on the pre-experiment period.
It turns out to be quite simple in theory!

1. Randomly select a seed and randomize test subjects by this seed.
2. Run the A/A test on all metrics of interest.
3. If the A/A test fails on any metric, discard the seed and go back to step 1.
This way we will end up with a seed that balances the test subjects. The procedure can go through anywhere from tens to thousands of seeds before finding a balanced one, depending on how many baseline characteristics we want to balance on and on the amount of variation among subjects.
In theory it is possible for our historical metrics to be correlated in a way that makes the repeated procedure take unreasonably long to find a proper assignment. In practice, we keep an upper bound on the number of trials, M, and we track every seed along with its minimum p-value across all the baseline characteristics. If the procedure fails to find a seed that sufficiently balances the treatment variants within M steps to the pre-specified thresholds, we return the best result for a human to judge. This way we guarantee the procedure has a stop point.
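The bounded search can be sketched as follows. Here `aa_test_pvalues` is a hypothetical caller-supplied function, standing in for whatever computes one A/A p-value per baseline metric for the split induced by a seed:

```python
import random

def find_balanced_seed(aa_test_pvalues, threshold=0.05, max_trials=1000):
    """Search for a seed whose A/A split balances all baseline metrics.

    A seed is accepted when every metric's p-value exceeds `threshold`
    (i.e. no A/A test fails). If no seed passes within `max_trials`
    attempts, return the best seed seen, i.e. the one whose smallest
    p-value is largest, for a human to judge.
    """
    best_seed, best_min_p = None, -1.0
    for _ in range(max_trials):
        seed = random.getrandbits(64)
        min_p = min(aa_test_pvalues(seed))
        if min_p > threshold:
            return seed, min_p, True      # balanced seed found
        if min_p > best_min_p:
            best_seed, best_min_p = seed, min_p
    return best_seed, best_min_p, False   # best effort after max_trials
```

The boolean flag makes the "found vs. best effort" distinction explicit, so downstream tooling can route the best-effort case to a person.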
What does the human judge do? Based on domain knowledge and business priorities, this human (a data scientist at Thumbtack) can choose to re-run the procedure, decide that the best of the M results is good enough, or further scrutinize metric computation and selection via offline analysis.

Waterproof solution?
But does this procedure guarantee perfect balance in our test and control group every time?
No. It only minimizes imbalance as best we can. The remaining reasons come down to randomness: observations from random variables have an inherent random nature, for both endogenous and exogenous reasons. For example:
- Existing users can change their behavior, independent of our test feature.
- Within-company changes, e.g. an ad campaign could start and affect regions in only one of the test groups.
- Another team could start an experiment in the next month that inadvertently and partially overlaps with our experiment.
- A subset of users may show strong seasonal differences, e.g. demand for snow plowing vs. yard work.
- Externalities, e.g. a competitor that targets a certain segment could show up and affect the whole marketplace.
- New users may sign up during the test period. We cannot balance new visitors; for them we can rely only on randomization, and thus imbalance could occur.
The solution we have developed is not waterproof. However, implementing it in our A/B tests has given us greater confidence in our A/B test results than we had previously.
Empirical Results via Simulation
To illustrate, we simulate three metrics X, Y and Z, measured over N users, from a multivariate normal distribution with pre-specified mean and variance-covariance structure, randomly split the users into two groups, and test for a difference, i.e. perform “A/A” tests. In the following examples, we assume a total of 50,000 users and two equal-sized variants; we simulate 100 rounds and count the number of false positives for each metric.
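A minimal sketch of one such simulation, simplified to a single normal metric and using numpy (the distribution parameters here are arbitrary, not the ones used in our actual simulations):

```python
from math import erf, sqrt

import numpy as np

rng = np.random.default_rng(0)
n_users, n_rounds, alpha = 50_000, 100, 0.05

def z_test_pvalue(a, b):
    """Two-sided two-sample z-test p-value (normal approximation,
    reasonable at this sample size)."""
    se = sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    z = (a.mean() - b.mean()) / se
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

false_positives = 0
for _ in range(n_rounds):
    metric = rng.normal(10, 2, size=n_users)           # one metric, e.g. X
    in_test = rng.permutation(n_users) < n_users // 2  # random 50/50 split
    if z_test_pvalue(metric[in_test], metric[~in_test]) < alpha:
        false_positives += 1

print(false_positives)  # around 5 of the 100 rounds, by chance alone
```

Each round draws a fresh population and a fresh split, so any "significant" difference is a false positive by construction.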
Case 1: independent normal
When X, Y, and Z are independent, we expect p-values from the “A/A” test to follow a uniform (0,1) distribution. It is then trivial to compute the expected number of false positives when we repeat the simulation K=100 times, i.e. roughly 5 significant results in X, 5 in Y, and 5 in Z. Indeed, we observe 2 in X, 6 in Y, and 4 in Z.
The delta and its 95% confidence interval clearly show that, after re-randomization, the “A/A” tests produce much better balanced groups.
Case 2: independent metrics, one log normal
Of course, rarely are we so lucky to have all normally distributed metrics. So now, let’s change things up by making Y follow a log-normal distribution. Similarly, in terms of false positives before re-randomization, there were 5 in X, 6 in Y and 6 in Z. After re-randomization, everything is well balanced.
Case 3: independent discrete values
What if our metric value is discrete? Let’s check by making Z a discrete variable. Before re-randomization, there were 6 false positives in X, 3 in Y and 6 in Z.
Case 4: Correlated metrics
Finally, we investigate correlation between metrics X, Y and Z. It is trivial to derive the expected number of false positives; we leave that, along with why it is OK to use a z-test in all of the above situations, as an exercise for the reader. Here, as an arbitrary choice, X and Y are moderately positively correlated (correlation coefficient 0.5), X and Z are mildly positively correlated (0.2), while Z and Y are mildly negatively correlated (-0.1). With positively correlated X and Y, they had 11 and 6 false positives respectively, while Z had 6.
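Sampling the correlated case is a one-liner with numpy's multivariate normal; the sketch below uses the correlation structure stated above, with unit variances assumed for simplicity:

```python
import numpy as np

rng = np.random.default_rng(1)
# Correlations from the text: corr(X,Y)=0.5, corr(X,Z)=0.2, corr(Y,Z)=-0.1.
# With unit variances (an assumption), the correlation matrix doubles as
# the covariance matrix.
corr = np.array([[ 1.0,  0.5,  0.2],
                 [ 0.5,  1.0, -0.1],
                 [ 0.2, -0.1,  1.0]])
xyz = rng.multivariate_normal(mean=[0.0, 0.0, 0.0], cov=corr, size=50_000)
# Sample correlations should land close to the targets at this n.
print(np.round(np.corrcoef(xyz, rowvar=False), 2))
```

With correlated metrics, false positives tend to co-occur across the correlated pair, which is consistent with X and Y failing together more often than Z.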
In all of the cases above, it is clear that such a procedure improves balance, and can help us draw better inference in subsequent A/B tests.
This year, we sent 20 members of the Thumbtack team to PyCon in Montreal. We all had a great time, learned lots, and really made a name for ourselves. By the end of the conference, everyone knew who we were and that Thumbtack enables you to get your personal projects done.
We also had great swag: a comfy t-shirt, sunglasses, and a beer glass. However, unlike most other booths, we didn’t give it away for free. We wanted the PyCon attendees to work for it! For the third year in a row, we created a code challenge that engineers had to correctly solve in Python to receive anything. At first, submissions slowly trickled in, but by the end of the conference, people were really excited to solve our problem. Some people didn’t even talk to us, just walked to our booth, picked up the challenge sheet, and walked away. In total, we got 87 submissions! And now, the beer our winners drink out of those glasses will taste a little sweeter because it’s flavored with sweet, sweet victory.
When I was little, my family went to our town’s district math night. We came back with a game that we still play as a family. The game is called Inspiration. It’s played with a normal deck of cards, with the picture cards taken out. Everyone gets four cards and one card is turned face up for everyone to see. You then have to mathematically combine your four cards with addition, subtraction, multiplication, and division to get the center card. The person who does it the fastest wins.
This year, our challenge was inspired by Inspiration, no pun intended. The first part asked people to write a Python program that takes in four numbers and determines the mathematical expression that can combine the first three numbers to get the fourth. If they could solve this, they were awarded a t-shirt and sunglasses. The harder challenge was to solve the same problem, but with an arbitrary number of inputs. The number to solve for was always the last number in the string, but the total number of operands was not constant. These solvers won the coveted Thumbtack beer glass.
Hall of Fame
Most of the solutions had some commonalities: they used brute force, and they used Python’s built-in library itertools to create permutations of the numbers and combinations with replacement of the operators. The following solutions were my favorites:
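A brute-force solver along those lines might look like the following sketch. It evaluates strictly left to right, which covers many but not all groupings, and it is illustrative only, not any winner's actual submission:

```python
import operator
from itertools import permutations, product

OPS = {'+': operator.add, '-': operator.sub,
       '*': operator.mul, '/': operator.truediv}

def solve(numbers):
    """Find an expression combining all but the last number (evaluated
    left to right) that equals the last number; return None if none exists."""
    *operands, target = numbers
    for nums in permutations(operands):
        for ops in product(OPS, repeat=len(nums) - 1):
            total, expr = nums[0], str(nums[0])
            try:
                for op, n in zip(ops, nums[1:]):
                    total = OPS[op](total, n)
                    expr = f"({expr} {op} {n})"
            except ZeroDivisionError:
                continue
            if abs(total - target) < 1e-9:
                return expr
    return None

print(solve([2, 3, 4, 20]))  # "((2 + 3) * 4)"
```

Because the operand count is read from the input, the same function handles both the four-number challenge and the arbitrary-length version.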
Greg Toombs had the shortest solution, with only 19 lines of code. You can find Greg on LinkedIn.
Robbie Robinson had one of the cleanest solutions. You can find Robbie on LinkedIn.
Thanks to everyone who submitted a solution! Can’t wait for PyCon next year!
We recently added automatic dependency injection to the PHP codebase powering our website. As we’ve said in the past, dependency injection is a good move for a lot of reasons. It leads to clearer, easier to understand code that is more honest about what it depends upon. Automatic dependency injection reduces boilerplate code to construct objects. And of course, it makes code easier to test.
But it had another benefit we weren't expecting. It made our pages load a few milliseconds faster.
Some of our dependencies are slow to construct because they need to read a settings file or because they use a library that instantiates a ton of objects. Our dependency injection framework allows dependencies to be constructed lazily, which (for most requests) means never constructing them at all. Better, faster code — what's not to love?
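The idea of lazy construction can be sketched in a few lines; the sketch below is in Python for illustration (ttinjector itself is PHP, and its actual API differs):

```python
class LazyProvider:
    """Defer construction of an expensive dependency until first use.

    The factory runs only when the dependency is first requested, so
    requests that never touch it pay no construction cost at all.
    """
    def __init__(self, factory):
        self._factory = factory
        self._instance = None

    def get(self):
        if self._instance is None:
            self._instance = self._factory()
        return self._instance

calls = []
settings = LazyProvider(lambda: calls.append("built") or {"db": "..."})
assert calls == []        # nothing constructed yet
settings.get()
settings.get()
assert calls == ["built"] # the factory ran exactly once
```

The same pattern underlies most lazy DI containers: the container hands out providers, and construction happens on first `get`, then the instance is cached.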
Ready to try out dependency injection for yourself? We made our library, ttinjector, public for all to use.
Mark hiking the John Muir Trail
We're excited to introduce Mark Schaaf, the newest member of our team.
Mark is joining us as Thumbtack's VP of Engineering. He previously was a Senior Engineering Director at Google, where he ran the mobile display ads engineering team and later led consumer and merchant payments engineering. Before that, he was the 2nd engineering hire and Senior Director of Engineering at AdMob, which was acquired by Google in 2010.
In his free time Mark tells us that he likes to get outside, hiking and backpacking the Sierra Nevada. Whenever the weather cooperates, he enjoys snowcamping.
There have been countless articles written recently about “Women in Tech”. Yet, when I think back to the month I spent interviewing before deciding to join Thumbtack, I don’t remember being at all concerned about vetting any of those companies for female-friendliness. Luckily, I ended up in a great environment, but I know other women who haven’t been so fortunate. That got me thinking about some questions that perhaps I should have asked upfront to ensure that I was making the right culture fit.
Questions About Work
Ask This: How often do people ask questions? How do people ask questions?
The right answer to this question should be “all the time.” Women, as compared to men, tend to suffer more from impostor syndrome. We believe that our success is a result of luck, timing, or deception rather than our own intelligence or competence. This can make it difficult for us to ask for help, for fear of being discovered to be an impostor. It is easier to ask for help when a supportive and humble culture is already established, where engineers are constantly asking for and receiving help from each other.
Ask This: What practices do you have in place to ensure high quality code and continued learning?
Processes for reviewing code and a culture of continued learning can be additional indicators of humility.
Specifically, look for engineering teams that:
- Pair program: It doesn’t have to be required or happen all the time, but teams whose engineers pair with each other even a couple of times a week are likely to be teams who value collaboration. Because engineering can sometimes be an isolating profession for women, this type of collaborative environment can be great for female engineers.
- Participate in code review: A great follow-up question here is “Why are code reviews valuable to you?” Bonus points go to the company whose engineer responds that not only do code reviews help ensure high quality code in the codebase, but they also create more opportunities for engineers to learn from each other and learn about different parts of the codebase.
- Take online classes together, read, provide an education stipend: A culture where engineers are continually learning can help women rid themselves of impostor syndrome by reminding us that everyone is still learning and no one knows everything.
Additionally, pay attention to the tone with which your interviewer speaks about these topics. If a company encourages pair programming, but your interviewer doesn’t recognize the benefits, this is a red flag.
Questions About People
Ask This: Are there any women on the team? If so, what positions do they hold?
If the answer is no, it isn’t necessarily a red flag. Many teams want to hire more women, but there aren’t enough of us out there. In this case, you could follow up with “Is it important to you to have a diverse team? Why or why not?”
If the answer is yes, however, you have a great opportunity to speak with someone directly about what it’s like to be a female engineer at that company. Ask for their contact information so you can reach out to them if you don’t meet them during your interview. By the way, if you’re interviewing here at Thumbtack, I’d love to meet you!
Ask This: Are any engineers involved in programs aimed at supporting women in the industry? (e.g. PyLadies, Women Who Code, Hackbright, etc.)
I found out about Thumbtack because three of the nine engineers on the team (at the time I was hired) had volunteered at Hackbright, an organization that provides engineering fellowships for women. This indicated to me that Thumbtack cares about hiring more women in engineering roles.
Questions About Culture
Ask This: What kinds of things do team members do together besides work? How central is drinking to social events?
A female friend of mine recently asked this question at a company where she interviewed for a software engineering position. The response? It was something like this:
“We do a lot of things outside of work together. I actually went surfing with one of my coworkers this morning. But if you wanted to find someone to, I don’t know, go shopping with you, I’m sure you could.”
Because, you know, all women love shopping. Such gender-based assumptions would cause me to worry about future assumptions that might be made. Not all answers will give such a clear signal, but any answer should still give you a good feel for the personalities of the people you would be working with.
This question can also suss out how central drinking is to social events. I’m not saying that women don’t like drinking, but team bonding that is centered around drinking can be an indicator of a “brogrammer” culture. Here we brew beer and take mixology classes together, but even when we do those things the focus is not on consuming excessive amounts of alcohol. Rather, we do these things to learn something new, appreciate the drinks, and get to know each other better.
Besides being an indicator for culture, excessive drinking can lead to uncomfortable situations for women as inhibitions are lowered and teammates say or do things they might not have otherwise.
I feel very grateful to have found a workplace that has such a fantastic culture and lacks many of the issues female developers face. Many thanks to the engineers on our team that have worked so hard to build this culture. I hope this post will help women who are currently looking for a job in software engineering, or who might be looking in the future. If you are currently looking, you might also want to check out the interview prep events hosted by Women Who Code, as well as posts like “Self Care Strategies for the Software Engineer Job Search” on the Hackbright Academy blog. Feel free to share some of the resources you use in the comments!