When I arrived at Thumbtack at the beginning of the summer, fresh off my first year at Stanford, I was mostly thankful that I wouldn’t be lounging in my parents’ basement for the next three months. A far-too-long internship search that began over winter break and dragged into late April had landed me at a company that, mere months ago, I had never even heard of. With this being my first internship and all, I wasn’t sure what to expect.
Fortunately for me, my uncertainty didn’t prevent Thumbtack from providing a lovely experience, often surprising me in the process. At the risk of repeating the wonderful things about our culture and work that Lily, Shyam, and Brandon have already pointed out, I’d like to share the ways in which Thumbtack gave me the internship I didn’t know I wanted. And that began with a...
Focus on Learning
It’s reflected in the stacks of books lying around the office and on people’s desks. It’s reflected in the stipend we get to attend conferences, seminars, and industry events. It’s reflected in the regular brownbag presentations that dig deep into the work various teams have been doing.
Wherever you look, Thumbtack exudes a focus on learning and a desire to quench intellectual thirst. This was made clear to me during my on-site interview, when I was told that my main job for the summer would be to learn and improve myself. As a student eager to absorb new knowledge and skills, I found this message particularly encouraging.
For me, learning came in two forms. First, there was learning...
From the Engineering Perspective
I arrived at Thumbtack just in time to participate in Fix-it Week, in which the engineering team collectively works on small tasks that likely wouldn't get prioritized in the normal workflow. I used Fix-it Week as an opportunity to familiarize myself with the website, pushing out several fixes that touched different pieces of the codebase. Most of our stack is composed of languages and frameworks I’d never worked with before, so to help me get the job done, I was encouraged to pair with multiple other engineers, many of whom weren’t on my team. Working with others not only increased my productivity, but also exposed me to many of the cool tools and helpful tricks of the trade. With their help, my code looked prettier and simpler and ran faster, and my skills as an engineer were better off for it.
Fix-it Week began a trend that continued throughout my internship. Even though my main project was to write a backend service in Go, I also had many chances to do things like running migrations on our database, modifying Angular forms, setting up A/B tests, and querying our big data clusters. Throughout my endeavors, I was often supported by other engineers who wanted to help me learn. The fact that I was encouraged to explore and tinker with parts of the codebase I wasn’t necessarily supposed to work on was something I greatly appreciated.
I loved spending my time working on challenging engineering problems, but I was also curious about what was happening outside of the engineering team. That’s why I also valued the chances I had to understand Thumbtack...
From the Business Perspective
Learning about the big-picture business probably isn’t something Thumbtack set out to have me do, but it’s something I was given plenty of opportunities to do anyway. The biggest factor was Thumbtack University, a two-day bootcamp where new team members get a comprehensive overview of the company, including things like marketing, branding, SEO strategy, technical and people infrastructure, how the business stands in the marketplace and what the future holds. Most of what Thumbtack University covered was pieces of the business which I would never work on myself, but I nevertheless appreciated the chance to learn.
Thumbtack University was just one example of how Thumbtack values communication and transparency. Team meeting notes are shared with everyone, as are slide decks for board meetings and all revenue metrics. Brownbags and weekly presentations were excellent opportunities to learn about the different ways in which Thumbtack runs successfully. This greater understanding of everything that’s happening around me is one of the reasons I felt so much like...
A Part of the Team
Officially, I was a member of the Growth team at Thumbtack. I worked with my mentor, Yunliang, and the rest of the team to increase traffic to our site and the number of requests made. One might also say I was a member of the larger engineering team and the Thumbtack team as a whole. But no matter which team I may be referring to, I always felt solidly like a part of it.
From day one, I did the same things everyone else on the team did. I went to meetings with the team, where I had a say like anyone else. I participated in team standups and demos and gatherings. I ate meals with the team, and I went to off-sites and bonding events with the team.
I was also given the same privileges and responsibilities as the rest of the team. I had the same permissions to admin as everyone else, and I could deploy to production like everyone else. I did code reviews and participated in Fix-it Days like everyone else.
Being part of a team and not just a twelve-week-long appendage felt empowering and gave me the confidence to take...
Ownership of my Work
I was encouraged, both explicitly and implicitly, to take charge of the work I did. This included all parts of the development lifecycle — designing the product, writing the code, creating unit tests, debugging on my test instance, deploying and monitoring the metrics, and quickly pushing out any fixes to be made. Naturally, I needed and received a lot of help from others in getting everything done, but I had to seek that help out myself.
Twice during my internship — once in the middle and once at the end — I presented my work at the Friday afternoon company-wide gatherings. These presentations gave me the chance both to share my work with the rest of the company and to reflect on and take pride in the things I had created. By taking ownership, I was able to gain a sense of accomplishment in the many tangible things I shipped throughout the summer, and I think that sense of accomplishment is a part of what makes engineering so fun and rewarding.
In the end, my time at Thumbtack was not only a great experience, but also made me realize the qualities that I value in an internship. Thumbtack exemplifies the kind of environment I’ll be looking for in any of my future endeavors. If Thumbtack sounds interesting to you, reach out. We’d love to have you!
On the first week of Thumbtack...
The office (I wouldn’t really call Thumbtack my “true love” just yet) gave me a promise of exciting times to come. The week rushed by with on-boarding meetings and processes (more exciting than you might expect, especially with the words “Friday massages”); an All-Hands meeting, in which all engineers come together and ideas fly around like bullets; many delectable Thumbtack lunches and dinners; and a team bonding event at the beach: a San Franciscan beach, i.e. a cold, bleak streak of sand at the edge of land, made enjoyable by the warmth of the surrounding company (pardon the pun). And that wasn’t all: I was internalizing the company goals and values, cramming 5+ years of team history and infrastructure into my brain, picking up Go, and spending hours of excitement with my mentor, Alex, in preparation for the launch of Kiki’s (email) delivery service.
On the second week of Thumbtack…
Alex asked me, “do you want to present at All-Thumbs tomorrow?” All-Thumbs is a company-wide gathering in which any team or team member can present… in front of everyone. So on my 9th day, I’m holding that microphone, standing in the spotlight, and trying to explain how Kiki will replace our current emailing setup on the website while simultaneously wondering whether the front row audience can hear the pounding of my heart.
Kiki had made good headway: we had spun up several million goroutines to track usage stats, written a preliminary version of the design doc and code, and had the environment all set up and running on AWS. On top of Kiki, I also talked about my work on a few Go packages that all our services now use. In fact, watching these packages imported by services going into production terrified me more than presenting at All-Thumbs: what if I had made a mistake? I could compromise the website’s security or crash our services, even after extensive code review (luckily, neither occurred). Only two weeks in, Thumbtack had thrown me challenges and projects with impacts beyond what I had ever experienced within the safety of school walls.
On the third week of Thumbtack…
I went back to school. I didn't even have to apply this time: I was automatically enrolled in Thumbtack University (my dream school, of course). Suffice it to say, school and lecture are the same no matter where you go... so perhaps 20 hours of “class” didn't make the highlight of my week. I have to admit, though, that the material was fascinating, and I came out amazed at how well the internal components of Thumbtack coordinated and worked together. This rapid and intense onboarding procedure, while not as fun as my work with Kiki, clarified Thumbtack's mission and made it concrete. So although I was uninspired to attend 20 hours of lecture, I gained more drive to help achieve Thumbtack's ultimate goals. I finally felt that I was a part of Thumbtack.
On the fourth week of Thumbtack...
Make Week kicked off! Make Week is a week in which the entire company emerges from whatever projects are currently underway to pursue their ideas on new features and product improvements that would otherwise be abandoned for more pressing issues. I had heard of companies holding hackathons or hack weeks, so I asked Marco at lunch (casual lunch with the CEO, no big deal) why Thumbtack adamantly stuck to “make” instead of “hack.” His response: “hack” is typically connected with engineering, and Make Week is a week intended for everyone (engineering, marketing, design, and more) to stretch their minds and innovate ways for the company to evolve. To imply that only engineers should participate would be to lose the valuable minds of 2/3 of the company. I found this to be yet another example of the enormous effort at Thumbtack to encourage transparency and communication across the company: no one team ever functions completely separately from another, and transparency works as the oil to keep the internal mechanics of Thumbtack running smoothly.
I took a break from Kiki this week and had the opportunity to pair with several different engineers to work on several Make Week projects of my own, including an internal server for Godocs, a script to automate setting up environments and applications, and a few more Go packages for our services. Finally, I again presented at All-Thumbs, albeit feeling slightly calmer this time. Make Week had been a time of exploration, and I had worked on several projects that would become increasingly important and useful during my internship.
On the fifth week of Thumbtack...
Where there are problems... there are Fix-Its and Thumbtack engineers. It was Fix-It Week, the week in which Eng and Product tackled issues previously set aside for more urgent projects over the course of the last term. This included strategizing for the future to prevent predicted problems from ever being born. Although I was involved in planning for my team, AWS, whose general purpose is to migrate all servers to AWS, my involvement with Kiki and our Go skeleton also placed me in ArCo (Architecture Committee), which was to oversee our move to SOA (service-oriented architecture). Sitting in our meetings, I was struck by the importance of this committee: we were essentially proposing a standard on how and why services could and should be created for years to come. And when Alex mentioned that Kiki would set the example for “best service practices”... well, no pressure. While I lacked the experience of many of the engineers in the room, it was strangely easy to voice my ideas and questions (everything was taken seriously, no matter how naïve I thought I sounded), which made me appreciate the openness of my teammates to new viewpoints and ideas. So on top of creating Kiki, I began documenting the steps to service creation and started a checklist of essential service elements. One of my biggest regrets is that we probably won't be able to complete this monumental task before this internship is over, and I wish I could stay to see the ultimate outcome.
On the sixth week of Thumbtack...
Things went wild. It was the week of the third quarter kickoff—an entire day of celebrating what we had achieved the previous quarter, and of gearing ourselves up to achieve our goals for the quarter ahead.
The presentations foretold the exciting, but simultaneously intimidating, months ahead. To conclude the quarter, we all headed over to a carnival-themed gathering, complete with aerial silk dancers, fortune tellers, donut burgers (I highly recommend them; they were delicious), face painting, and much more. But the excitement didn't stop when we returned home at midnight...
Like most of us, I enjoy reading my phone notifications when I wake up in the morning. Unless it's an email at 7 A.M. from Mark (VP of engineering) reporting:
EMERGENCY: emails queued up, 0 sent in 2 hours.
Guess who was in charge of the emailing services? Yup... I had bloodied my hands with my first emergency. Luckily, the issue was resolved quickly (thankfully before Alex woke up) and with nearly zero impact, but I had learned my lesson well. At Thumbtack, we have a system of postmortems: every emergency is “owned” by one or a few people, and analyzed for what can be done to avoid anything similar in the future. I felt strangely proud to own my first postmortem; I had known before that a careless action could bring the system down, but the reality of it hit me hard. The postmortem now serves as my permanent reminder, like a burn scar that harks back to a childhood memory of curious fingers playing with fire.
On the seventh week of Thumbtack...
Go Gophers! This week, I flew out with 5 other engineers to Denver, Colorado for the second-ever Gophercon, a conference dedicated to the Go programming language. I won't go into much detail about the 20+ fantastic talks, but I did document all my learnings in this Go wiki, which will hopefully prove useful in the future! I also somehow acquired a ticket to the sold-out Go workshop track, whose sessions were “deep dives” into some of the more advanced (and really cool) features of Go.
Besides the conference's enlightening and intellectual lectures, I got a taste of the Go community, and more broadly, the “coder community” at large. It came as a surprise that I was the only woman in a room of 100+ men during the workshops, and it took significant effort to find another woman in the conference of 1500 attendees (many of whom had long hair). Although the gender balance in engineering isn't quite 50-50 at Thumbtack or at school, I had never before experienced the notorious gender disparities and stereotypes as I did now, such as the automatic assumption that I was attending the “Go Bootcamp,” meant for beginners of Go, rather than the workshops intended for more experienced participants. It didn't make a difference in my learning experience, but admittedly there were some awkward moments, like when a bunch of guys refused to walk through a doorway until I had passed through. The conference had also amusingly assumed that everyone attending Gophercon was above 21. After talking to the organizers about the drink tickets I had received for the after-party at the brewery, I was reassured the next conferences would be more minor-friendly.
Nevertheless, these small bumps during the trip did nothing to lessen my enjoyment of the conference and of Denver in general. My appreciation of Go definitely increased (one engineer jokingly called me a “go-fangirl”), and I spent my free time sampling some of the famous foods, museums, and historical districts of Denver.
On the eighth week of Thumbtack...
Kiki blasted off, sending more than 100 emails a second. We deployed Kiki to send all emails from all the engineers' test versions of the website, as well as all the staging emails used in our second office in Salt Lake City. The usual testing process for new services includes unit tests, integration/end-to-end tests, load tests, and finally an A/B test to ensure that Kiki can handle everything that could possibly go wrong. Thus, we attacked Kiki with 7x more email requests than we currently handle, simulated network failures, and manually triggered panics, all through which (to my surprise) Kiki came out relatively unscathed. While not much new code emerged from the process, Kiki became more refined and robust, nearing its production birthday.
As exciting as email-sending was, more overwhelming were the 58 reporters and media frenzy that greeted me Thursday morning, causing traffic jams on 9th street as Jeb Bush's Uber driver meandered over to Thumbtack HQ (see more on his visit here). It was simultaneously terrifying and reassuring to meet one of the potentially most powerful people in the world—and to realize that even figureheads are human.
On the ninth week of Thumbtack...
Kiki hit a bit of a road-bump. Remember how I was on the architecture team? Well, one of the most argued points had been how to ensure data persistence in the face of network failures or unpredictable hardware failures. Kiki currently had a simple system with goroutines and file storage for saving email data: for every request, Kiki would spawn a new goroutine, write the data to a file, send the email, and then delete the data, ensuring that, save for very rare occasions, unsent emails would not be completely lost. On top of email sending, we ran a “cron job” type goroutine that pulled unsent data from the file system and resent emails on a set schedule. Potentially the worst feeling in my internship so far was checking this code into a new branch, heading back to master... and deleting it all. The architecture team had agreed (myself slightly reluctantly) upon a data persistence system involving queues, a conditional write check to avoid duplication, and a two-tier environment setup. Such a system could be implemented once in our shared library, tested extensively, and then be used by all. Having all our services follow the same patterns would make designing new services and debugging existing ones more efficient, not to mention that we would avoid long debates over each service's particular design patterns on this topic. With this pattern, essentially all emails would first be placed in the queue, after which a “worker tier” application would pull from the queue and send the email, using a conditional write to ensure that the email had not yet been sent. The second tier, a “web tier” application, would expose an API to the outside world, allowing our application to respond to more than just requests from the queue, e.g. requests to unsubscribe emails marked as bounces.
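To make the worker-tier idea concrete, here is a minimal sketch of the queue plus conditional-write pattern. It is written in Python for brevity and is not Kiki's actual Go code: `SentLog`, `mark_sent_if_absent`, and `run_worker` are hypothetical names, and an in-memory set stands in for whatever datastore provides the conditional write.

```python
from collections import deque


class SentLog:
    """In-memory stand-in (an assumption) for a store with conditional writes."""

    def __init__(self):
        self._sent = set()

    def mark_sent_if_absent(self, email_id):
        """Record email_id and return True only if it was not recorded before."""
        if email_id in self._sent:
            return False
        self._sent.add(email_id)
        return True


def run_worker(queue, sent_log, send):
    """Drain the queue, calling send() at most once per email id."""
    delivered = []
    while queue:
        email = queue.popleft()
        # The conditional write is the duplicate check: a repeated message
        # fails the write and is skipped instead of being sent twice.
        if sent_log.mark_sent_if_absent(email["id"]):
            send(email)
            delivered.append(email["id"])
    return delivered
```

In this sketch the id is claimed before sending, which gives at-most-once delivery; claiming it only after a successful send would trade that for possible duplicates on failure. Which ordering the real system used is not something this sketch asserts.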
Starting over, although I understood the need to do so, was slightly disappointing—Kiki had been so close, and now productionization felt weeks away (weeks I didn't have). However, I now had the opportunity to abstract Kiki to be used for other notification services (SMS and push notifications), as well as organize Kiki's code into a more understandable package setup. I was more than determined to get Kiki back to production-ready by the end of the week, and was able to succeed, implementing the new data persistence design pattern, pulling out Jiji, Kiki's new webserver sidekick, into a new web tier environment, and reconfiguring our deployment scripts to work with dual application tiers. I ended a tiring, frustrating, but ultimately rewarding week with a session of rooftop yoga led by Jeremy—an upside-down San Francisco sunrise had never looked so good.
On the tenth week of Thumbtack...
We dark launched Kiki into production, essentially running Kiki in parallel to the current email-sending system so that Kiki could practice sending emails in production without actually sending them to their designated recipients. After the previous week of chaos and non-stop coding to get the refactored Kiki back into shape, I felt like I could stare at Kiki's metrics and dashboard forever and never get bored—it was unbelievable to watch those numbers tick every second and realize that production was actually happening.
And now it was time to experiment! We ran profiling tools to figure out where CPU was used most and tweaked Kiki to perform even better, testing with different network resources, memory resources, and machine models (surprisingly, we were CPU-bound, mainly due to the context switching of goroutines). Kiki was becoming polished: I now could settle down and tidy up loose ends, making Kiki as perfect as possible. I also integrated our push notification system (Lakitu) with Kiki, getting a taste of the mobile team's work and collaborating with their team members (I was tempted for a moment to leave AWS for mobile to obtain one of Thumbtack's iPhones... but decided to remain loyal to Android). It was an incredible feeling to realize that what started as a relatively small summer project (an emailing service) had transformed into something much bigger: a service that would set the standards for all services to come, and that would handle more than 5x the number of requests initially planned. Although push notifications were not yet integrated with the website, I had finished 2/3 of Kiki's final product, and all the extra hours of work had definitely been worth it. For emails, what remained was to do an online A/B test with Kiki, to ensure that Kiki worked as well as the current script attached to the website. The week ended with a much needed break: an AWS team celebration of the past quarter's work, complete with a 14-dish, family-style dinner.
On the eleventh week of Thumbtack...
As exciting as spinning up new services and scripting new deployment features had been, this week was a time to visit the past. This meant plowing through fifteen code reviews and modifying seven of our code repos, including those of services that had been untouched for over half a year. We wanted to bring all our older services, such as Hercule, up to the standards by which Kiki now abides. Of course, before doing this, we had to first decide on Kiki itself: should we use flags or environment variables? Should something like ports be configurable to ensure future portability (pardon the pun)? Should we alter the code for readability and clarity, or keep it concise and add documentation instead? How should we track metrics and alert on errors? It was slightly exhausting to code a change, decide that we should remove it, and then decide later to change it back to the original. One thing I learned: if you put a group of highly knowledgeable (and opinionated) engineers in a room and debate a controversial decision, discussions can linger on forever—it's nearly impossible to find a solution that satisfies everyone. Sitting in those committees reminded me of debate tournaments—just when I found myself convinced by one point, someone would highlight the torrent of problems that came with it. The often heated back-and-forths definitely never got boring.
After working on all our services, I also cleaned up our deployment and service resource creation scripts and demoed these for our Engineering and Product teams! Every other week, we have "deep dives" into current projects underway or new procedures/tools that all engineers should know and use. My scripts fit into the latter category, and it was truly awesome to see other engineers across all teams using them—what started as a (slightly selfish) Make-week project to simplify my work setting up AWS environments and enforcing environment standards had turned into a productionized product that reduced what took hours to do into minutes, allowing any engineer to create AWS resources without double checking against Thumbtack's standardized setup configuration or asking someone on the AWS/infrastructure team. Immediately after the demo, I dove into my last, 20-minute All-Thumbs presentation, summarizing my work with Kiki, on the architecture team, and on our deployment scripts! Unlike my first, 3-minute presentation, the nerves had disappeared—and I left the podium (literally) dropping the mic.
On the twelfth week of Thumbtack...
I went off into the land of Big Data, learning how to use Hadoop, Hive, Spark, and more to extract data to analyze for Kiki's A/B test. We started off with 1% of production traffic sent to Kiki, then once metrics from emails sent from Kiki showed no significant deviation from baseline metrics, moved to 10% traffic, then 50%, and eventually to 100%! I paired with some of our data scientists on the Data Platform team—it was like stepping off into another world, leaving AWS to encounter the world of SQL and Hadoop clusters. With this final pairing, I realized I had actually paired with people on every single engineering team during my time here, either helping them with AWS or receiving help myself—I had integrated mobile push notifications into Kiki, dealt a little with our data platform, worked with the matching service, and code-reviewed some of our growth services. But while I had at least skimmed the surface of most of our engineering team and code-base, I had yet to work with designers or product managers—I guess working on our infrastructure and back-end services had to have some cons (although I didn't really mind at all).
I also had the chance to experience first-hand the impact of Thumbtack's product. One of the aspects I appreciate most about carsharing services is the conversation—not only do I get a ride, but I get the chance to have a nice chat along the way! During one of my excursions, I discovered my driver's dream was to open his own restaurant—and he was providing catering services to fundraise and bring his culinary skills out into the public. And so, of course, I brought up Thumbtack! Right before we parted, after I said that I hoped that Thumbtack will help him achieve his goals, he commented, "You must really love your job—I can hear it in the way you talk about it, it's really genuine." He couldn't have been more right. (I've checked back with him, and he's received 100 requests in the four days since he's signed up!) I also signed up as a pro myself for chamber music performance, and got my first hire this week!
After the launch of Kiki, things rapidly came to a close; we ended our internships with a slew of farewell dinners with Marco (CEO), Mark (VP Eng), and the team, which had come to feel like a second family. As always, the culinary team outdid themselves (dining hall food will definitely pale in comparison to this summer's meals, which included "farro risotto, lamb porterhouse, seared NY strip, zucchini a la plancha, blue cheese salad", and good old "buttermilk country fried chicken"). Although we were all sad to leave, the nights were full of our stories of our adventures, blunders, and most importantly, unforgettable learnings from this incredible summer. As we head back to school, we go armed with a new set of tools and experiences, and the knowledge that we made a difference in someone's life this summer.
A little about me—I'm currently a rising junior at Harvard studying computer science and mathematics. To be honest, I can't remember when I first heard about Thumbtack, but I do recall my first intense interview with Alex and encountering the genuine passion of the team to solve the many challenges Thumbtack faces. It's been an unforgettable experience, and I wish I could partake in Thumbtack's bright future ahead. Although my time at Thumbtack is over for now... perhaps yours is just about to start.
 Jiji is Kiki's anthropomorphic talking pet cat and closest companion. (Coincidentally, Jiji has a girlfriend, a white cat named Lily.)
 Lakitus are Koopas who ride clouds through the skies, mostly dropping Spiny Eggs on Mario or Luigi.
We run a lot of A/B tests at Thumbtack. Because we run so many A/B tests at such a large scale, we want to make sure we run them correctly. One issue we’ve run into when running A/B tests is that a difference could still exist between the test and control groups by chance -- even if we randomize. This causes uncertainty in our online A/B tests. In these cases, the question we need to answer is: if we observe a difference, is it because of the test feature we just introduced, or because the difference is pre-existing?
We propose that for online designed experiments, a proper randomization procedure enables us to attribute an observed difference to the test feature, instead of to a pre-existing difference in the test groups. Implementing this approach has given us greater confidence in our A/B test results than we had previously. We have drawn from the PhD thesis of Kari Frazer Lock in developing this approach to our A/B tests.
To illustrate the problem, let’s consider an example. Suppose our engineering team has decided to play a friendly game of tug-of-war after eating their favorite superfoods. To test out which kind of food helps people win, we randomly assign them into two groups. Team A channels their inner Popeye and eats spinach salad, and team B decides to chow down on their superfood of choice: kale. Team A wins and claims that spinach induces superior strength. Is that so?
We all have different heights, weights, inherent strength, etc. Suppose that by chance alone, team B members have an average weight that is 15 pounds lighter than team A's, and all the folks over 6 feet tall ended up on team A. Plus, team A got 18 engineers and team B got 17. Adding up all the differences, team A ended up with a disproportionate advantage!
Spinach, that wasn't fair!
Variation in Test Units
In online A/B experiments, the test unit is usually a user or visitor. Users differ greatly from one another: some visit very often, some have faster internet, some use mobile exclusively, etc. When each user interacts repeatedly with our site, we can proactively seek to balance out users based on historical data.
Randomization alone is not enough
Even if we randomize the initial assignment of users between test and control groups, a difference could still exist between the groups that is due not to the test feature but to chance alone. This causes uncertainty in our online A/B tests. If we observe a difference, is it because of the test feature we just introduced, or because the difference is pre-existing?
If we have baseline characteristics for test subjects, we can try to balance the test and control groups on these characteristics before we run the experiment so any observed difference can be attributed to the test feature.
Obviously, this is only feasible when we have some information on the test subjects. When Thumbtack was relatively new and most observed interaction between users and the product came from new users, this step could not be done. Now that Thumbtack gets a lot of repeat visits, we can strive to balance out experiments in a way we weren’t able to do previously and thus get more accurate measures which give us more confidence in our conclusions.
The chance that at least one of the test groups has different baseline characteristics rises as variation among test units increases and as the number of samples in each group declines (e.g. when we test multiple variants in a single experiment).
The Solution V1: A/A Tests
One natural solution is to run an A/A test on the test groups before the test feature is introduced. For example, if we plan to roll out a feature next month, we can assign the users to two groups according to some rule and compare the groups' metrics in last month's data. In the month before our test, the test and control groups should have no pre-existing difference, and thus these are “A/A” as opposed to “A/B” tests. Using our tug-of-war example, an “A/A” test would be a game with both teams on the same diet.
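As a rough illustration of what such an A/A check could look like, the sketch below compares one historical metric between two groups. The function name `aa_test_p_value` is hypothetical, and the choice of test is an assumption: it uses a Welch-style statistic with a normal approximation for the p-value, which is reasonable for large groups but is not necessarily the test we run in practice.

```python
import math
from statistics import mean, variance


def aa_test_p_value(group_a, group_b):
    """Two-sided p-value for a difference in means between two groups.

    Uses the Welch standard error and a normal approximation, so it is
    only a sketch suitable for reasonably large samples.
    """
    se = math.sqrt(variance(group_a) / len(group_a)
                   + variance(group_b) / len(group_b))
    if se == 0:
        # No variation at all: groups are either identical or trivially different.
        return 1.0 if mean(group_a) == mean(group_b) else 0.0
    z = (mean(group_a) - mean(group_b)) / se
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal.
    return math.erfc(abs(z) / math.sqrt(2))
```

A small p-value on last month's data would mean the A/A test "fails": the groups already differ before the feature exists.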
Historically this is how randomized experiments are done in the biomedical field. In any published paper on such studies, the very first section is to establish there is no existing imbalance (the famous “table 1"). And if there are any, they can still be accounted for in downstream statistical analysis.
For online A/B tests for any web facing company, running A/A tests is a relatively cheap solution for a step in the right direction. When an A/A test shows a pre-existing imbalance, i.e. the test "fails", we should take caution in interpreting the A/B test result. Depending on the severity of the imbalance we can choose to ignore, statistically adjust, or re-run the experiment.
But there has to be a better way than waiting to see if an “A/A” test fails?
The Solution V2: Repeated Re-randomization
In online A/B tests, test units are usually assigned according to their id, and a random seed ensures that each experiment uses a different randomization. Usually a hash function, say SHA1, takes the seed and user id and turns them into an integer. These integers are then split into test groups.
We can repeatedly compute A/A test results until we find a split where all A/A tests are flat. This step can greatly reduce the chance of a failed A/A test run on the pre-experiment period.
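A minimal sketch of this kind of seeded, hash-based assignment is shown below. The function name `assign_variant`, the `seed:user_id` string format, and the use of a modulo to split the integers are assumptions made for illustration, not a description of our production code.

```python
import hashlib


def assign_variant(seed: str, user_id: str, num_variants: int = 2) -> int:
    """Map (seed, user_id) to a variant index in [0, num_variants).

    Hypothetical helper: the key format and the modulo split are
    illustrative choices, not the production implementation.
    """
    key = f"{seed}:{user_id}".encode("utf-8")
    digest = hashlib.sha1(key).hexdigest()
    # Interpret the 160-bit digest as an integer, then bucket it.
    return int(digest, 16) % num_variants
```

The same seed and user id always land in the same group, while changing the seed reshuffles every user, which is what makes trying many seeds possible.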
It turns out to be quite simple in theory!
- Randomly select a seed and randomize test subjects by this seed.
- Run the A/A test on all metrics of interest.
- If the A/A test fails on any dimension, discard and go back to Step 1.
This way we will end up with a seed that can balance test subjects. This procedure can go through anywhere from tens to thousands of seeds before finding a balanced one. The number of seeds you need to go through to find a balanced one depends on how many baseline characteristics we want to balance on and on the amount of variation among subjects.
In theory, our historical metrics could be correlated in such a way that the repeated procedure takes unreasonably long to find a proper assignment. In practice, we set an upper bound M on the number of trials, and we track every seed along with its minimum p-value across all the baseline characteristics. If the procedure fails to find a seed that balances the treatment variants to the pre-specified thresholds within M trials, we return the best result for a human to judge. This guarantees that the procedure terminates.
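The whole procedure, including the upper bound M and the fallback to the best seed seen, can be sketched roughly as follows. Everything here is an assumption for illustration: the names `find_balanced_seed` and `z_test_p_value`, the shape of the `baseline` dictionary, the two-variant split, and the default p-value threshold of 0.2 are all hypothetical, and a simple two-sample z-test stands in for whatever test a real system would use.

```python
import hashlib
import math
import random
from statistics import fmean, variance


def assign_variant(seed, user_id):
    # Hypothetical seeded hash split into two groups (0 or 1).
    digest = hashlib.sha1(f"{seed}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 2


def z_test_p_value(a, b):
    """Two-sided p-value for a difference in means (normal approximation)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        return 1.0
    z = (fmean(a) - fmean(b)) / se
    return math.erfc(abs(z) / math.sqrt(2))


def find_balanced_seed(baseline, max_trials=1000, threshold=0.2, rng=None):
    """Try up to max_trials seeds; return (seed, min_p_value, found).

    baseline maps user_id -> tuple of historical metric values.
    A seed "passes" when every metric's A/A p-value is >= threshold.
    If no seed passes, the seed with the largest minimum p-value is
    returned with found=False so a human can judge it.
    """
    rng = rng or random.Random()
    n_metrics = len(next(iter(baseline.values())))
    best_seed, best_p = None, -1.0
    for _ in range(max_trials):
        seed = str(rng.getrandbits(64))
        groups = ([], [])
        for user_id, metrics in baseline.items():
            groups[assign_variant(seed, user_id)].append(metrics)
        # The worst-balanced metric decides whether this seed passes.
        min_p = min(
            z_test_p_value([m[k] for m in groups[0]], [m[k] for m in groups[1]])
            for k in range(n_metrics)
        )
        if min_p > best_p:
            best_seed, best_p = seed, min_p
        if min_p >= threshold:
            return seed, min_p, True
    return best_seed, best_p, False
```

Tracking the minimum p-value per seed is what lets the loop hand back a "best of M" candidate instead of nothing when no seed clears the threshold.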
What does the human judge do? Based on domain knowledge and business priorities, this human (a data scientist at Thumbtack) can choose either to re-run the procedure, or decide if the best of M results is good enough, or further scrutinize metric computation and selection via offline analysis.
Waterproof solution?
But does this procedure guarantee perfect balance in our test and control group every time?
No. It only minimizes imbalance as best we can. Potential reasons include:
- Randomness. Observations from random variables are inherently random, for both endogenous and exogenous reasons.
- Existing users can change their behavior, independent of our test feature.
- Within-company changes, e.g. an ad campaign could start and affect regions in only one of the test groups.
- Another team could start an experiment in the next month that inadvertently and partially overlaps with our experiment.
- A subset of users shows strong seasonal difference, e.g. snow plowing and yard work.
- Externalities, e.g. competitor that targets a certain segment could show up and affect the whole marketplace.
- New users may sign up during the test period. We cannot balance new visitors; we can rely only on randomization, so imbalance could occur.
The solution we have developed is not waterproof. However, implementing it in our A/B tests has given us greater confidence in our A/B test results than we had previously.
Empirical Results via Simulation
To illustrate, we simulate three metrics X, Y, and Z, measured over N users, from a multivariate normal distribution with a pre-specified mean and variance-covariance structure. We randomly split the users into two groups and test for a difference, i.e. perform “A/A” tests. In the following examples, we assume a total of 50,000 users and two equal-sized variants; we simulate 100 rounds and count the number of false positives for each metric.
Case 1: independent normal
When X, Y, and Z are independent, we expect p-values from the “A/A” test to follow a uniform (0,1) distribution. It is then trivial to compute the expected number of false positives at the 5% significance level when we repeat the simulation K=100 times: roughly 5 significant in X, 5 in Y, and 5 in Z. Indeed, we observe 2 in X, 6 in Y, and 4 in Z.
Delta and its 95% confidence interval clearly show that, after re-randomization, the “A/A” test yields much better balanced groups.
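For readers who want to reproduce the flavor of this case, here is a rough sketch of the "before re-randomization" count using only the standard library. The function names, the standard normal parameters, and the 5% significance level are assumptions for illustration; this is not the code behind the numbers reported here, and it will not reproduce them exactly.

```python
import math
import random
from statistics import fmean, variance


def z_test_p_value(a, b):
    """Two-sided p-value for a difference in means (normal approximation)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = (fmean(a) - fmean(b)) / se
    return math.erfc(abs(z) / math.sqrt(2))


def count_false_positives(n_users=50_000, rounds=100, alpha=0.05, seed=0):
    """Count significant A/A results per metric over repeated random splits."""
    rng = random.Random(seed)
    half = n_users // 2
    counts = {"X": 0, "Y": 0, "Z": 0}
    for _ in range(rounds):
        for name in counts:
            # Independent metrics, so each one is drawn on its own.
            values = [rng.gauss(0.0, 1.0) for _ in range(n_users)]
            # The draws are i.i.d., so first half vs. second half
            # is equivalent to a random split into two equal groups.
            if z_test_p_value(values[:half], values[half:]) < alpha:
                counts[name] += 1
    return counts
```

With `alpha=0.05` and 100 rounds, each count should hover around 5, with the kind of round-to-round scatter seen in the observed 2, 6, and 4.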
Case 2: independent metrics, one log normal
Of course, rarely are we so lucky to have all normally distributed metrics. So now, let’s change things up by making Y follow a log-normal distribution. Similarly, in terms of false positives before re-randomization, there were 5 in X, 6 in Y and 6 in Z. After re-randomization, everything is well balanced.
Case 3: independent discrete values
What if our metric value is discrete? Let’s check by making Z a discrete variable. There were 6 false positives in X, 3 in Y and 6 in Z.
Case 4: Correlated metrics
Finally, we investigate correlation between metrics X, Y, and Z. Deriving the expected number of false positives is straightforward; we leave that as an exercise for readers, along with why it is OK to use a z-test in all of the above situations. Here, as an arbitrary choice, X and Y are moderately positively correlated (correlation coefficient 0.5), X and Z are mildly positively correlated (0.2), and Z is mildly negatively correlated with Y (-0.1). X and Y had 11 and 6 false positives respectively, while Z had 6.
In all of the cases above, it is clear that such a procedure improves balance, and can help us draw better inference in subsequent A/B tests.
This year, we sent 20 members of the Thumbtack team to PyCon in Montreal. We all had a great time, learned lots, and really made a name for ourselves. By the end of the conference, everyone knew who we were and that Thumbtack enables you to get your personal projects done.
We also had great swag: a comfy t-shirt, sunglasses, and a beer glass. However, unlike most other booths, we didn’t give it away for free. We wanted the PyCon attendees to work for it! For the third year in a row, we created a code challenge that engineers would have to correctly write up in Python to receive anything. At first, submissions slowly trickled in, but by the end of the conference, people were really excited to solve our problem. Some people didn’t even talk to us, just walked to our booth, picked up the challenge sheet, and walked away. In total, we got 87 submissions! And now, the beer our winners drink out of those glasses will taste a little sweeter because it’s flavored with sweet, sweet victory.
When I was little, my family went to our town’s district math night. We came back with a game that we still play as a family. The game is called Inspiration. It’s played with a normal deck of cards, with the picture cards taken out. Everyone gets four cards and one card is turned face up for everyone to see. You then have to mathematically combine your four cards with addition, subtraction, multiplication, and division to get the center card. The person who does it the fastest wins.
This year, our challenge was inspired by Inspiration, no pun intended. The first part asked people to write a Python program that takes in four numbers and determines the mathematical expression that can combine the first three numbers to get the fourth. If they could solve this, they were awarded a t-shirt and sunglasses. The harder challenge was to solve the same problem, but with an arbitrary number of inputs. The number to solve for was always the last number in the string, but the total number of operands was not constant. These solvers won the coveted Thumbtack beer glass.
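For the curious, a brute-force attempt at the challenge might look something like the sketch below. It is not any attendee's submission. It assumes the operands may be reordered and that the expression is evaluated strictly left to right, so it does not explore every possible parenthesization. The function name `solve` and the output format are made up for illustration.

```python
from fractions import Fraction
from itertools import permutations, product
import operator

OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def solve(numbers):
    """Return an expression over numbers[:-1] that equals numbers[-1], or None."""
    if len(numbers) < 2:
        raise ValueError("need at least one operand and a target")
    # Fractions keep division exact, avoiding floating-point surprises.
    *operands, target = [Fraction(n) for n in numbers]
    for perm in permutations(operands):
        # product (not combinations) because operator order matters.
        for symbols in product(OPS, repeat=len(perm) - 1):
            value = perm[0]
            expr = str(perm[0])
            try:
                for sym, operand in zip(symbols, perm[1:]):
                    value = OPS[sym](value, operand)
                    expr = f"({expr} {sym} {operand})"
            except ZeroDivisionError:
                continue
            if value == target:
                return f"{expr} = {target}"
    return None
```

Because the target is simply the last element of the list, the same function covers both the four-number version and the arbitrary-length version, although the search grows very quickly with the number of operands.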
Hall of Fame
Most of the solutions had some commonalities. They used brute force, and they used Python’s built-in itertools library to create permutations of the numbers and combinations with replacement of the operators. The following solutions were my favorites:
Greg Toombs had the shortest solution, with only 19 lines of code. You can find Greg on LinkedIn.
Robbie Robinson had one of the cleanest solutions. You can find Robbie on LinkedIn.
Thanks to everyone who submitted a solution! Can’t wait for PyCon next year!
We recently added automatic dependency injection to the PHP codebase powering our website. As we’ve said in the past, dependency injection is a good move for a lot of reasons. It leads to clearer, easier to understand code that is more honest about what it depends upon. Automatic dependency injection reduces boilerplate code to construct objects. And of course, it makes code easier to test.
But it had another benefit we weren't expecting. It made our pages load a few milliseconds faster.
Some of our dependencies are slow to construct because they need to read a settings file or because they use a library that instantiates a ton of objects. Our dependency injection framework allows dependencies to be constructed lazily, which (for most requests) means never constructing them at all. Better, faster code — what's not to love?
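To make the idea of lazy construction concrete, here is a toy sketch. It is written in Python rather than PHP, and it is not ttinjector's API: the `Injector` class and its `register` and `get` methods are hypothetical names chosen only to illustrate the concept.

```python
class Injector:
    """Toy container that builds each dependency on first use only."""

    def __init__(self):
        self._factories = {}
        self._instances = {}

    def register(self, name, factory):
        # Store the recipe; nothing is constructed yet.
        self._factories[name] = factory

    def get(self, name):
        # Construct on first request, then reuse the cached instance.
        if name not in self._instances:
            self._instances[name] = self._factories[name](self)
        return self._instances[name]
```

A request that never calls `get("settings")` never pays the cost of reading the settings file, which is where the saved milliseconds come from.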
Ready to try out dependency injection for yourself? We made our library, ttinjector, public for all to use.