Thumbtack Engineering

Fast iOS Functional Testing

Here at Thumbtack we use KIF to drive the functional tests for our iOS apps. For those unfamiliar with functional testing in iOS, KIF essentially allows us to write tests that programmatically mimic a user: touching, swiping and typing.

We're very much in full swing, working our way through a long list of high-priority features, with the mythical feature-complete nirvana still far beyond the horizon. Every new feature (and sometimes a refactor) introduces more functional tests. A major limiting factor of functional tests is that they're slow. For every programmatic tap or swipe, we have to wait for iOS to perform its elaborate animations. Our functional test runtimes were rapidly approaching 10 minutes. Compounding the situation, our CI setup runs tests on more devices and iOS versions than a developer typically has on hand, resulting in a high likelihood that a branch will fail CI when first pushed.

We like to iterate quickly, so something had to change. What if we could disable animations?

Disabling Animations

We were a little hesitant going down this road; we liked that our functional tests were a (somewhat) accurate reproduction of real world usage. Fortunately, disabling animations proved to be a quick and simple task, thanks to Method Swizzling. Replace a method here, replace a method there and, hey presto. We gave it a shot.

At this point the assumption was bold: swizzle some methods, reduce CI runtimes by 100x, and head out to the local watering hole. Nice.

Not so fast - or should I say - not fast enough?

We immediately noticed that our tests weren't as reliable as they once were. KIF would time out waiting for views to appear, our state machine would raise exceptions about invalid state transitions, assertions would fail, or, even more ominous, we'd hit Core Data's infamous "could not fulfill a fault" exception.

Race conditions everywhere

(Image: John McCririck, a well-known horse racing pundit, as our UK readers will recognize immediately.)

Some of the state transition errors, assertions and Core Data crashes looked suspiciously similar to crashes we had been seeing in production, yet unable to reproduce locally. That's when the penny dropped. We had stumbled upon a method by which we could really put our apps through their paces, cause them some stress. Animations had been acting as a shroud over our eyes, denying us the opportunity to see the truth about the (in)stability of our apps.

A sizable portion of the work required to get CI green again was of a pedantic nature. KIF (and probably any iOS functional testing framework) makes it easy to write tests that pass based on implicit assumptions. For example, with animations enabled waiting for view A to appear also always means that view B is very likely now visible. Once animations are disabled however, these assumptions may no longer hold true. While this was quite annoying, it was a very small price to pay for the opportunity to fix some of our most common production crashes.

At this point, you may be asking: "All of your users will be using the app with animations enabled, so how does disabling them represent a realistic scenario?" There are many, many factors that determine when exactly a particular unit of code may be executed. Now multiply these factors by all the different models of iOS devices in our customers' hands, and further multiply that by the unique conditions under which each device is operating (memory pressure, network speeds, etc.). I like to think of having animations disabled as the most extreme conditions our apps might experience. By testing at much lower thresholds, we hopefully reduce the risk that variations in real-world usage will result in a crash.

What We Fixed

While fixing bugs is always a rewarding experience, my personal favorite outcome of this endeavor was that it led to a much needed refactoring of our authentication mechanisms. But for those of you looking for something more tangible, here are a few of the bugs we fixed:

Sign out

When a user signs out of the app, we reset our Core Data stack; the main queue context and private persistent store contexts are deallocated and re-instantiated. With animations disabled, we immediately began to see Core Data faulting exceptions. Some controller(s) were attempting to perform operations using an NSManagedObject whose NSManagedObjectContext had been deallocated.

The immediate cause of the crashes was controllers holding strong references to managed objects. We fixed those by instead holding a reference to the NSManagedObjectID, using -[NSManagedObjectContext existingObjectWithID:error:] to load the object as needed, and handling a nil return value.

The bigger problem was that the app could even get into such a state: controllers were still operating on data while sign-out-related actions were being performed. This broke down into two problems:

  1. Calls to our server-side API may complete after sign out, triggering controllers to attempt to reload data. This was solved by first waiting for all API calls to complete, and then performing a non-animated pop to our root view controller.

  2. We didn't have a state where we could be certain that all sign out actions were complete. Previously, all sign out actions were performed on transition to a 'Not Authenticated' state. We moved all of these actions to a new 'Unauthenticating' state, so we can be certain that by the time we transition to 'Not Authenticated' all sign out actions have been performed and we can safely reset the Core Data stack.

NSOperation's completionBlock

In response to a server-side API call, we enqueue an NSOperation to map the JSON data onto objects. We were using -[NSOperation completionBlock] to perform a few more actions. A key detail of the behavior of completionBlock is that it's called after the operation is marked as finished. The documentation is clear on this, but it's a detail we overlooked.

Overzealous use of temporary private queue contexts

In our Thumbtack for Pros app we present Pros with invites to bid on jobs; these invitations are modeled, literally, as an Invite. Invites can expire or, for various reasons, become unavailable. When we detect that an Invite is no longer available, we delete it in a temporary child private queue context. We delete in a child context because there may be many Invites to delete and we want those deletions to happen in a single transaction.

We have another mechanism that constantly watches the main queue context for any changes to an Invite and then attempts to insert/update/remove a corresponding unmanaged InboxItem object. This mechanism creates a temporary child context to process the changes to the invites. With animations disabled, NSManagedObjectContextObjectsDidChangeNotification and NSManagedObjectContextObjectsDidSaveNotification fire at a more frequent rate. Due to the increased workload, it became easier to trigger the crashes we'd seen in production. Because we used a private context to process changes, we had a race condition: the Invite exists when we begin to process the changes, yet is deleted at some point during processing. This results in a crash when we save the child context and Core Data attempts to reconcile changes to a deleted object.

Using a private context to process changes was admittedly a premature optimization, so we opted to use the main queue context instead, thus negating the possibility of a data race. Arbitrarily deleting Invites still makes us uneasy; in the future we may move to a more deterministic approach where the concurrent impact is much easier to reason about.

Next Steps

KIF was clearly not designed with the expectation that some crazy developer might disable animations and expect their tests to instantly run at Ludicrous Speed. Unfortunately, KIF contains many sleeps: places where it must wait for iOS to do its thing. I presume this is primarily because it uses private APIs that were not necessarily intended for functional testing.

A few tweaks were needed to realize a more satisfactory reduction in runtimes. Those changes are available in my fork. I'm very much interested in hearing the KIF authors' thoughts on how we might further reduce the overhead; I'm sure there's a lot of low-hanging fruit to pick.

A Practical Introduction to Testing

Coming to Thumbtack fresh out of Carleton College last summer, I had written about 5 unit tests in my life (and a few of them were in my interviews!). My original approach to testing was to think of the various scenarios a given feature might experience, and try them out manually. After a round of code review? Try them out again. This was slow, boring, painful, and error-prone – but there's a (much, much) better way!

Write automated tests

My mentor here suggested that I watch Misko Hevery's talk about testing: The Psychology of Testing. One of Misko's points was "everyone knows how to write tests, so why don't they?" At the time, I had some idea of how to write tests, but had no idea how to write good tests. What makes a test useful? How is test code different from production code? How should my code change to be more testable?

The following is meant to be a brief introduction to testing – why it matters, and some principles to keep in mind when writing tests.

Why write automated tests?

Let's take a step back and clarify – why are automated tests useful?

  1. Verify functionality: Writing code isn't easy, and you're bound to make mistakes. Writing tests helps ensure that your code is actually doing what you intended it to do.
  2. Prevent regressions: You're working as part of a team with other engineers. Those engineers write code that interacts with your code, and may alter the code you've written – tests ensure that such alterations don't break that code. Others on your team should feel confident that your code still functions correctly if your tests pass.
  3. Improve productivity: Running automated tests is a repeatable task that makes your life and your team's life easier. They're easy to run, and can be automated to run at important times such as pushing new code to master or attempting to deploy new code to production.
  4. Provide documentation: Looking at the tests for a particular piece of code can very clearly outline the expected behavior of that code given some input. A test method named test_returns_404_for_deleted_request makes it easy to quickly identify and understand that code's intended behavior.

Things to keep in mind when writing tests

So you're convinced that testing is a good idea and want to write some tests! Here are some things to keep in mind as you do so.

Test code is still code

Just like production code, test code will be read and maintained by other engineers. It should be just as understandable as production code – documentation is still important! Writing descriptive test names makes it easier for other engineers (and your future self) to understand what's being tested. Try to name tests with a structure similar to test_{expected behavior}_for_{scenario}, such as the example mentioned above (test_returns_404_for_deleted_request).

Test code should be simple

When working in a complex code base, tests will often require a decent amount of "set up" code. Break out common setup into helper methods, and leave test methods as simple as possible. A short test_ method lets the reader of the code focus on what's specific to that test. Similarly, test code should be as linear as possible – as a rule of thumb, try to limit the amount of indentation in a test. if statements, for or while loops, and other control flow add complexity to your test code, and should generally be avoided [0].
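
To make this concrete, here is a minimal, self-contained sketch; the ShoppingCart class, the helper, and the test names are hypothetical, purely for illustration. The shared setup lives in one helper so each test_ method stays short and linear.

import unittest


class ShoppingCart(object):
    def __init__(self, prices):
        self.prices = prices

    def total(self):
        return sum(self.prices)


class ShoppingCartTests(unittest.TestCase):
    def _make_cart_with_items(self, num_items, price_per_item=10):
        # Common setup shared by several tests lives in one helper.
        return ShoppingCart([price_per_item] * num_items)

    def test_returns_sum_of_prices_for_nonempty_cart(self):
        cart = self._make_cart_with_items(3)
        self.assertEquals(cart.total(), 30)

    def test_returns_zero_for_empty_cart(self):
        cart = self._make_cart_with_items(0)
        self.assertEquals(cart.total(), 0)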

Tests should focus on functionality, not implementation details

It's tempting to want to write tests that check every little bit of how your code works. Tests that focus on implementation as opposed to functionality can slow down future engineers. If your code exposes some set of public methods, those are the methods that should be tested; not the way that those methods actually work under the hood. For example, if a class stores some state in a heap, a user of that class shouldn't need to know that. If that class is refactored to use an array instead, your tests shouldn't fail!

A "Change-detector" test is another example of a well-intentioned test that makes your code more difficult to work with in the future. Avoid testing how your code works, and instead focus on what functionality your code provides.

Dependencies should be injected

We love dependency injection here at Thumbtack [1]. There are quite a few reasons to use dependency injection, but let’s consider its benefits for testing with a simple example. Say your code sometimes [2] triggers the sending of an email to a user – in testing, we definitely don't want a real email to be sent! Injecting the dependency of an "email sender" allows us to write a test that ensures our code would send an email in production code, but doesn't actually do so in our test. Creating a “test double” to use instead of a real email sender enables us to verify the functionality of the class we are testing. In this case, we’ll create a “stub” email sender that stores the emails it has “sent”, and check its state after our code has run [3].

class ThingBeingTested(object):
    def __init__(self, email_sender):
        self.email_sender = email_sender

    def do_something_and_possibly_send_email(self):
        # Some logic happens here that we want to test; the real decision
        # is elided, so a placeholder stands in for it.
        should_send_email = True
        if should_send_email:
            self.email_sender.send_email("example email")

Then our test code looks something like:

class FakeEmailSender(object):
    def __init__(self):
        self.emails_sent = []

    def send_email(self, email):
        # Record the email instead of actually sending it.
        self.emails_sent.append(email)


# This test method lives in a unittest.TestCase subclass.
def test_sends_email_in_particular_scenario(self):
    fake_sender = FakeEmailSender()
    thing = ThingBeingTested(fake_sender)
    # ... Any other configuration to set up a scenario where the email should be sent ...
    thing.do_something_and_possibly_send_email()
    # Make sure we "sent" exactly one email
    self.assertEquals(len(fake_sender.emails_sent), 1)

Then in production code, we instead inject a real email sender. We've successfully tested the logic that determines whether or not to send the email, and have left the actual email sending to another object.

Tests should be deterministic

Sometimes code relies on non-deterministic sources of data – third party APIs, random number generators, and time are a few examples of such data sources. How do you ensure that your tests don't spuriously pass or fail based on those outside sources? Dependency injection helps with this, as well.

Similar to our example above, consider a simple class that randomly assigns passengers to seats on an airplane. For the sake of this example, we'll assume that seats are identified by integers, and that seats 8 - 12 are in the exit row. Rather than using an actual random number generator, you can pass in a fake generator that returns some preconfigured value.

class AirplaneSeatAssigner(object):
    EXIT_ROW_SEATS = range(8, 13)  # seats 8 - 12 are in the exit row

    def __init__(self, random_int_generator):
        self.random_int_generator = random_int_generator

    def is_seat_in_exit_row(self, seat_number):
        return seat_number in self.EXIT_ROW_SEATS

    def get_seat_assignment(self):
        seat_number = self.random_int_generator.get_random_int()
        if self.is_seat_in_exit_row(seat_number):
            # Double check that the passenger is OK with an exit row
            # (details elided for this example).
            pass
        return seat_number

class FakeRandomIntGenerator(object):
    def __init__(self, int_to_return):
        self.int_to_return = int_to_return

    def get_random_int(self):
        return self.int_to_return

def test_for_user_assigned_to_exit_row_seat(self):
    exit_row_seat_number = 9
    fake_generator = FakeRandomIntGenerator(exit_row_seat_number)
    assigner = AirplaneSeatAssigner(fake_generator)

    seat = assigner.get_seat_assignment()
    # ... check the expected behavior for an exit row ...

Since we've injected a random number generator that we know will return a certain value, we can test the different scenarios that result from different random numbers. Even better, each time we run our tests, we'll get a consistent result.


Testing is an essential part of the software development process – writing good, useful tests makes your code more reliable, maintainable, and understandable for other engineers. I hope you found something here that helps you write better tests!

Notes and useful resources

[0] The exception to the rule is table-driven tests. This is a common and idiomatic pattern in Go, for example.

[1] See previous posts from Steve and Jeremy. We even found it to be performant!

[2] I've left this example purposefully simple and unrealistic – real test code might require extra setup!

[3] There is a subtle difference between the idea of a “mock” and a “stub” – Martin Fowler has written extensively on this topic. See his “Mocks Aren’t Stubs” article for more explanation.

Other resources

  • Martin Fowler writes a lot of great stuff on the topic of testing – it's worth just browsing his website.
  • Misko Hevery is a great resource, specifically for dependency injection
  • Google has an extensive blog about testing. The "Testing on the Toilet" series is great for providing small pieces of advice in short, simple to understand examples. One of my favorites describes three important qualities to keep in mind when writing tests.

The 12 Weeks of Thumbtack

On the first week of Thumbtack...


The office—I wouldn’t really call Thumbtack my “true love” (yet)—gave me a promise of exciting times to come. The week rushed by with on-boarding meetings and processes (more exciting than you might expect, especially with the words “Friday massages”); an All-Hands meeting, in which all engineers come together and ideas fly around like bullets; many delectable Thumbtack lunches and dinners; and team bonding at the beach—a San Franciscan beach, i.e. a cold, bleak streak of sand at the edge of land, made enjoyable by the warmth of the surrounding company (pardon the pun). And that wasn’t all: I was internalizing the company goals and values, cramming 5+ years of team history and infrastructure into my brain, picking up Go, and spending hours of excitement with my mentor, Alex, in preparation for the launch of Kiki’s (email) delivery service[1].

On the second week of Thumbtack…

Alex asked me, “do you want to present at All-Thumbs tomorrow?” All-Thumbs is a company-wide gathering in which any team or team member can present… in front of everyone. So on my 9th day, I’m holding that microphone, standing in the spotlight, and trying to explain how Kiki will replace our current emailing setup on the website while simultaneously wondering whether the front row audience can hear the pounding of my heart.


Kiki had made good headway: we had spun up several million goroutines to track usage stats, written a preliminary version of the design doc and code, and had the environment all set up and running on AWS. On top of Kiki, I also talked about my work on a few Go packages that all our services now use. In fact, watching these packages imported by services going into production terrified me more than presenting at All-Thumbs: what if I had made a mistake? I could compromise the website’s security or crash our services, even after extensive code review (luckily, neither occurred). Only two weeks in, Thumbtack had thrown me challenges and projects with impacts beyond what I had ever experienced within the safety of school walls.

On the third week of Thumbtack…

I went back to school. I didn't even have to apply this time—I was automatically enrolled in Thumbtack University (my dream school, of course). Suffice it to say, school and lecture are the same no matter where you go... so perhaps 20 hours of “class” didn't make the highlight of my week. I have to admit, though—the material was fascinating, and I came out amazed at how well the internal components of Thumbtack coordinated and worked effortlessly together. This rapid and intense onboarding procedure, while not as fun as my work with Kiki, clarified Thumbtack's mission and made it concrete. So although I was uninspired to attend 20 hours of lecture, I gained more drive to help achieve Thumbtack's ultimate goals. I finally felt that I was a part of Thumbtack.

On the fourth week of Thumbtack...

Make Week kicked off! Make Week is a week in which the entire company emerges from whatever projects are currently underway to pursue their ideas on new features and product improvements that would otherwise be abandoned for more pressing issues. I had heard of companies holding hackathons or hack weeks, so I asked Marco at lunch (casual lunch with the CEO, no big deal) why Thumbtack adamantly stuck to “make” instead of “hack.” His response: “hack” is typically connected with engineering, and Make Week is a week intended for everyone—engineering, marketing, design, and more—to stretch their minds and innovate ways for the company to evolve. To imply that only engineers should participate would be to lose the valuable minds of 2/3 of the company. I found this to be yet another example of the enormous effort at Thumbtack to encourage transparency and communication across the company: no one team ever functions completely separate from another, and transparency works as the oil to keep the internal mechanics of Thumbtack running smoothly.

I took a break from Kiki this week and had the opportunity to pair with several different engineers to work on several Make Week projects of my own, including an internal server for Godocs, a script to automate setting up environments and applications, and a few more Go packages for our services. Finally, I again presented at All-Thumbs, albeit feeling slightly calmer this time. Make Week had been a time of exploration, and I had worked on several projects that would become increasingly important and useful during my internship.

On the fifth week of Thumbtack...

Where there are problems... there are Fix-Its and Thumbtack engineers. It was Fix-It week, the week in which Eng and Product tackled issues previously set aside for more urgent projects over the course of the last term. This included strategizing for the future to prevent predicted problems from ever being born. Although I was involved in planning for my team, AWS, whose general purpose is to migrate all servers to AWS, my involvement with Kiki and our Go skeleton also placed me in ArCo (Architecture Committee), which was to oversee our move to SOA (service-oriented-architecture). Sitting in our meetings, I was struck by the importance of this committee—we were essentially proposing a standard on how and why services could and should be created for years to come. And when Alex mentioned that Kiki would set the example for “best service practices”... well, no pressure. While I lacked the experience of many of the engineers in the room, it was strangely easy to voice my ideas and questions (everything was taken seriously, no matter how naïve I thought I sounded), which made me appreciate the openness of my teammates to new viewpoints and ideas. So on top of creating Kiki, I began documenting the steps to service creation and started a checklist of essential service elements. One of my biggest regrets is that we probably won't be able to complete this monumental task before this internship is over, and I wish I could stay to see the ultimate outcome.

On the sixth week of Thumbtack...

Things went wild. It was the week of the third quarter kickoff—an entire day of celebrating what we had achieved the previous quarter, and of gearing ourselves up to achieve our goals for the quarter ahead.

The presentations foretold the exciting, but simultaneously intimidating, months ahead. To conclude the quarter, we all headed over to a carnival-themed gathering, complete with aerial silk dancers, fortune tellers, donut burgers (I highly recommend them, they were delicious), face painting, and much more. But the excitement didn't stop when we returned home at midnight...

Like most of us, I enjoy reading my phone notifications when I wake up in the morning. Unless it's an email at 7 A.M. from Mark (VP of engineering) reporting:

EMERGENCY: emails queued up, 0 sent in 2 hours.

Guess who was in charge of the emailing services? Yup... I had bloodied my hands with my first emergency. Luckily, the issue was resolved quickly—thankfully before Alex woke up—and with nearly zero impact, but I had learned my lesson well. At Thumbtack, we have a system of postmortems—every emergency is “owned” by one or a few people, and analyzed for what can be done to avoid anything similar in the future. I felt strangely proud to own my first postmortem; I had known before that a careless action could bring the system down, but the reality of it hit me hard. The postmortem now serves as my permanent reminder, like a burn scar that heralds back to a childhood memory of curious fingers playing with fire.

On the seventh week of Thumbtack...

Go Gophers! This week, I flew out with 5 other engineers to Denver, Colorado for the second-ever Gophercon, a conference dedicated to the Go programming language. I won't go into much detail about the 20+ fantastic talks, but I did document all my learnings in this Go wiki, which will hopefully prove useful in the future! I also somehow acquired a ticket to the sold-out Go workshop track, which offered “deep dives” into some of the more advanced (and really cool) features of Go.


Besides the conference's enlightening and intellectual lectures, I got a taste of the Go community, and more broadly, the “coder community” at large. It came as a surprise that I was the only woman in a room of 100+ men during the workshops, and it took significant effort to find another woman in the conference of 1500 attendees (many of whom had long hair). Although the gender balance in engineering isn't quite 50-50 at Thumbtack or at school, I had never before experienced the notorious gender disparities and stereotypes as I did now, such as the automatic assumption that I was attending the “Go Bootcamp,” meant for beginners of Go, rather than the workshops intended for more experienced participants. It didn't make a difference in my learning experience, but admittedly there were some awkward moments, like when a bunch of guys refused to walk through a doorway until I had passed through. The conference had also amusingly assumed that everyone attending Gophercon was above 21—after talking to the organizers about the drink tickets I had received for the after-party at the brewery, I was reassured that future conferences would be more minor-friendly.

Nevertheless, these small bumps during the trip did nothing to lessen my enjoyment of the conference and of Denver in general. My appreciation of Go definitely increased (one engineer jokingly called me a “go-fangirl”), and I spent my free time sampling some of the famous foods, museums, and historical districts of Denver.

On the eighth week of Thumbtack...

Kiki blasted off, sending more than 100 emails a second. We deployed Kiki to send all emails from the engineers' test versions of the website, as well as all the staging emails used in our second office, in Salt Lake City. The usual testing process for new services includes unit tests, integration/end-to-end tests, load tests, and finally an A/B test to ensure that Kiki can handle everything that could possibly go wrong. Thus, we attacked Kiki with 7x more email requests than we currently handle, simulated network failures, and manually triggered panics, all of which (to my surprise) Kiki came through relatively unscathed. While not much new code emerged from the process, Kiki became more refined and robust, nearing its production birthday.

As exciting as email-sending was, more overwhelming were the 58 reporters and media frenzy that greeted me Thursday morning, causing traffic jams on 9th street as Jeb Bush's Uber driver meandered over to Thumbtack HQ (see more on his visit here). It was simultaneously terrifying and reassuring to meet one of the potentially most powerful people in the world—and to realize that even figureheads are human.

On the ninth week of Thumbtack...

Kiki hit a bit of a road-bump. Remember how I was on the architecture team? Well, one of the most argued points had been how to ensure data persistence in the face of network failures or unpredictable hardware failures. Kiki currently had a simple system with goroutines and file storage for saving email data—for every request, Kiki would spawn a new goroutine, write data to file, send the email, and then delete the data, ensuring that, save for very rare occasions, unsent emails would not be completely lost. On top of email sending, we ran a “cron job” type goroutine that pulled unsent data from the file system and resent emails on a set schedule. Potentially the worst feeling in my internship so far was checking this code into a new branch, heading back to master... and deleting it all. The architecture team had agreed (I was slightly reluctant) upon a data persistence system involving queues, a conditional write check to avoid duplicate sends, and a two-tier environment setup. Such a system could be implemented once in our shared library, tested extensively, and then be used by all. Having all our services follow the same patterns would make designing new services and debugging existing ones more efficient, not to mention that long per-service debates over this particular design decision would be avoided. With this pattern, essentially all emails would first be placed in the queue, after which a “worker tier” application would pull from the queue and send the email, using a conditional write to ensure that the email had not yet been sent. The second tier, a “web tier” application, would expose an API to the outside world, allowing our application to respond to more than just requests from the queue, e.g. requests to unsubscribe emails marked as bounces.

Starting over, although I understood the need to do so, was slightly disappointing—Kiki had been so close, and now productionization felt weeks away (weeks I didn't have). However, I now had the opportunity to abstract Kiki to be used for other notification services (SMS and push notifications), as well as organize Kiki's code into a more understandable package setup. I was more than determined to get Kiki back to production-ready by the end of the week, and was able to succeed, implementing the new data persistence design pattern, pulling out Jiji[2], Kiki's new webserver sidekick, into a new web tier environment, and reconfiguring our deployment scripts to work with dual application tiers. I ended a tiring, frustrating, but ultimately rewarding week with a session of rooftop yoga led by Jeremy—an upside-down San Francisco sunrise had never looked so good.

On the tenth week of Thumbtack...

We dark launched Kiki into production, essentially running Kiki in parallel to the current email-sending system so that Kiki could practice sending emails in production without actually sending them to their designated recipients. After the previous week of chaos and non-stop coding to get the refactored Kiki back into shape, I felt like I could stare at Kiki's metrics and dashboard forever and never get bored—it was unbelievable to watch those numbers tick every second and realize that production was actually happening.


And now it was time to experiment! We ran profiling tools to figure out where CPU was used most and tweaked Kiki to perform even better, testing with different network resources, memory resources, and machine models (surprisingly, we were CPU-bound, mainly due to the context switching of goroutines). Kiki was becoming polished—I could now settle down and tidy up loose ends, making Kiki as perfect as possible. I also integrated our push notification system (Lakitu[3]) with Kiki, getting a taste of the mobile team's work and collaborating with their team members (I was tempted for a moment to leave AWS for mobile to obtain one of Thumbtack's iPhones... but decided to remain loyal to Android). It was an incredible feeling to realize that what started as a relatively small summer project—an emailing service—had transformed into something much bigger: a service that would set the standards for all services to come, and that would handle more than 5x the number of requests initially planned. Although push notifications were not yet integrated with the website, I had finished 2/3 of Kiki's final product—all the extra hours of work had definitely been worth it. For emails, what remained was to run an online A/B test with Kiki, to ensure that Kiki worked as well as the current script attached to the website. The week ended with a much needed break—an AWS team celebration of the past quarter's work, complete with a 14-dish, family-style dinner.

On the eleventh week of Thumbtack...

As exciting as spinning up new services and scripting new deployment features had been, this week was a time to visit the past. This meant plowing through fifteen code reviews and modifying seven of our code repos, including those of services that had been untouched for over half a year. We wanted to bring all our older services, such as Hercule, up to the standards by which Kiki now abides. Of course, before doing this, we had to first decide on Kiki itself: should we use flags or environment variables? Should something like ports be configurable to ensure future portability (pardon the pun)? Should we alter the code for readability and clarity, or keep it concise and add documentation instead? How should we track metrics and alert on errors? It was slightly exhausting to code a change, decide that we should remove it, and then decide later to change it back to the original. One thing I learned: if you put a group of highly knowledgeable (and opinionated) engineers in a room and debate a controversial decision, discussions can linger on forever—it's nearly impossible to find a solution that satisfies everyone. Sitting in those committees reminded me of debate tournaments—just when I found myself convinced by one point, someone would highlight the torrent of problems that came with it. The often heated back-and-forths definitely never got boring.

After working on all our services, I also cleaned up our deployment and service resource creation scripts and demoed these for our Engineering and Product teams! Every other week, we have "deep dives" into current projects underway or new procedures/tools that all engineers should know and use. My scripts fit into the latter category, and it was truly awesome to see other engineers across all teams using them—what started as a (slightly selfish) Make Week project to simplify my work setting up AWS environments and enforcing environment standards had turned into a productionized product that reduced what once took hours to minutes, allowing any engineer to create AWS resources without double-checking against Thumbtack's standardized setup configuration or asking someone on the AWS/infrastructure team. Immediately after the demo, I dove into my last, 20-minute All-Thumbs presentation, summarizing my work with Kiki, on the architecture team, and on our deployment scripts! Unlike my first, 3-minute presentation, the nerves had disappeared—and I left the podium (literally) dropping the mic.

On the twelfth week of Thumbtack...

I went off into the land of Big Data, learning how to use Hadoop, Hive, Spark, and more to extract data to analyze for Kiki's A/B test. We started off with 1% of production traffic sent to Kiki, then once metrics from emails sent from Kiki showed no significant deviation from baseline metrics, moved to 10% traffic, then 50%, and eventually to 100%! I paired with some of our data scientists on the Data Platform team—it was like stepping off into another world, leaving AWS to encounter the world of SQL and Hadoop clusters. With this final pairing, I realized I had actually paired with people on every single engineering team during my time here, either helping them with AWS or receiving help myself—I had integrated mobile push notifications into Kiki, dealt a little with our data platform, worked with the matching service, and code-reviewed some of our growth services. But while I had at least skimmed the surface of most of our engineering team and code-base, I had yet to work with designers or product managers—I guess working on our infrastructure and back-end services had to have some cons (although I didn't really mind at all).

I also had the chance to experience first-hand the impact of Thumbtack's product. One of the aspects I appreciate most about carsharing services is the conversation—not only do I get a ride, but I get the chance to have a nice chat along the way! During one of my excursions, I discovered my driver's dream was to open his own restaurant—and he was providing catering services to fundraise and bring his culinary skills out into the public. And so, of course, I brought up Thumbtack! Right before we parted, after I said that I hoped that Thumbtack will help him achieve his goals, he commented, "You must really love your job—I can hear it in the way you talk about it, it's really genuine." He couldn't have been more right. (I've checked back with him, and he's received 100 requests in the four days since he's signed up!) I also signed up as a pro myself for chamber music performance, and got my first hire this week!

After the launch of Kiki, things rapidly came to a close; we ended our internships with a slew of farewell dinners with Marco (CEO), Mark (VP Eng), and the team, which had come to feel like a second family. As always, the culinary team outdid themselves (dining hall food will definitely pale in comparison to this summer's meals, which included "farro risotto, lamb porterhouse, seared NY strip, zucchini a la plancha, blue cheese salad", and good old "buttermilk country fried chicken"). Although we were all sad to leave, the nights were full of our stories of our adventures, blunders, and most importantly, unforgettable learnings from this incredible summer. As we head back to school, we go armed with a new set of tools and experiences, and the knowledge that we made a difference in someone's life this summer.


A little about me—I'm currently a rising junior at Harvard studying computer science and mathematics. To be honest, I can't remember when I first heard about Thumbtack, but I do recall my first intense interview with Alex and encountering the genuine passion of the team to solve the many challenges Thumbtack faces. It's been an unforgettable experience, and I wish I could partake in Thumbtack's bright future ahead. Although my time at Thumbtack is over for now... perhaps yours is just about to start.


[1] Kiki's Delivery Service (魔女の宅急便, Majo no Takkyūbin) is a 1989 Japanese anime produced, written, and directed by Hayao Miyazaki and is based on Eiko Kadono's novel of the same name.

[2] Jiji is Kiki's anthropomorphic pet talking cat and closest companion. (Coincidentally, Jiji has a girlfriend, a white cat named Lily.)

[3] Lakitus are Koopas who ride clouds through the skies, mostly dropping Spiny Eggs on Mario or Luigi.

When Randomization Is Not Enough: Improving Sample Balance in Online A/B Tests

We run a lot of A/B tests at Thumbtack. Because we run so many A/B tests at such a large scale, we want to make sure we run them correctly. One issue we’ve run into when running A/B tests is that a difference could still exist between the test and control groups by chance -- even if we randomize. This causes uncertainty in our online A/B tests. In these cases, the question we need to answer is: if we observe a difference, is it because of the test feature we just introduced, or because the difference is pre-existing?

We propose that for online designed experiments, a proper randomization procedure enables us to attribute an observed difference to the test feature, instead of to a pre-existing difference in the test groups. Implementing this approach has given us greater confidence in our A/B test results than we had previously. We have drawn from the PhD thesis of Kari Frazer Lock in developing this approach to our A/B tests.


The Problem

To illustrate the problem, let’s consider an example. Suppose our engineering team has decided to play a friendly game of tug-of-war after eating their favorite superfoods. To test out which kind of food helps people win, we randomly assign them into two groups. Team A channels their inner Popeye and eats spinach salad, and team B decides to chow down on their superfood of choice: kale. Team A wins and claims that spinach induces superior strength. Is that so?

We all have different heights, weights, inherent strength, etc. Suppose that by chance alone, team B members have an average weight that is 15 pounds lighter than team A's, and all the folks over 6 feet tall ended up on team A. Plus, team A also got 18 engineers and team B got 17. Adding up all the differences, team A ended up with a disproportionate advantage!

Spinach, that wasn't fair!

Variation in Test Units

In online A/B experiments, the test unit is usually a user or visitor. Users differ greatly from one another: some visit very often, some have faster internet, some use mobile exclusively, etc. Because each user interacts repeatedly with our site, we can proactively seek to balance out users based on historical data.

Randomization alone is not enough

Even if we randomize the initial assignment of users between test and control groups, a difference could still exist that is due not to the test feature but to chance alone. This causes uncertainty in our online A/B tests. If we observe a difference, is it because of the test feature we just introduced, or because the difference is pre-existing?

If we have baseline characteristics for test subjects, we can try to balance the test and control groups on these characteristics before we run the experiment so any observed difference can be attributed to the test feature.

Obviously, this is only feasible when we have some information on the test subjects. When Thumbtack was relatively new and most observed interaction between users and the product came from new users, this step could not be done. Now that Thumbtack gets a lot of repeat visits, we can strive to balance out experiments in a way we weren’t able to do previously and thus get more accurate measures which give us more confidence in our conclusions.

The chance that at least one of the test groups has different baseline characteristics rises as variation among test units increases and as the number of samples in each group declines (e.g. when we test multiple variants in a single experiment).

The Solution V1: A/A Tests

One natural solution is to run an A/A test on the test groups before the test feature is introduced. For example, if we will roll out a feature next month, we can assign the users according to some rule into two groups and measure their metrics in last month's data. In the month before our test, the test and control groups should have no pre-existing difference; these are thus "A/A" tests, as opposed to "A/B" tests. Using our tug-of-war example, an "A/A" test would be a game played with both teams on the same diet.

Historically, this is how randomized experiments have been done in the biomedical field. In any published paper on such studies, the very first section establishes that there is no existing imbalance (the famous "table 1"). And if there is any, it can still be accounted for in downstream statistical analysis.

For any web-facing company running online A/B tests, A/A tests are a relatively cheap step in the right direction. When an A/A test shows a pre-existing imbalance, i.e. the test "fails", we should take caution in interpreting the A/B test result. Depending on the severity of the imbalance, we can choose to ignore it, statistically adjust for it, or re-run the experiment.

But surely there has to be a better way than waiting to see whether an "A/A" test fails.

The Solution V2: Repeated Re-randomization

In online A/B tests, test units are usually assigned according to their id, and a random seed ensures each experiment uses a different randomization. Usually a hash function, say SHA1, takes the seed and the user id and turns them into an integer. These integers are then split into test groups.
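
As a rough sketch (the function name and the modulo split are illustrative assumptions, not our production code), the bucketing might look like this:

import hashlib


def assign_group(user_id, seed, num_groups=2):
    # Hash the seed together with the user id so that each experiment
    # (i.e. each seed) yields a different, but deterministic, assignment.
    digest = hashlib.sha1(('%s:%s' % (seed, user_id)).encode('utf-8')).hexdigest()
    return int(digest, 16) % num_groups  # e.g. 0 = control, 1 = test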

We can repeatedly compute A/A test results until we have found a split where all A/A tests are flat. This step greatly reduces the chance of a failed A/A test on the pre-experiment period.

It turns out to be quite simple in theory:

  1. Randomly select a seed and randomize test subjects by this seed.
  2. Run the A/A test on all metrics of interest.
  3. If the A/A test fails on any dimension, discard the seed and go back to step 1.

This way we end up with a seed that balances the test subjects. The procedure can go through anywhere from tens to thousands of seeds before finding a balanced one; how many depends on how many baseline characteristics we want to balance on and on the amount of variation among subjects.


In theory it is possible for our historical metrics to be correlated in such a way that the repeated procedure takes unreasonably long to find a proper assignment. In practice, we keep an upper bound M on the number of trials, and we track all seeds along with their corresponding minimum p-values across all the baseline characteristics. If the procedure fails to find a seed that sufficiently balances the treatment variants to the pre-specified thresholds within M steps, we return the best result for a human to judge. This guarantees the procedure has a stopping point.
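
For concreteness, here is a sketch of that search, reusing the assign_group function from the sketch above; the two-sample t-test on each baseline metric, the p-value threshold, and the helper names are illustrative assumptions rather than our exact implementation:

import random

from scipy import stats


def min_aa_p_value(assignment, baseline_metrics):
    # baseline_metrics maps user_id -> list of pre-experiment metric values.
    p_values = []
    num_metrics = len(next(iter(baseline_metrics.values())))
    for i in range(num_metrics):
        group_a = [m[i] for uid, m in baseline_metrics.items() if assignment[uid] == 0]
        group_b = [m[i] for uid, m in baseline_metrics.items() if assignment[uid] == 1]
        p_values.append(stats.ttest_ind(group_a, group_b)[1])
    return min(p_values)


def find_balanced_seed(baseline_metrics, p_threshold=0.2, max_trials=1000):
    best_seed, best_min_p = None, -1.0
    for _ in range(max_trials):
        seed = random.randrange(2 ** 32)
        assignment = {uid: assign_group(uid, seed) for uid in baseline_metrics}
        min_p = min_aa_p_value(assignment, baseline_metrics)
        if min_p > p_threshold:
            return seed  # every A/A test is flat; use this seed
        if min_p > best_min_p:
            best_seed, best_min_p = seed, min_p
    # No seed passed within max_trials; return the best candidate for a
    # human (a data scientist) to judge.
    return best_seed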

What does the human judge do? Based on domain knowledge and business priorities, this human (a data scientist at Thumbtack) can choose to re-run the procedure, decide whether the best of the M results is good enough, or further scrutinize metric computation and selection via offline analysis.

Waterproof solution?

But does this procedure guarantee perfect balance in our test and control group every time?

No. It only minimizes imbalance as best we can. Observations from random variables are inherently noisy, for both endogenous and exogenous reasons. Potential sources of residual imbalance include:

  • Existing users can change their behavior, independent of our test feature.
  • Within-company changes, e.g. an ad campaign could start and affect regions in only one of the test groups.
  • Another team could start an experiment in the next month that inadvertently and partially overlaps with our experiment.
  • A subset of users shows strong seasonal difference, e.g. snow plowing and yard work.
  • Externalities, e.g. competitor that targets a certain segment could show up and affect the whole marketplace.
  • New users may sign up during the test period. We cannot balance new visitors; for them we can rely only on randomization, and thus imbalance could occur.

The solution we have developed is not waterproof. However, implementing it in our A/B tests has given us greater confidence in our A/B test results than we had previously.

Empirical Results via Simulation

To illustrate, we simulate three metrics X, Y and Z, measured over N users, from a multivariate normal distribution with pre-specified mean and variance-covariance structure, randomly split the users into two groups, and test for a difference, i.e. perform "A/A" tests. In the following examples, we assume a total of 50,000 users and two equal-sized variants; we simulate 100 rounds and count the number of false positives for each metric.
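
A rough sketch of one version of this simulation for the independent case (Case 1 below) follows; the zero means, identity covariance, and the use of a two-sample t-test are illustrative choices, not the exact setup we used:

import numpy as np

from scipy import stats

N_USERS, N_ROUNDS, ALPHA = 50000, 100, 0.05


def count_false_positives():
    false_positives = np.zeros(3, dtype=int)
    for _ in range(N_ROUNDS):
        # Simulate metrics X, Y, Z for every user, then split at random.
        metrics = np.random.multivariate_normal(
            mean=[0.0, 0.0, 0.0], cov=np.eye(3), size=N_USERS)
        shuffled = np.random.permutation(N_USERS)
        group_a = metrics[shuffled[:N_USERS // 2]]
        group_b = metrics[shuffled[N_USERS // 2:]]
        for i in range(3):  # one "A/A" test per metric
            p_value = stats.ttest_ind(group_a[:, i], group_b[:, i])[1]
            false_positives[i] += p_value < ALPHA
    return false_positives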

Case 1: independent normal

When X, Y, and Z are independent, we expect p-values from the "A/A" test to follow a uniform (0, 1) distribution. It is then trivial to compute the expected number of false positives when we repeat the simulation K=100 times: roughly 5 significant in X, 5 in Y, and 5 in Z. Indeed, we observe 2 in X, 6 in Y, and 4 in Z.

The delta and its 95% confidence interval clearly show that, after re-randomization, the "A/A" test produces much better balanced groups.

Case 2: independent metrics, one log normal

Of course, rarely are we so lucky as to have all normally distributed metrics. So now, let's change things up by making Y follow a log-normal distribution. In terms of false positives before re-randomization, there were 5 in X, 6 in Y and 6 in Z. After re-randomization, everything is well balanced.

Case 3: independent discrete values

What if our metric value is discrete? Let's check by making Z a discrete variable. There were 6 false positives in X, 3 in Y and 6 in Z.

Case 4: Correlated metrics

Finally, we investigate correlation between the metrics X, Y, and Z. It is trivial to derive the expected number of false positives; we leave that, along with why it is OK to use a z-test in all of the above situations, as an exercise for the reader. Here, as an arbitrary choice, X and Y are moderately positively correlated with a correlation coefficient of 0.5, X and Z are mildly positively correlated with a coefficient of 0.2, while Z has a mildly negative correlation with Y (-0.1). With X and Y positively correlated, they had 11 and 6 false positives respectively, while Z had 6.


In all of the cases above, it is clear that such a procedure improves balance, and can help us draw better inference in subsequent A/B tests.


PyCon 2015: We Make You Work for Your Swag

Thumbtack t-shirts

This year, we sent 20 members of the Thumbtack team to PyCon in Montreal. We all had a great time, learned lots, and really made a name for ourselves. By the end of the conference, everyone knew who we were and that Thumbtack enables you to get your personal projects done.

We also had great swag: a comfy t-shirt, sunglasses, and a beer glass. However, unlike most other booths, we didn’t give it away for free. We wanted the PyCon attendees to work for it! For the third year in a row, we created a code challenge that engineers would have to correctly write up in Python to receive anything. At first, submissions slowly trickled in, but by the end of the conference, people were really excited to solve our problem. Some people didn’t even talk to us, just walked to our booth, picked up the challenge sheet, and walked away. In total, we got 87 submissions! And now, the beer our winners drink out of those glasses will taste a little sweeter because it’s flavored with sweet, sweet victory.

Our Challenge

When I was little, my family went to our town’s district math night. We came back with a game that we still play as a family. The game is called Inspiration. It’s played with a normal deck of cards, with the picture cards taken out. Everyone gets four cards and one card is turned face up for everyone to see. You then have to mathematically combine your four cards with addition, subtraction, multiplication, and division to get the center card. The person who does it the fastest wins.

This year, our challenge was inspired by Inspiration, no pun intended. The first part asked people to write a Python program that takes in four numbers and determines the mathematical expression that can combine the first three numbers to get the fourth. If they could solve this, they were awarded a t-shirt and sunglasses. The harder challenge was to solve the same problem, but with an arbitrary number of inputs. The number to solve for was always the last number in the string, but the total number of operands was not constant. These solvers won the coveted Thumbtack beer glass.
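
To give a flavor of the problem, here is a brute-force sketch of the sort of solution we were looking for; it permutes the operands and tries every operator sequence, evaluating strictly left to right (handling operator precedence and other parenthesizations is left to the reader). The function name and the left-to-right evaluation are simplifying assumptions for illustration, not the reference solution.

import itertools
import operator

OPS = {'+': operator.add, '-': operator.sub,
       '*': operator.mul, '/': operator.truediv}


def solve(numbers):
    # numbers[:-1] are the operands; numbers[-1] is the target card.
    operands, target = numbers[:-1], numbers[-1]
    for perm in itertools.permutations(operands):
        for ops in itertools.product(OPS, repeat=len(operands) - 1):
            total, expression = perm[0], str(perm[0])
            for op, value in zip(ops, perm[1:]):
                try:
                    total = OPS[op](total, value)
                except ZeroDivisionError:
                    break
                expression = '(%s %s %s)' % (expression, op, value)
            else:
                if abs(total - target) < 1e-9:
                    return expression
    return None


print(solve([4, 5, 6, 2]))  # e.g. ((4 + 6) / 5)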

Hall of Fame

Most of the solutions had some commonalities: they used brute force, and they used Python's built-in itertools library to create permutations of the numbers and combinations with replacement of the operators. The following solutions were my favorites:

Greg Toombs had the shortest solution, with only 19 lines of code. You can find Greg on LinkedIn.

Joshua Coats had the best-commented solution. He definitely made me chuckle. You can find Joshua on GitHub and Twitter.

Robbie Robinson had one of the cleanest solutions. You can find Robbie on LinkedIn.

Thanks to everyone who submitted a solution! Can't wait for PyCon next year!
