A primer on Python decorators

Python allows you, the programmer, to do some very cool things with functions. In Python, functions are first-class objects, which means that you can do anything with them that you can do with strings, integers, or any other objects. For example, you can assign a function to a variable:

>>> def square(n):
...     return n * n
>>> square(4)
16
>>> alias = square
>>> alias(4)
16

The real power from having first-class functions, however, comes from the fact that you can pass them to and return them from other functions. Python’s built-in map function uses this ability: you pass it a function and a list, and map calls your function individually for each item in the list you gave it. (In Python 2, map returns a new list directly; in Python 3 it returns a lazy iterator, so we wrap it in list to see the results.) Here’s an example that uses our square function from above:

>>> numbers = [1, 2, 3, 4, 5]
>>> list(map(square, numbers))
[1, 4, 9, 16, 25]

A function that accepts other function(s) as arguments and/or returns a function is called a higher-order function. While map simply made use of our function without making any changes to it, we can also use higher-order functions to change the behavior of other functions.

For example, let’s say we have a very expensive function that we call a lot:

>>> def fib(n):
...     "Recursively (i.e., dreadfully) calculate the nth Fibonacci number."
...     return n if n in [0, 1] else fib(n - 2) + fib(n - 1)

We would like to save the results of this calculation, so that if we ever need the value for the same n again (which happens very often, given this function’s call tree), we don’t have to repeat our hard work. We could do that in a number of ways; for example, we could store the results in a dictionary somewhere, and every time we need a value from fib, we first see if it is in the dictionary.

But that would require us to reproduce the same dictionary-checking boilerplate every time we wanted a value from fib. Instead, it would be convenient if fib took care of saving its results internally, and our code that uses it could simply call it as it normally would. This technique is called memoization (note the lack of an ‘r’).
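
To make that boilerplate concrete, here is a minimal sketch of what caller-side caching might look like. The names fib_results and report_fib are hypothetical, invented just for this illustration:

```python
def fib(n):
    "Recursively (i.e., dreadfully) calculate the nth Fibonacci number."
    return n if n in [0, 1] else fib(n - 2) + fib(n - 1)

# Hypothetical caller-side cache: every call site has to repeat this check.
fib_results = {}

def report_fib(n):
    if n not in fib_results:
        # Only the top-level result is saved; the recursive calls
        # inside fib still redo all of their work.
        fib_results[n] = fib(n)
    return 'fib(%d) = %d' % (n, fib_results[n])

print(report_fib(20))  # computed the slow way
print(report_fib(20))  # second call is just a dictionary lookup
```

Note that this approach only helps repeated top-level calls. The recursive calls that fib makes to itself never look at the dictionary, which is another reason to want the caching to live with the function rather than with its callers.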

We could build this memoization code directly into fib, but Python gives us another, more elegant option. Since we can write functions that modify other functions, we can write a generic memoization function that takes a function and returns a memoized version of it:

def memoize(fn):
    stored_results = {}

    def memoized(*args):
        try:
            # try to get the cached result
            return stored_results[args]
        except KeyError:
            # nothing was cached for those args. let's fix that.
            result = stored_results[args] = fn(*args)
            return result

    return memoized

This memoize function takes another function as an argument, and creates a dictionary where it stores the results of previous calls to that function: the keys are the arguments passed to the function being memoized, and the values are what the function returned when called with those arguments. memoize returns a new function that first checks to see if there is an entry in the stored_results dictionary for the current arguments; if there is, the stored value is returned; otherwise, the wrapped function is called, and its return value is stored and returned back to the caller. This new function is often called a “wrapper” function, since it’s just a thin layer around a different function that does real work.

Now that we have our memoization function, we can just pass fib to it to get a wrapped version of it that won’t needlessly repeat any of the hard work it’s done before:

def fib(n):
    return n if n in [0, 1] else fib(n - 2) + fib(n - 1)
fib = memoize(fib)

By using our higher-order memoize function, we get all the benefits of memoization without having to make any changes to our fib function itself, which would have obscured the real work that function did in the midst of the memoization baggage. But you might notice that the code above is still a little awkward, as we have to write fib three times in the above example. Since this pattern – passing a function to another function and saving the result back under the name of the original function – is extremely common in code that makes use of wrapper functions, Python provides a special syntax for it: decorators.

@memoize
def fib(n):
    return n if n in [0, 1] else fib(n - 2) + fib(n - 1)

Here, we say that memoize is decorating fib. It’s important to realize that this is only a syntactic convenience. This code does exactly the same thing as the above snippet: it defines a function called fib, passes it to memoize, and saves the result of that as fib. The special (and, at first, a bit odd-looking) @ syntax simply cuts out the redundancy.
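
To see what the decorator buys us, here is a small sketch that counts how many times the function body actually runs with and without memoization. The calls dictionary and the names plain_fib and fast_fib are made up for this illustration:

```python
def memoize(fn):
    stored_results = {}

    def memoized(*args):
        try:
            return stored_results[args]
        except KeyError:
            result = stored_results[args] = fn(*args)
            return result

    return memoized

calls = {'plain': 0, 'memoized': 0}  # hypothetical counters for illustration

def plain_fib(n):
    calls['plain'] += 1
    return n if n in [0, 1] else plain_fib(n - 2) + plain_fib(n - 1)

@memoize
def fast_fib(n):
    calls['memoized'] += 1
    return n if n in [0, 1] else fast_fib(n - 2) + fast_fib(n - 1)

plain_fib(20)
fast_fib(20)
print(calls['plain'])     # 21891
print(calls['memoized'])  # 21
```

Because the name fast_fib refers to the wrapper, the recursive calls inside the function body also go through the cache. That is why the body runs only once for each value of n from 0 to 20.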

You can stack these decorators on top of each other, and they will apply in bottom-up fashion: the decorator closest to the function definition is applied first. For example, let’s say we also have another higher-order function to help with debugging:

def make_verbose(fn):
    def verbose(*args):
        # will print (e.g.) fib(5)
        print('%s(%s)' % (fn.__name__, ', '.join(repr(arg) for arg in args)))
        return fn(*args) # actually call the decorated function

    return verbose

The following two code snippets then do the same thing:

@memoize
@make_verbose
def fib(n):
    return n if n in [0, 1] else fib(n - 2) + fib(n - 1)

def fib(n):
    return n if n in [0, 1] else fib(n - 2) + fib(n - 1)
fib = memoize(make_verbose(fib))
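
The order of stacking matters. As a rough sketch, the snippet below records messages in a list (a hypothetical stand-in for printing, so the result is easy to inspect) and compares the two possible orders. The names fib_a, fib_b, and printed are invented for this example:

```python
printed = []  # stands in for the terminal so the output is easy to inspect

def memoize(fn):
    stored_results = {}
    def memoized(*args):
        if args not in stored_results:
            stored_results[args] = fn(*args)
        return stored_results[args]
    return memoized

def make_verbose(fn):
    def verbose(*args):
        printed.append('%s(%s)' % (fn.__name__, ', '.join(repr(arg) for arg in args)))
        return fn(*args)
    return verbose

@memoize
@make_verbose
def fib_a(n):
    return n if n in [0, 1] else fib_a(n - 2) + fib_a(n - 1)

fib_a(3)
fib_a(3)  # served from the cache, so nothing new is recorded
memoize_outside = list(printed)

del printed[:]

@make_verbose
@memoize
def fib_b(n):
    return n if n in [0, 1] else fib_b(n - 2) + fib_b(n - 1)

fib_b(3)
fib_b(3)  # still recorded, because verbose runs before the cache is consulted
verbose_outside = list(printed)

print(memoize_outside)       # ['fib_a(3)', 'fib_a(1)', 'fib_a(2)', 'fib_a(0)']
print(len(verbose_outside))  # 6
```

With memoize on the outside, cached calls never reach the verbose wrapper, so each argument is reported once. With make_verbose on the outside, every call is reported, including cache hits, and the messages name the wrapper ('memoized') rather than the original function.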

Interestingly, you’re not restricted to simply writing a function name after the @ symbol: you can also call a function there, letting you effectively pass arguments to a decorator. Let’s say that we aren’t content with simple memoization, and we want to store the function results in memcached. If we’ve written a memcached decorator function, we could (for example) pass in the address of the server as an argument:

@memcached('127.0.0.1:11211')
def fib(n):
    return n if n in [0, 1] else fib(n - 2) + fib(n - 1)

Written without decorator syntax, this expands to:

fib = memcached('127.0.0.1:11211')(fib)
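
We haven’t shown the memcached decorator itself, so as a stand-in, here is a minimal sketch of how a decorator that takes an argument can be structured. The name memoize_in and the use of a plain dictionary as the store are assumptions made for illustration; a real memcached-backed version would talk to a server instead:

```python
def memoize_in(store):
    """Return a decorator that caches results in the given dict-like store."""
    def decorator(fn):
        def memoized(*args):
            # Include the function name so several functions can share a store.
            key = (fn.__name__, args)
            if key not in store:
                store[key] = fn(*args)
            return store[key]
        return memoized
    return decorator

shared_cache = {}  # hypothetical in-memory store standing in for a cache server

@memoize_in(shared_cache)
def fib(n):
    return n if n in [0, 1] else fib(n - 2) + fib(n - 1)

print(fib(10))            # 55
print(len(shared_cache))  # 11
```

The outer function receives the argument and returns the actual decorator, which in turn receives the function. That extra layer is what makes the double call in the expanded form work.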

Python comes with some functions that are very useful when applied as decorators. For example, Python has a classmethod function that creates the rough equivalent of a Java static method:

class Foo(object):
    SOME_CLASS_CONSTANT = 42

    @classmethod
    def add_to_my_constant(cls, value):
        # Here, `cls` will just be Foo, but if you called this method on a
        # subclass of Foo, `cls` would be that subclass instead.
        return cls.SOME_CLASS_CONSTANT + value

Foo.add_to_my_constant(10) # => 52

# unlike in Java, you can also call a classmethod on an instance
f = Foo()
f.add_to_my_constant(10) # => 52

Sidenote: Docstrings

Python functions carry more information than just code: they also carry useful help information, like their name and docstring:

>>> def fib(n):
...     "Recursively (i.e., dreadfully) calculate the nth Fibonacci number."
...     return n if n in [0, 1] else fib(n - 2) + fib(n - 1)
...
>>> fib.__name__
'fib'
>>> fib.__doc__
'Recursively (i.e., dreadfully) calculate the nth Fibonacci number.'

This information powers Python’s built-in help function. But when we wrap our function, we instead see the name and docstring of the wrapper:

>>> fib = memoize(fib)
>>> fib.__name__
'memoized'
>>> fib.__doc__

That’s not particularly helpful. Luckily, Python includes a helper function that will copy this documentation onto wrappers, called functools.wraps:

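import functools

def memoize(fn):
    stored_results = {}

    @functools.wraps(fn)  # copy fn's name and docstring onto the wrapper
    def memoized(*args):
        try:
            return stored_results[args]
        except KeyError:
            result = stored_results[args] = fn(*args)
            return result

    return memoized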

There’s something very satisfying about using a decorator to help you write a decorator. Now, if we were to retry our code from before with the updated memoize, we see the documentation is preserved:

>>> fib = memoize(fib)
>>> fib.__name__
'fib'
>>> fib.__doc__
'Recursively (i.e., dreadfully) calculate the nth Fibonacci number.'

Thumbtack is hiring engineers! Come work with us on making it easy to hire service professionals, and enjoy our in-house chef and sweet San Francisco office.

Gambling with the devil: A/B tests done right

“Designing an experiment is like gambling with the devil: Only a random strategy can defeat all his betting systems.” (R. A. Fisher)

Abba preview

These days just about everyone does some form of A/B testing to optimize pages. But as Ronald Fisher knew, A/B testing is loaded with traps, and the only way to avoid them is through careful use of randomization and statistics.

There are plenty of free tools out there that make A/B testing easy and accessible, but not all tools are created equal. Google’s Website Optimizer is one of the most complete and polished, designed to gather events directly from your site and display results in a nice report. GWO is a great way to get started with A/B testing and is how we ran our first few tests at Thumbtack.

Pretty soon, however, we started to outgrow the limited event gathering and reporting features of GWO. As data-driven decision making runs strong in our DNA, we decided to develop our own event tracking and reporting system. One component of this system was an A/B test report inspired by GWO.

As the months have gone by, we’ve come to realize just how valuable this tool is, and naturally we wanted to share it with the world. As such, we’ve packaged the code up into a nice, reusable JavaScript library with a demo app that lets you enter your test results and get a handy report. It’s simple to use, runs entirely in the browser, lets you pass links to others, and uses some nifty statistics under the hood.

So give Abba a spin, check out the source, and let us know what you think! The demo page has a full FAQ where you can find plenty of details about how to interpret the report and how everything works under the hood.

http://www.thumbtack.com/labs/abba/

How we got people to earn our schwag

This year, Thumbtack was one of the sponsors of the PyCon conference. Our sponsorship got us a booth in the conference’s expo hall, and hence the opportunity to tell people what we’re all about.

Everybody with a booth wants to give visitors something to take home, which inevitably leads to the tide of mediocre schwag that barrages people at tech conferences. We certainly wanted to have something to offer, but didn’t want to be lost in the fray of t-shirts, stickers, and flyers.

Our solution: Bring something cool, and convince people to write some code before they get it.

Thumbtack's fine glassware

For the conference, we ordered some high-quality beer mugs and shot glasses with our logo on them. The “high-quality” part of that is important – you never know when you’ll be placed across from a booth also offering shot glasses (as we were!). But our neighbor’s glasses looked like they would shatter if you set them down too eagerly after a shot, whereas our glasses have a satisfyingly thick base, giving you the confidence to slam them down (and that they might survive your flight home).

Instead of giving away our glasses, we asked our visitors to first complete a fun programming challenge. We tried solving a few candidate problems before the conference, until we found one that all of our engineers could solve in Python in about ten minutes.

The winning problem was this: Write a program that, given a Connect Four board represented as a two-dimensional JSON array, outputs which player has won the game, or “No winner” if nobody has won. We printed up this challenge in more detail, and handed out the papers from our booth.
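
For a sense of the problem’s size, here is one straightforward way it might be approached. This is only a sketch, not any of the submissions we received, and it makes several assumptions the summary above doesn’t spell out: the board is a non-empty rectangular list of rows, empty cells are 0, players are the numbers 1 and 2, and the exact wording of the winning message is our own invention.

```python
import json

def winner(board, run=4):
    """Return the player who has `run` pieces in a row, or None."""
    rows, cols = len(board), len(board[0])
    # Directions to scan from each cell: right, down, down-right, down-left.
    directions = [(0, 1), (1, 0), (1, 1), (1, -1)]
    for r in range(rows):
        for c in range(cols):
            player = board[r][c]
            if not player:
                continue  # empty cell (assumed to be 0)
            for dr, dc in directions:
                end_r, end_c = r + (run - 1) * dr, c + (run - 1) * dc
                if not (0 <= end_r < rows and 0 <= end_c < cols):
                    continue  # the run would fall off the board
                if all(board[r + i * dr][c + i * dc] == player for i in range(run)):
                    return player
    return None

def describe(board_json):
    result = winner(json.loads(board_json))
    return 'No winner' if result is None else 'Player %s wins' % result

example = '''[
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0],
    [0, 0, 1, 2, 0, 0, 0],
    [0, 1, 2, 2, 0, 0, 0],
    [1, 2, 2, 1, 0, 0, 0]
]'''
print(describe(example))  # Player 1 wins
```

In the example board, player 1 wins along the diagonal that climbs from the bottom-left corner.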

It turns out that “Interested in a little coding challenge?” is a great hook to bring people over to your booth. (People who go to programming conferences often like to program.) Once people got the gist of the question, they were very likely to ask, “So, what does Thumbtack do?”, giving us an opportunity to interest them in more than just our glassware.

We estimate that we gave out about 300 copies of the problem, and got 51 solutions – not a bad conversion rate.

Personally, I’m very happy with how our little experiment went. Instead of lazily offering shirts to anyone who happened to walk over, we gave people a reason to engage with us, and got some people very interested in our mission. I guarantee you that everyone who got Thumbtack schwag at PyCon remembers talking to us, and remembers what we do. How many other small companies at the conference can say that?

Hall of Fame

The challenge in solving the Connect Four problem is not coming up with an algorithm, but coming up with a clean expression of that algorithm in code. Here are some of the best and most interesting submissions we received (all reproduced here with the permission of their authors).

My favorite submission overall came from Sam Merritt, who not only implemented Connect Four in n dimensions, but also wrote one of the best-factored solutions:

A number of people looked at our problem and immediately interpreted it as a matrix problem, and many submissions we got make use of NumPy and/or SciPy. There were a number of good solutions in this category, but one of the shortest comes from Renzo Sanchez-Silva:

Conversely, we had a few intrepid people who looked at our problem and apparently thought, “I can solve that using regular expressions”. Unsurprisingly, these solutions had the highest chance of not passing our test suite, but my favorite working regexp submission comes from Dan Callahan, who would like me to disclaim that he was trying for a convoluted solution.

We had a few Python core contributors stop by our booth, and I made (what I thought was) an offhand remark to one of them that I was still waiting for someone to solve the problem with one big generator expression. I clearly underestimated Łukasz Langa, because some time later, we received this:

Finally, in an elegant argument both for and against significant whitespace, we got a solution from Alex Lewin as a 216-character Perl one-liner:


Shameless plug: There’s another way to get a Thumbtack beer mug – come work here! Thumbtack is making it easy to hire any service online, and we’re hiring engineers to help us build an awesome product.

Food rules for startups: eight delicious ways to build a better company

At Thumbtack, we make food a priority. It’s amazing what eating does to bring people together. Our team feels like a big family, and this is in large part due to the fact that we share most of our meals: four days a week we eat lunch together around a big table in the center of the office, and once a week we all sit down for a big family-style dinner.

Often startups try to attract talented teammates by offering benefits like ping-pong tables, video games, or gym memberships. While those things are valuable, a culture of good food is an order of magnitude more important. Sharing meals around quality food builds an environment that encourages collaboration and celebrates excellence. The team is excited to come to work because they value and respect the full work environment. We believe every company can benefit from a food-centric culture.

Many of the ideas we have about food are based on the work of UC Berkeley professor and food writer Michael Pollan. His book Food Rules documents some interesting, if old-school, ways to think about food. He avoids writing about specific diets or nutritional fads. Before nutritional scientists started writing about cholesterol and calories, people used different guidelines to decide what and how to eat.

At the risk of sounding too Bay Area, these older “rules” can lead to more holistic concepts of meals, nourishment, and health. Rules come in the form of axioms and old wives’ tales: “better to pay the grocer than the doctor,” “eat your colors”, or “the whiter the bread the sooner you’ll be dead.” The new edition, released late last year, features some great illustrations by Maira Kalman and inspired much of this blog post.

With that, we’d like to present the food rules we’ve come to adopt at Thumbtack.

Rule #1 – Eat lunch together around a table.

Eating lunch together is the single most important culture-building activity we do. This rule has three distinct and equally valuable parts.

a) Eat lunch. At a basic level, food is fuel. Your team needs to eat so they have raw energy to make awesome things.

b) Together. A team that eats together learns, connects, develops friendships, and collaborates more. Joel Spolsky writes about meals at Fog Creek:

The importance of eating together with your co-workers is not negotiable, to me. It’s too important to be left to chance. That’s why we eat together at long tables, not a bunch of little round tables. That’s why when new people start work at the company, they’re not allowed to sit off by themselves in a corner. When we have visitors, they eat together with everyone else.

c) Around a table. This is also Pollan’s rule #58:

No, a desk is not a table. If we eat while we’re working, or while watching TV or driving, we eat mindlessly — and as a result eat a lot more than we would if we were eating at a table, paying attention to what we’re doing. This phenomenon can be tested (and put to good use): Place a child in front of a television set and place a bowl of fresh vegetables in front of him or her. The child will eat everything in the bowl, often even vegetables that he or she doesn’t ordinarily touch, without noticing what’s going on. Which suggests an exception to the rule: When eating somewhere other than at a table, stick to fruits and vegetables.

Rule #2 – Have a weekly all-hands dinner.

We’ve tried other nights, but we really like Wednesday nights for dinner. People tend not to have conflicting evening plans on Wednesdays, and the midweek perk of a delicious dinner helps break the hump-day doldrums.

At dinner, crack some beers and open a bottle of wine. Encourage your team to relax, stop working for a little while, and get to know each other even better. Celebrate what you’ve accomplished that week. Conversation inevitably comes back to the work you’re all doing; don’t worry when that happens, as you’ll have amazing ideas late in the evenings that (sometimes) turn out to be worthwhile.

After some wine, your engineers might try to argue that the Ballmer Peak is a real thing. You should humor them, but under no circumstance should you let them hit that “deploy” button.

Rule #3 – Hire a chef.

We mean it. Get an office with a big kitchen where your chef can work, and buy all the kitchen gadgets, pots, and pans that your chef wants. Make your chef happy and you will receive incredible food. This will become a point of pride for your company. Our chef Thea was trained at Le Cordon Bleu and has been part of the team for almost three years. You’ll be so happy with your chef, you’ll write blog posts about how great it is.

If you can’t hire a chef, you should hire a caterer to provide regular, healthy meals. We’ve had good luck with ZeroCater, and it’s likely a good option if your office is in Silicon Valley. If it’s not, Thumbtack can help you find a caterer no matter where you run your business.

If you think you can’t afford it: think hard about how much efficiency you’re losing by not facilitating interaction in your workplace. When your team members go out for lunch, they’re distracted and have to pay their own money and think about what to order and how much to pay. Spolsky writes that it’s a manager’s job to take away all the pains of everyday life so engineers can focus on what they’re good at: engineering. Take away the hassle of finding food.

Rule #4 – Invite guests.

Having awesome meals at your office means people will want to visit you. This is a great way to network, grow awareness about your company, and learn from all sorts of people you wouldn’t know otherwise. If you’re trying to recruit, lunch is a great way to entice new candidates and have them meet your team without the need for formal interviews. It’s also a great excuse to have people over who might not yet know they want to work for you.

One of my colleagues at Thumbtack makes it a priority to invite someone new for lunch every day of the week: this always brings something unique to the conversation and we always end up learning something we didn’t know. We make a point to invite our investors to lunch so they can get to know the team and provide feedback on the business. We have had the occasional celebrity guest to mix things up. I can’t divulge them all, but my personal favorite was Kevin Kelly.

Rule #5 – Don’t buy vending machines.

Vending machines are the easy answer for providing food for your employees. But the things that make vending machines good also make them bad. It’s great that packaged food has a long shelf-life, but it’s also indicative of food that’s packed with preservatives and lacking actual nutrients, not to mention flavor. This kind of food encourages your team to eat alone, at their desks, any time of day. Pollan’s rules are “Don’t eat anything that won’t eventually rot” and “If it came from a plant, eat it; if it was made in a plant, don’t.”

But don’t think this means you should shy away from decadent food. Pollan also writes, “eat all the junk food you want as long as you cook it for yourself.” Thumbtack often indulges in fried chicken, juicy steak dinners, bread puddings, and chocolate tortes.

Rule #6 – Provide sane breakfasts.

“Don’t eat breakfast cereals that change the color of the milk” (food rule #36). A hearty breakfast has so many benefits: at a basic level it provides fuel for the morning’s work. Breakfast also prevents metabolic highs and lows that can (let’s be honest) really affect the mood and productivity of your team. Let’s also be honest and admit that your engineers probably won’t be starting work until noon, and breakfast may not be the most important meal of their day. All the more reason to get lunch right, and meanwhile make sure there’s still a good breakfast for all those marketing and business development folks who tend to come in on the early side.

Rule #7 – Coffee, tea, and espresso are good.

Caffeine is clearly an aid to concentration, inspiration, and productivity. You might try to argue against this, but the reality is that your colleagues will be chugging the stuff and you should learn to understand them.

At Thumbtack, we buy the best beans from local roasters like Blue Bottle, Sightglass, or Four Barrel. We purchased a Nespresso machine that instantly brews delicious single-serving espressos or Americanos. Our in-house tea drinkers place a weekly order for green tea. Although we also have a few Cokes in the fridge, we intentionally don’t make soda a priority.

Rule #8 – Eat food, not too much, mostly plants.

This rule is taken straight from Michael Pollan’s book In Defense of Food. “Eat foods, not nutrients,” Pollan writes. “Stay out of the middle of the supermarket.” Thumbtack’s chef makes all our meals from scratch, starting with fresh, seasonal, and often local ingredients. She works hard to build balanced meals that would make the food pyramid jealous. We stay full, stay healthy, and stay at the office.

And really, eating good, real, fresh food is just better for you and your team any way you look at it. Your doctor will be happy. Your health insurance premium will be happy. You will be happy.

This point is a really big deal for Pollan. “Don’t eat anything your grandmother wouldn’t recognize as food.” “Don’t eat anything with more than five ingredients, or ingredients you can’t pronounce.” There are many foods we eat at Thumbtack that can be difficult to pronounce—bo ssam, mee goreng, cioppino—but they are all made fresh that day.

Conclusion

We think a good culture of food can be the #1 driver of company culture. Work hard, eat well. Everything else is just icing on that cake.

Want to stop by and taste some delicious home-cooked meals? Find us on Twitter. We love to meet new people. Our office is near Powell St. BART in San Francisco.

Also, did I mention we’re hiring? Eat awesome food with us every day! Apply here.

SEO Tip: Titles matter, probably more than you think

As a preface, I want to mention that this post is not any sort of secret formula for SEO. The only way to succeed at SEO is to deliver relevant content to a user, and if you don’t do that you are not going to succeed.

However, as with most things, there are always opportunities for small optimizations.  As part of our culture of testing at Thumbtack, we recently decided to test an often overlooked part of a webpage and see if it would impact search traffic: the title tag.  We did this because whatever you put in your title tag is often what search engines will make the headline of your listing on a search results page.  Here is what a basic result for one of our pages looks like:

Basic Search Listing

If you visit that page, you will see that what we have in the title tag is exactly what Google has chosen to put as the title of the listing on the search result page.  Great!

The Test:

We decided to A/B test 3 different variations of the title tag and see what impact it would make.  Here are the variations we chose (with location and service type substituted for each specific page):

  • Looking for the best House Cleaning Services in San Francisco? (baseline)
  • Get Free Quotes Today From House Cleaning Services In San Francisco (quotes first variation)
  • House Cleaning Services in San Francisco – Get Free Quotes Today (quotes last variation)

To make sure the test would have enough data, we opted a few thousand pages into each bucket, then set them loose.  Our pages are indexed somewhat often, so we felt fairly confident that the results within a week or two would be significant; i.e., if there was no change in traffic within the first two weeks, we could be confident that was because the buckets performed equally, not because the changes had not been picked up.

The Results:

Getting results for a test like this is challenging because search traffic has a lot of variance in it.  Not only do you have to deal with differing traffic based on the day of the week, but in the background your search traffic might be going up and down from things outside of your control (e.g., algorithm changes).  So, to control for variance in search traffic, instead of getting our results from raw hit numbers, we looked at the ratio of hits from our experimental titles to hits from our baseline title.
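
As a rough sketch of that normalization, the snippet below divides each variant’s daily hits by the baseline’s hits for the same day.  The function name, the bucket names, and the numbers are all made up for illustration; they are not data from our test.

```python
def relative_traffic(daily_hits, baseline='baseline'):
    """For each day, divide every variant's hits by the baseline's hits."""
    ratios = []
    for day in daily_hits:
        base = float(day[baseline])
        ratios.append({
            name: hits / base
            for name, hits in day.items()
            if name != baseline
        })
    return ratios

# Made-up numbers purely for illustration; NOT data from the actual test.
daily_hits = [
    {'baseline': 1000, 'quotes_first': 1000, 'quotes_last': 990},  # busy weekday
    {'baseline': 600, 'quotes_first': 420, 'quotes_last': 480},    # quiet weekend day
]

for day in relative_traffic(daily_hits):
    print(sorted(day.items()))
```

Raw hits fall for every bucket on the quiet day, but the ratio only moves when a variant performs differently from the baseline.  That is the property that makes it useful when overall traffic is noisy.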

The key dates on the graph are October 7th when we launched the test, and October 14th when we ended the test.

SEO Title Test Result

As you can see, the result came quickly and painfully. The alternate variations underperformed the baseline by 20-30%.  Remember, nothing else on the pages changed: not the content, not the H1 tag, not the meta description; the only change on the page was the wording between the <title> and the </title>.

After letting the test run for a week and with the results indisputable, we reverted the titles to the baseline and hoped the traffic would return to normal levels.  And as you can see in the above graph, for the most part the traffic did come back.  We clearly had made an SEO mistake, but once corrected and re-indexed by Google, there did not appear to be a lingering punishment.

Analysis:

The title tag is important. When a user is looking at a search results page, the first thing they look at to decide if they are going to visit your site is the title that Google presents, which more often than not is your title tag.  If your title is a good, concise title that matches the searcher’s intent, the searcher is more likely to click through to your page.  You must remember that your SEO funnel does not start when a user visits your site from a search engine; it starts when a user sees your result on a search engine result page.

When we started to analyze exactly why the titles underperformed so badly, the first thing we did was look at how the titles looked on the search results.  We were surprised to see that the listings looked like this:

New Google Search Result Listing

Google completely ignored our <title> tag and instead created a composite title, constructed via an unknown method.  For example, the phrases “House Cleaning San Francisco” and “| Thumbtack” did not appear anywhere on the page, yet that was the title on the SERP.  So then the question became, did our search volume drop because our rankings went down or because people were clicking less?  In a way these are connected long term in that we expect that Google would demote results that do not get clicked often, but in the short term we don’t believe this would make a difference.  So we checked the ranks of a few of the pages, and the rankings seemed relatively stable.  We believe the drop in traffic was mainly due to going from a (relatively) good title to a bad title, and that was worth 20-30% of search clicks.

Then we wondered why Google chose to ignore our new titles.  Our guess is that Google found the new title to be a little fishy due to the usage of the words free or today, and the algorithm decided to replace it instead of polluting their search results.  From our perspective we felt the title was very accurate, in that we do offer people free quotes on service jobs in a timely manner, but from an algorithmic perspective we understand that it would be difficult to differentiate that at scale.

Tips:

  1. Is Google using your <title>s?  Search for your pages in Google and see what shows up as the title of your listing.  Is it the <title> of your page?  If not and you feel it is worse than your title, try to figure out why Google isn’t using your title.  Make sure the keywords in the title are relevant to your page and you avoid spammy looking words.
  2. Test your titles.  Try to come up with a few variations of your titles and test them out on your site.  If your site gets indexed relatively often, you should be able to see results relatively quickly.  Remember: this isn’t about increasing your ranking, this is about getting the most out of the rankings you have.  You are leaving traffic on the table if you don’t have a great title.
  3. Monitor your bounce rates.  Increasing the CTR of your search results at the expense of a higher bounce rate is a bad trade-off.  Google tracks when people bounce from your page, and if people are bouncing a lot that is a negative signal that your listing is not what the user is searching for, and will hurt you in the long run.  Make sure your title doesn’t mislead users into clicking.
————–
Are you an engineer?  Thumbtack is an awesome place to work and is hiring – check out our job listing here.

Googlebot makes POST requests via AJAX

Googlebot is constantly evolving to better capture the web’s content. Over the past few years we’ve seen Googlebot submit GET forms and execute JavaScript. But we’ve always taken it for granted that Googlebot would never execute a POST request, nor would any other well-behaved web crawler.

We were wrong about that. Recently, we started observing Googlebot making POST requests to thumbtack.com. As far as we can tell, such requests have not been openly observed before. These Apache access log excerpts show a few examples:

66.249.71.47 - - [04/Sep/2011:04:53:52 +0000] "POST /act/site/clienterror HTTP/1.1" 200 36 "http://www.thumbtack.com/ma/malden/dog-walking/dog-walking-and-pet-care-services" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

66.249.72.198 - - [25/Sep/2011:04:27:50 +0000] "POST /act/site/clienterror HTTP/1.1" 200 36 "http://www.thumbtack.com/ca/solana-beach/wedding-photographers/photography-cary-pennington-photography" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

66.249.72.207 - - [04/Oct/2011:09:53:08 +0000] "POST /act/site/clienterror HTTP/1.1" 200 36 "http://www.thumbtack.com/tx/san-antonio/painting/residential-commercial-construction-services" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
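
If you want to look for similar requests in your own logs, something along these lines might work as a starting point. It is a sketch that assumes log lines in the same format as the excerpts above; the pattern and the function name are our own and have not been tested against other log configurations.

```python
import re

# Matches lines shaped like the excerpts above: IP, two unused fields,
# timestamp, request line, status, size, referer, and user agent.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def googlebot_posts(lines):
    """Yield parsed entries for POSTs whose user agent claims to be Googlebot."""
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match and match.group('method') == 'POST' and 'Googlebot' in match.group('agent'):
            yield match.groupdict()

sample = (
    '66.249.71.47 - - [04/Sep/2011:04:53:52 +0000] '
    '"POST /act/site/clienterror HTTP/1.1" 200 36 '
    '"http://www.thumbtack.com/ma/malden/dog-walking/dog-walking-and-pet-care-services" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)

for entry in googlebot_posts([sample]):
    print(entry['ip'], entry['path'])  # 66.249.71.47 /act/site/clienterror
```

A user agent string is easy to fake, so a match here only identifies candidates, not confirmed Googlebot traffic.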

We’ve verified the requests are coming from real Google crawler IP addresses:

$ dig -x 66.249.71.47 +short
crawl-66-249-71-47.googlebot.com.
$ dig crawl-66-249-71-47.googlebot.com. +short
66.249.71.47
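
The same reverse-then-forward lookup can be scripted. The following is a minimal Python sketch, not the check we actually ran: the function name is made up for illustration, the resolvers are passed in as parameters so they can be swapped out, and the only hostname suffix it accepts is the googlebot.com one seen in the output above.

```python
import socket


def is_googlebot_ip(ip, reverse=socket.gethostbyaddr, forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check for an address claiming to be Googlebot."""
    try:
        hostname = reverse(ip)[0]            # PTR lookup, like `dig -x <ip>`
    except OSError:
        return False
    hostname = hostname.rstrip(".")
    # Assumption: only the googlebot.com suffix from the log excerpt is accepted.
    if not hostname.endswith(".googlebot.com"):
        return False
    try:
        addresses = forward(hostname)[2]     # A lookup, like `dig <hostname>`
    except OSError:
        return False
    # The name must resolve back to the address we started from.
    return ip in addresses
```

Checking the forward lookup matters because anyone who controls the reverse zone for their own address space can publish a PTR record naming any host they like.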

The source of the requests is our client-side JavaScript error tracking code, which installs a global JavaScript error handler and attempts to POST to our server when unhandled errors are detected on the client. The requests from Googlebot include traceback information, so it appears the code was genuinely executed and not simply parsed to extract links.

Now, this isn’t necessarily harmful behavior. In discussing request safety, RFC 2616 sec. 9.1.1 states:

The important distinction here is that the user did not request the side-effects, so therefore cannot be held accountable for them.

In this case, the JavaScript code makes an unprompted POST request upon page load, not resulting from any user action. One might say that the request fits the above definition and is therefore safe, regardless of the request method. We conclude simply that this is an interesting new feature of Googlebot and one that webmasters should be aware of.

Visualization candy: the making of a realtime geo-dashboard

Thumbtack service directory map

“The most efficient graphic constructions are those in which any question, whatever its type and level, can be answered in a single instant of perception, that is, in a single image.”
- Jacques Bertin, 1983, Semiology of Graphics

Since Thumbtack’s inception, we have aimed to service every city, town, and neighborhood across the country. Unlike other startups that begin in one city and build out to other cities, our goal has been to grow quickly by being available in every town and city from the start. We want to be the most comprehensive and local services directory, whether you’re in Los Angeles, California or Monowi, Nebraska*.

Scanning our access logs and database tables gave us a strong inkling that we were reaching people all around the country. But creating maps gave us definitive visual proof that we were achieving our goals on the national level. The map above shows the geographic breadth of service professionals who have listed their businesses on Thumbtack. We’re proud to say it looks much like a map of cities at night, and it also shows us where we can focus more of our attention.

More than a cumulative snapshot of where Thumbtack has been up to a given point in time, we wanted to see a more dynamic portrait of Thumbtack and its geography. To help us out, we built a realtime mapping toolkit called Rotary Maps.

Realtime visualization


Our implementation shows us how users are interacting with Thumbtack: page views, email clicks, form submissions, etc. Because all these actions are tracked as event documents in MongoDB, they can be easily queried ad-hoc.

MaxMind’s excellent GeoIP database helps us determine where events are happening, so latitude and longitude coordinates are embedded into each event document. Querying Mongo for the geographic coordinates of specific event types is therefore trivial. We then render visualizations with Rotary Maps. The Rotary Maps library merges RaphaelJS vector drawing with Google Maps. This makes it possible to use custom vector icons to show different event types while using AJAX calls to refresh our maps with the latest data.

Open source treats

If you want to implement your own Rotary Map, we’ve released the JavaScript source under an MIT license. Check out Rotary Maps on GitHub. This is an early release, but we hope it will be useful.

The following example shows the latest USGS earthquake data.

Looking at your data through a geographic lens is always interesting, and often insightful. We hope you’ll find Rotary Maps useful for your own visualizations, and happy geo-hacking!


* We do, in fact, have services available to Elsie Eiler, the sole inhabitant of the nation’s smallest city.

Three string functions every PHP project needs

At Thumbtack, we do most of our work in PHP and Python.  Our website is written in PHP, and much of our behind-the-scenes code is written in Python.  Because we are constantly working in both languages, we often run into features of one language that we wish we had in the other.  In particular, there were three features of Python that we really wanted in PHP: string slicing, string startswith, and string endswith.  We recently wrote equivalent functions in PHP and added them to our source tree. Since then they have been used many times, saving time and making our code more readable.

Feel free to use any or all of these functions in your own project.
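
For readers less familiar with Python, the three behaviors being mirrored look like this in Python itself. This is only an illustration of the Python side, not the PHP functions, and the sample string is made up.

```python
s = "thumbtack.com"

# String slicing: substrings by position, with negative indexes counting from the end.
assert s[:9] == "thumbtack"
assert s[-4:] == ".com"
assert s[1:-1] == "humbtack.co"

# startswith / endswith: prefix and suffix tests without manual offset arithmetic.
assert s.startswith("thumb")
assert s.endswith(".com")
assert not s.endswith(".org")
```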

Building our own tracking engine with MongoDB

There are few things we love more than understanding how people are using Thumbtack. Whenever possible, we use direct interactions with our users to learn about their experience. We perform usability tests, in which members of our design team will sit down with a user, ask them to do something on Thumbtack, and watch how they accomplish it. We call users to ensure they’re having a good experience, and ask them for ways they think we can improve. And users who have trouble with a part of the site will contact us directly.

These interactions provide us with great information on a micro scale: the experience of one user using one particular part of our site. But, in order for us to decide where to focus our design attention, and where to run these time- and labor-intensive usability tests, we need good data on a macro scale about what is happening on Thumbtack.

But we found that there was no single place we could turn to that had all the data we needed. We use Google Analytics to track how users come into our site and move from page to page, but it (and similar tools) works mostly on page views, and defines its own concept of unique users and sessions. This makes it poorly suited for learning about:

  • interactions between users, where an action taken by one will influence the other,
  • long-running interactions that span different browser sessions, and possibly different computers with different tracking cookies, and
  • interactions that use email instead of the website, which Google Analytics is completely blind to.

We could partially reconstruct some of those interactions by looking at our database, but we couldn’t tie the reconstructed data back to the things that Google Analytics did do well: providing information about where users came from, how long they spent on certain tasks, and how they navigated the site.

The most important interactions on Thumbtack are the ones that Google Analytics is not a good fit for. So, we decided to build an analytics system that natively works with arbitrary events instead of page views, and lets us link together events that happen across sessions, or outside the browser entirely.

Events

No matter how complex it is, any interaction on Thumbtack can be decomposed into one or more discrete events: individual actions taken by a user, or by Thumbtack itself. Whenever an interesting event happens, we record information about it to a database.

For example, when we match a request for a service with a service provider who could fulfill it, we send the provider an email notification. When that happens, we create an event record with this data (rendered as YAML):

Each event record can contain complex metadata, making our event collection an analytics treasure trove. We can (and do) use these email/send events to monitor email volume, and to keep a log of the emails we send to each user. But the real power of these events lies in their ability to be correlated with other events that are part of the same interaction.

When the service provider that got the “new request” email clicks the link inside it to the new request, we create another event record:

By correlating the earlier email/send events with request/view events that have a matching email ID, we can start building what we call a pathway: a series of steps that a user goes through to complete a larger task. (If you’re familiar with the term funnel tracking, this is a broader concept: not all of our pathways have one single goal at the end.) When we correlate all the events that make up a pathway, we get a view of what happened on every step. We can see how long it takes people to complete each step, how they interact with each step, and if there are any steps where people appear to give up. And, because we use persistent database ID’s to correlate events, we can do so perfectly, even if a user performs a step a month after starting the pathway, or from a different computer.
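
To make the idea of correlating events into a pathway concrete, here is a small Python sketch. It is an illustration only: the event dictionaries, their field names (`type`, `email_id`, `time`), and the helper names are assumptions made for this example, not the schema or code we use.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical, simplified event records; the field names are assumptions.
events = [
    {"type": "email/send", "email_id": "e1", "time": datetime(2011, 3, 1, 9, 0)},
    {"type": "email/send", "email_id": "e2", "time": datetime(2011, 3, 1, 9, 5)},
    {"type": "request/view", "email_id": "e1", "time": datetime(2011, 3, 1, 9, 45)},
]


def build_pathways(events, steps=("email/send", "request/view")):
    """Group events by email ID, keeping the earliest event for each step."""
    by_email = defaultdict(dict)
    for event in events:
        if event["type"] not in steps:
            continue
        current = by_email[event["email_id"]].get(event["type"])
        if current is None or event["time"] < current["time"]:
            by_email[event["email_id"]][event["type"]] = event
    return dict(by_email)


def time_to_view(pathway):
    """Time between sending the email and the first view, or None if incomplete."""
    sent, viewed = pathway.get("email/send"), pathway.get("request/view")
    if sent is None or viewed is None:
        return None
    return viewed["time"] - sent["time"]


pathways = build_pathways(events)
for email_id, pathway in sorted(pathways.items()):
    print(email_id, time_to_view(pathway))
```

Because the join key is a persistent ID rather than a browser session, the same grouping works whether the view happens minutes or weeks after the email was sent.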

Recording interface

Thumbtack now tracks 120 different types of events, including page views, logins, profile updates, messages, searches, and website deploys. In order to do that much tracking manageably, we need a simple code interface for recording events.

The above request/view event is generated by this PHP statement:

The track_http_event function is a helper for recording events that happen in response to an HTTP request. It transparently adds all the common data we want to record about HTTP events, which is all the data you see under http, source, and identifiers. For events that happen in background processes, there is a basic track_event function that only records the event type, time, and additional data passed in as an array.
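
The split between the two helpers can be sketched as follows. This is a Python analogue written for illustration, not our PHP code: the argument order, the shape of the `request` dictionary, and the fields placed under `http`, `source`, and `identifiers` are all assumptions, and an in-memory list stands in for the database.

```python
from datetime import datetime, timezone

EVENTS = []  # stand-in for the event collection; a real system would insert into a database


def track_event(event_type, data=None):
    """Record only the event type, the time, and any extra data passed in."""
    event = {"type": event_type, "time": datetime.now(timezone.utc)}
    event.update(data or {})
    EVENTS.append(event)
    return event


def track_http_event(request, event_type, data=None):
    """Add the common HTTP fields, then defer to track_event."""
    # The contents of these three sections are hypothetical examples.
    common = {
        "http": {"method": request["method"], "path": request["path"]},
        "source": {"ip": request["ip"]},
        "identifiers": {"user_id": request.get("user_id")},
    }
    common.update(data or {})
    return track_event(event_type, common)


# Usage: one call from a request handler, one from a background process.
track_http_event(
    {"method": "GET", "path": "/request/123", "ip": "203.0.113.7", "user_id": 42},
    "request/view",
    {"email_id": "e1"},
)
track_event("website/deploy", {"revision": "abc123"})
```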

Event recording is completely ad-hoc: there is no central registry of event types, or of the fields that must be present on events. This makes it harder to see what all the event types are, and what data they carry; but in return, it’s extremely easy for any engineer to begin recording something interesting. All they need to do is add one function call. We deal with the ad-hocness by querying our database to find out what event types have been used, and by using grep or ack to find where in our codebase they get recorded.

Storage

Early on in the design process of our tracking system, it became apparent that a relational database would be a poor choice for storing events. With their rigid table schemas, relational databases are the opposite of the flexible, ad-hoc system we wanted. We wanted a database that would let us dump arbitrary objects into it and then query them easily for analysis.

MongoDB fits that description nicely. For readers not familiar with it, Mongo is a document database that lets you store, retrieve, and modify arbitrarily-structured JSON objects. It’s similar to a traditional SQL database in that it works with discrete records that you can insert, query, update, and delete, but it doesn’t enforce a schema on the records you insert, and lets them be arbitrarily complex. And, importantly, it’s mature, and has corporate backing.

Events naturally fit into Mongo’s document model, since they are self-contained, structured objects. Not having to specify a schema for events is a big win, since different events that we track use wildly different sets of fields. We also often add or remove fields as our data needs change, and it’s extremely convenient to be able to simply make that change on the frontend without needing to perform an expensive ALTER TABLE, or worry about data migrations. And in contrast to other document databases like CouchDB, Mongo lets you find documents by writing a query (something our business people can do), instead of writing a pair of map-reduce functions (something they cannot).

Mongo also has one performance aspect that makes it very attractive to use for tracking. Most of the tracking we do occurs in our server-side controller code, after a page has been rendered, but before it is sent to the browser. It’s imperative that the tracking process doesn’t impact the performance of our website, so our tracking calls must return as quickly as possible.

Luckily, the MongoDB protocol gives client code full control over the trade-off between speed and guaranteed consistency. In the default mode of the PHP driver, inserts are simply fired off over TCP, and the insert call returns as soon as the message is received by the MongoDB host computer. It does not wait for any response from MongoDB itself indicating whether or not the insert actually worked. (Compare PostgreSQL, whose default behavior is to not return from an implicit or explicit commit until the modified data has actually been written to disk, which is very expensive.)

For tracking, this is exactly what we want. Adding tracking to our site had no measurable effect on page load times, and the safety we’re giving up in return is very small. Since the database isn’t doing any validation of our events, the only likely reason an insert would fail would be some sort of systemic problem affecting our MongoDB server, which would show up in our monitoring systems. (If MongoDB does go down entirely, our website tracking code simply logs a warning, and continues handling the incoming request.)
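
The “log a warning and carry on” behavior can be sketched in a few lines. This is a hedged Python illustration rather than our PHP tracking code: `safe_track` is a made-up name, and `insert` is a hypothetical callable standing in for whatever the database driver exposes.

```python
import logging

logger = logging.getLogger("tracking")


def safe_track(insert, event):
    """Try to record an event; never let a tracking failure break the request."""
    try:
        insert(event)
        return True
    except Exception:
        # Tracking is best-effort: note the failure and let the page render anyway.
        logger.warning("tracking unavailable, dropped event %r", event.get("type"), exc_info=True)
        return False
```

The design choice is that a lost event is cheap while a failed page load is not, so the tracking path swallows its own errors and leaves detection to monitoring.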

Conclusion

Overall, we’ve been very happy with our tracking system since launching it in November, 2010. It’s recorded 15GB of data, and over the past 30 days, over 200,000 events per day on average (2.4 events per second). It’s enabled us to report more useful metrics to ourselves, our users, and our investors; to decide where our time is best spent making improvements; to accurately measure the impact of design changes to our website; and even to find the causes of bugs.

Graph of events recorded per day since January 2011

But we’ve glossed over how we actually correlate related events, and perform useful analysis on them. In upcoming posts, we’ll describe the system we built to do event correlations in near-real time, and produce reports from the data we collect.

How database replication helps me sleep at night

Early on Christmas morning, around 2 A.M., I was on vacation with my family in St. Thomas.  I had just fallen asleep when I was woken by a phone call.  It was my coworker Steve. “The site is down,” he said. I pulled out my laptop, checked the website, and confirmed it. This is a spot everybody in operations has been in before, and I immediately had that sinking feeling in my stomach and assumed the worst: all the hard drives had spontaneously exploded and we had complete data loss. Of course we take regular backups every few hours, but rebuilding our production server and restoring it from backup would not be a fun task, especially on slow Atlantic island internet.

So first I needed to assess the damage, and attempted to SSH into the box.  Success! A bash prompt is a beautiful sight when your website is down. The machine was working, all data was intact, but the filesystem had become read-only, causing the database to crash and the website to break. The filesystem had become read-only because one of the drives in our RAID-1 had crashed. The fix was quite simple: we just kicked the filesystem back into read-write mode, and everything was back up in a few minutes. We rebuilt the RAID array in the background, and we were back up and running at full strength within a few days. However, I had trouble forgetting that sinking feeling, and when I got back to work on Monday, increasing the redundancy of our data was the top priority. My goals were the following:

  • Minimize the data loss from losing our database server
  • Be able to recover from losing our database server in under an hour

Option 1: Increase frequency of database backups

The first option we looked at was increasing the frequency of database backups. Previously we had taken backups every 6 hours, but a possible solution would be to increase the frequency of these backups to hourly. Pros:

  • In the worst case we lose one hour of data.
  • In case of DROP TABLE x accidents, we always have recent backups

Cons:

  • In order to restore from a failure, we would need to do a full download and restore of our database on a fresh database machine. In the best case this could take ~20 minutes, in the worst case it could take much longer.
  • If we want to keep multiple backups, we will need a lot of storage space
  • Full database backups are expensive. Every row of the database must be read into memory, not only causing a lot of disk I/O but also messing with our page caches.
  • We will use a lot of bandwidth to continually transfer our database backups off-site, or need to use expensive redundant storage from our hosting provider

Option 2: Use “file-based log shipping” database replication

At the time we had been using Postgres 8.3, which offered file-based log shipping replication. This means that you run one master database and multiple standby databases, and as the master database finishes writing its write-ahead log files, the log files are stored where all the standby servers can access them. The standby servers then load the files, read and replay the operations, and they contain a relatively up to date version of the master database. If the master database fails, you can promote one of your standby servers to the master server, and you are back up and running. Pros:

  • In the worst case we lose X MB of write data, where X is the size of our write-ahead log files
  • In case of master database failure, promotion of a standby machine can happen very quickly (<5 minutes)

Cons:

  • Since data is only replicated after a full write-ahead log file has been completed, in cases of low write traffic the standby machines could be many minutes to hours behind
  • The standby machine is a “warm standby,” meaning it cannot be queried against and can only be used if it is promoted to the master
  • You must have a shared location where logs are shipped so they can be read by standby machines, such as an NFS mount.
  • In case of DROP TABLE x accidents, there is no recovery

Option 3: Use “streaming replication” database replication

Postgres 9.0, the next major release of Postgres, included a new feature called streaming replication.  Like file-based log shipping, it works by having the standbys read the write-ahead log; but instead of applying the log entries in large batches from completed log files, they connect to the master and receive the entries as they happen. In addition, the capabilities of the standby machines were upgraded in this release, going from “warm standby” machines to “hot standby” machines. This new feature means that the standby machines can be queried and act as read-only replicas of the master database. In addition, because the standby machines are replaying transaction data from the master database, the standby machines are guaranteed to always have a consistent database state. Pros:

  • In the worst case we lose X seconds of data, where X is the replication delay between master and standby machines
  • Hot standby machine can be used as a read-only slave
  • In case of master database failure, promotion of standby machine can happen very quickly (<5 minutes)
  • In case of master database failure, fallback to read-only mode can happen instantly by our frontend machines using the standby machine

Cons:

  • In case of DROP TABLE x accidents, there is no recovery

Option 4: Best of all worlds

Thumbtack uses a combination of Option 1 and Option 3. We upgraded our database from Postgres 8.3 to Postgres 9.0 mainly so we could use streaming replication. Currently, we have our master database and one standby machine, which is used heavily as a read-only slave for things like analytics queries. In addition, we take multiple regular full backups of our database every few hours and store them off-site, so in case a bad query gets run on our database or there is a catastrophic datacenter issue, we always have fresh backups.

Of course, you cannot rely on anything unless you monitor it, and replication only works if it doesn’t break down. Thus, I wrote a daemon which queries the standby and master machines to determine the replication lag and exports that information to Munin.  This not only allows us to view the replication lag over time, but also lets us set alerts if the replication lag gets too high.  Here is a graph of the replication lag over the last week:

As you can see, the replication lag between the machines is no more than 5 seconds, which is more than acceptable for our needs. For those really paranoid about data loss, Postgres 9.1 is going to include a feature known as synchronous replication, in which a database commit will not return as a success until it is committed by both the master and the standby machines, thus effectively reducing the replication lag between servers to 0.  However, this comes at a performance cost, since all of your commits must be shipped to two places instead of one. For our needs, the higher performance of asynchronous replication is worth the possibility of losing a few seconds of data.
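
Returning to the monitoring daemon: one way to put a number on replication lag is to compare write-ahead log positions, for example the result of pg_current_xlog_location() on the master against pg_last_xlog_replay_location() on the standby. The Python sketch below is an assumption-laden illustration, not the daemon described above. It measures lag in bytes of log rather than in seconds, the function names are made up, and it treats a location as a single 64-bit number, which should be read as an approximation on older Postgres releases.

```python
def parse_xlog_location(location):
    """Convert a location string such as '16/B374D848' into a single integer."""
    high, low = location.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def replication_lag_bytes(master_location, standby_location):
    """Bytes of write-ahead log the standby has not yet replayed (never negative)."""
    lag = parse_xlog_location(master_location) - parse_xlog_location(standby_location)
    return max(lag, 0)


# Usage with made-up locations; a real daemon would fetch these from both servers.
print(replication_lag_bytes("16/B374D848", "16/B374D000"))
```

A value like this can be exported to a graphing tool on a schedule, with an alert threshold chosen from what the graph shows during normal operation.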

As I learned from the Christmas day incident, machine failures are going to happen at the worst times, but thanks to streaming replication and regular backups I know that the next time a machine fails we are going to be well prepared to recover with minimal data loss, and that helps me sleep at night.