There are few things we love more than understanding how people are using Thumbtack. Whenever possible, we use direct interactions with our users to learn about their experience. We perform usability tests, in which members of our design team will sit down with a user, ask them to do something on Thumbtack, and watch how they accomplish it. We call users to ensure they’re having a good experience, and ask them for ways they think we can improve. And users who have trouble with a part of the site will contact us directly.
These interactions provide us with great information on a micro scale: the experience of one user using one particular part of our site. But, in order for us to decide where to focus our design attention, and where to run these time- and labor-intensive usability tests, we need good data on a macro scale about what is happening on Thumbtack.
But we found that there was no single place we could turn to that had all the data we needed. We use Google Analytics to track how users come into our site and move from page to page, but it (and similar tools) works mostly on page views, and defines its own concept of unique users and sessions. This makes it poorly suited for learning about:
- interactions between users, where an action taken by one will influence the other,
- long-running interactions that span different browser sessions, and possibly different computers with different tracking cookies, and
- interactions that use email instead of the website, which Google Analytics is completely blind to.
We could partially reconstruct some of those interactions by looking at our database, but we couldn’t tie the reconstructed data back to the things that Google Analytics did do well: providing information about where users came from, how long they spent on certain tasks, and how they navigated the site.
The most important interactions on Thumbtack are the ones that Google Analytics is not a good fit for. So, we decided to build an analytics system that natively works with arbitrary events instead of page views, and lets us link together events that happen across sessions, or outside the browser entirely.
No matter how complex it is, any interaction on Thumbtack can be decomposed into one or more discrete events: individual actions taken by a user, or by Thumbtack itself. Whenever an interesting event happens, we record information about it to a database.
For example, when we match a request for a service with a service provider who could fulfill it, we send the provider an email notification. When that happens, we create an event record describing the send.
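As an illustrative sketch (the field names here are hypothetical, not Thumbtack's actual schema), such a record, rendered as YAML, might look like:

```yaml
type: email/send
time: 2011-05-01T09:00:00Z
email:
  id: abc123            # persistent ID, usable to correlate later events
  template: new_request
  to_user_id: 12345
request_id: 42
```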
Each event record can contain complex metadata, making our event collection an analytics treasure trove. We can (and do) use these email/send events to monitor email volume, and to keep a log of the emails we send to each user. But the real power of these events lies in their ability to be correlated with other events that are part of the same interaction.
When the service provider who got the “new request” email clicks the link inside it to view the new request, we create another event record.
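Again as a hypothetical sketch (field names are ours), the view event might carry the email's persistent ID, which is what lets the two records be joined:

```yaml
type: request/view
time: 2011-05-01T14:30:00Z
email_id: abc123        # matches the ID on the earlier email/send record
request_id: 42
session_id: s1
```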
By correlating the earlier email/send events with request/view events that have a matching email ID, we can start building what we call a pathway: a series of steps that a user goes through to complete a larger task. (If you’re familiar with the term funnel tracking, this is a broader concept: not all of our pathways have one single goal at the end.) When we correlate all the events that make up a pathway, we get a view of what happened on every step. We can see how long it takes people to complete each step, how they interact with each step, and whether there are any steps where people appear to give up. And, because we use persistent database IDs to correlate events, we can do so perfectly, even if a user performs a step a month after starting the pathway, or from a different computer.
Thumbtack now tracks 120 different types of events, including page views, logins, profile updates, messages, searches, and website deploys. In order to do that much tracking manageably, we need a simple code interface for recording events.
Each request/view event is generated by a single PHP statement. The track_http_event function is a helper for recording events that happen in response to an HTTP request; it transparently adds all the common data we want to record about HTTP events, including user and request identifiers. For events that happen in background processes, there is a basic track_event function that only records the event type, time, and any additional data passed in as an array.
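Thumbtack's helpers are PHP, but a schematic Python equivalent (with an in-memory list standing in for the MongoDB collection, and all names hypothetical) might look like:

```python
import time

event_log = []  # stands in for the MongoDB "events" collection

def track_event(event_type, data=None):
    """Record an event with only its type, time, and any extra data."""
    record = {"type": event_type, "time": time.time()}
    record.update(data or {})
    event_log.append(record)

def track_http_event(event_type, request, data=None):
    """Like track_event, but transparently adds common HTTP metadata."""
    http_data = {
        "url": request.get("url"),
        "user_agent": request.get("user_agent"),
        "session_id": request.get("session_id"),
    }
    track_event(event_type, {**http_data, **(data or {})})

# One call is all it takes to record an event:
track_http_event(
    "request/view",
    {"url": "/requests/42", "user_agent": "Mozilla/5.0", "session_id": "s1"},
    {"request_id": 42},
)
```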
Event recording is completely ad-hoc: there is no central registry of event types, or of the fields that must be present on events. This makes it harder to see what all the event types are, and what data they carry; but in return, it’s extremely easy for any engineer to begin recording something interesting. All they need to do is add one function call. We deal with the ad-hocness by querying our database to find out which event types have been used; grep or ack tells us where in our codebase they get recorded.
Early on in the design process of our tracking system, it became apparent that a relational database would be a poor choice for storing events. Relational databases, with their rigid table schemas, are the opposite of the flexible, ad-hoc system we wanted. We wanted a database that would let us dump arbitrary objects into it, and query them easily for analysis.
MongoDB fits that description nicely. For readers not familiar with it, Mongo is a document database that lets you store, retrieve, and modify arbitrarily-structured JSON objects. It’s similar to a traditional SQL database in that it works with discrete records that you can insert, query, update, and delete, but it doesn’t enforce a schema on the records you insert, and lets them be arbitrarily complex. And, importantly, it’s mature, and has corporate backing.
Events naturally fit into Mongo’s document model, since they are self-contained, structured objects. Not having to specify a schema for events is a big win, since different events that we track use wildly different sets of fields. We also often add or remove fields as our data needs change, and it’s extremely convenient to be able to simply make that change on the frontend without needing to perform an expensive ALTER TABLE, or worry about data migrations. And in contrast to other document databases like CouchDB, Mongo lets you find documents by writing a query (something our business people can do), instead of writing a pair of map-reduce functions (something they cannot).
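To make that contrast concrete: a Mongo query is just a data structure describing what to match. The sketch below (pure Python, hypothetical field names) emulates a tiny subset of Mongo's matching to show how little is needed to express a query:

```python
# A Mongo-style query is just a data structure describing what to match.
query = {"type": "email/send", "email.template": "new_request"}

def matches(doc, query):
    """Emulates a tiny subset of Mongo's matching: dotted paths, equality only."""
    for path, expected in query.items():
        value = doc
        for key in path.split("."):
            if not isinstance(value, dict) or key not in value:
                return False
            value = value[key]
        if value != expected:
            return False
    return True

events = [
    {"type": "email/send", "email": {"template": "new_request", "to_user_id": 7}},
    {"type": "request/view", "request_id": 42},
]
hits = [e for e in events if matches(e, query)]
```

Writing the equivalent as a map-reduce view means writing and maintaining actual functions, which is a much higher bar for non-engineers.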
Mongo also has one performance aspect that makes it very attractive to use for tracking. Most of the tracking we do occurs in our server-side controller code, after a page has been rendered, but before it is sent to the browser. It’s imperative that the tracking process doesn’t impact the performance of our website, so our tracking calls must return as quickly as possible.
Luckily, the MongoDB protocol gives client code full control over the trade-off between speed and guaranteed consistency. In the default mode of the PHP driver, inserts are simply fired off over TCP, and the insert call returns as soon as the message is received by the MongoDB host computer. It does not wait for any response from MongoDB itself indicating whether or not the insert actually worked. (Compare PostgreSQL, whose default behavior is to not return from an implicit or explicit commit until the modified data has actually been written to disk, which is very expensive.)
For tracking, this is exactly what we want. Adding tracking to our site had no measurable effect on page load times, and the safety we’re giving up in return is very small. Since the database isn’t doing any validation of our events, the only likely reason an insert would fail would be some sort of systemic problem affecting our MongoDB server, which would show up in our monitoring systems. (If MongoDB does go down entirely, our website tracking code simply logs a warning, and continues handling the incoming request.)
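The fire-and-forget behavior can be sketched in miniature: the client writes the insert message and returns without reading any reply. The toy Python below uses a local socket pair in place of the MongoDB connection (note that today's MongoDB drivers default to acknowledged writes, so this mode is now opt-in via w=0 rather than the default):

```python
import json
import socket

def insert_nowait(sock, event):
    """Fire-and-forget insert: send the message and return immediately,
    without waiting for any acknowledgement from the server."""
    sock.sendall(json.dumps(event).encode() + b"\n")
    # No recv() here: the call returns once the bytes reach the kernel.

# A local socket pair stands in for the connection to MongoDB.
client_side, server_side = socket.socketpair()
insert_nowait(client_side, {"type": "email/send", "email_id": "abc123"})

# The "server" still received the event, even though the client never waited.
received = json.loads(server_side.recv(4096).decode())
```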
Overall, we’ve been very happy with our tracking system since launching it in November 2010. It has recorded 15GB of data, and over the past 30 days it has averaged more than 200,000 events per day (2.4 events per second). It’s enabled us to report more useful metrics to ourselves, our users, and our investors; to decide where our time is best spent making improvements; to accurately measure the impact of design changes to our website; and even to find the causes of bugs.
But we’ve glossed over how we actually correlate related events, and perform useful analysis on them. In upcoming posts, we’ll describe the system we built to do event correlations in near-real time, and produce reports from the data we collect.