SEO Tip: Titles matter, probably more than you think

As a preface, I want to mention that this post is not any sort of secret formula for SEO. The only way to succeed at SEO is to deliver relevant content to a user, and if you don’t do that you are not going to succeed.

However, as with most things, there are always opportunities for small optimizations. As part of our culture of testing at Thumbtack, we recently decided to test an often-overlooked part of a webpage and see if it would impact search traffic: the title tag.  We did this because whatever you put in your title tag is often what search engines will make the headline of your listing on a search results page.  Here is what a basic result for one of our pages looks like:

Basic Search Listing

If you visit that page, you will see that what we have in the title tag is exactly what Google has chosen to put as the title of the listing on the search result page.  Great!

The Test:

We decided to A/B test 3 different variations of the title tag and see what impact it would make.  Here are the variations we chose (with location and service type substituted for each specific page):

  • Looking for the best House Cleaning Services in San Francisco? (baseline)
  • Get Free Quotes Today From House Cleaning Services In San Francisco (quotes first variation)
  • House Cleaning Services in San Francisco – Get Free Quotes Today (quotes last variation)

To make sure the test would have enough data, we opted a few thousand pages into each bucket, then set them loose.  Our pages are indexed fairly often, so we felt confident that the results within a week or two would be significant; i.e., if there was no change in traffic within the first two weeks, we were confident that was because the buckets were equal, not because the changes had not been picked up.

The Results:

Getting results for a test like this is challenging because search traffic has a lot of variance in it.  Not only do you have to deal with differing traffic based on the day of the week, but in the background your search traffic might be going up and down from things outside of your control (e.g., algorithm changes).  So, to control for variance in search traffic, we did not take our results from raw hit numbers; instead, we looked at the ratio of hits from our experimental titles to hits from our baseline title.
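To make the ratio idea concrete, here is a minimal Python sketch. The function name and the hit counts are made up for illustration and are not our real data; the point is only that a swing that affects every bucket equally (a weekend dip, say) cancels out of the ratio, while a real difference between titles does not.

```python
from typing import List


def traffic_ratios(baseline: List[int], variant: List[int]) -> List[float]:
    """Return the per-day ratio of variant hits to baseline hits."""
    if len(baseline) != len(variant):
        raise ValueError("both series must cover the same days")
    if any(hits == 0 for hits in baseline):
        raise ValueError("baseline must have traffic on every day")
    return [v / b for b, v in zip(baseline, variant)]


# Hypothetical daily hit counts. Days 3 and 4 are a weekend dip that
# lowers both buckets, and the variant also starts to underperform.
baseline_hits = [1000, 1200, 800, 600]
variant_hits = [1000, 1200, 600, 450]

ratios = traffic_ratios(baseline_hits, variant_hits)
```

Here the raw variant numbers fall by half, but the ratio shows that only a quarter of that drop is specific to the variant; the rest is the shared weekend effect.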

The key dates on the graph are October 7th when we launched the test, and October 14th when we ended the test.

SEO Title Test Result

As you can see, the result came quickly and painfully. The alternate variations underperformed the baseline by 20-30%.  Remember, nothing else on the pages changed: not the content, not the H1 tag, not the meta description; the only change on the page was the wording between the <title> and the </title>.

After letting the test run for a week and with the results indisputable, we reverted the titles to the baseline and hoped the traffic would return to normal levels.  And as you can see in the above graph, for the most part the traffic did come back.  We clearly had made an SEO mistake, but once corrected and re-indexed by Google, there did not appear to be a lingering punishment.

Analysis:

The title tag is important. When a user is looking at a search results page, the first thing they look at to decide if they are going to visit your site is the title that Google presents, which more often than not is your title tag.  If your title is good, concise, and matches the searcher’s intent, the searcher is more likely to click through to your page.  Remember that your SEO funnel does not start when a user visits your site from a search engine; it starts when a user sees your result on a search engine result page.

When we started to analyze exactly why the titles underperformed so badly, the first thing we did was look at how the titles looked on the search results.  We were surprised to see that the listings looked like this:

New Google Search Result Listing

Google completely ignored our <title> tag and instead created a composite title, constructed via an unknown method.  For example, the phrases “House Cleaning San Francisco” and “| Thumbtack” did not appear anywhere on the page, yet that was the title on the SERP.  So then the question became, did our search volume drop because our rankings went down or because people were clicking less?  In a way these are connected long term in that we expect that Google would demote results that do not get clicked often, but in the short term we don’t believe this would make a difference.  So we checked the ranks of a few of the pages, and the rankings seemed relatively stable.  We believe the drop in traffic was mainly due to going from a (relatively) good title to a bad title, and that was worth 20-30% of search clicks.

Then we wondered why Google chose to ignore our new titles.  Our guess is that Google found the new title to be a little fishy due to the usage of the words free or today, and the algorithm decided to replace it instead of polluting their search results.  From our perspective we felt the title was very accurate, in that we do offer people free quotes on service jobs in a timely manner, but from an algorithmic perspective we understand that it would be difficult to differentiate that at scale.

Tips:

  1. Is Google using your <title>s?  Search for your pages in Google and see what shows up as the title of your listing.  Is it the <title> of your page?  If not and you feel it is worse than your title, try to figure out why Google isn’t using your title.  Make sure the keywords in the title are relevant to your page and you avoid spammy looking words.
  2. Test your titles.  Try to come up with a few variations of your titles and test them out on your site.  If your site gets indexed relatively often, you should be able to see results relatively quickly.  Remember: this isn’t about increasing your ranking, this is about getting the most out of the rankings you have.  You are leaving traffic on the table if you don’t have a great title.
  3. Monitor your bounce rates.  Increasing the CTR of your search results at the expense of a higher bounce rate is a bad trade-off.  Google tracks when people bounce from your page, and if people are bouncing a lot that is a negative signal that your listing is not what the user is searching for, and will hurt you in the long run.  Make sure your title doesn’t mislead users into clicking.

Googlebot makes POST requests via AJAX

Googlebot is constantly evolving to better capture the web’s content. Over the past few years we’ve seen Googlebot submit GET forms and execute JavaScript. But we’ve always taken it for granted that Googlebot would never execute a POST request, nor would any other well-behaved web crawler.

We were wrong about that. Recently, we started observing Googlebot making POST requests to thumbtack.com. As far as we can tell, such requests have not been openly observed before. These Apache access log excerpts show a few examples:

66.249.71.47 - - [04/Sep/2011:04:53:52 +0000] "POST /act/site/clienterror HTTP/1.1" 200 36 "http://www.thumbtack.com/ma/malden/dog-walking/dog-walking-and-pet-care-services" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

66.249.72.198 - - [25/Sep/2011:04:27:50 +0000] "POST /act/site/clienterror HTTP/1.1" 200 36 "http://www.thumbtack.com/ca/solana-beach/wedding-photographers/photography-cary-pennington-photography" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

66.249.72.207 - - [04/Oct/2011:09:53:08 +0000] "POST /act/site/clienterror HTTP/1.1" 200 36 "http://www.thumbtack.com/tx/san-antonio/painting/residential-commercial-construction-services" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
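If you want to look for similar requests in your own logs, a rough Python sketch follows. It assumes the Apache "combined" log format used in the excerpts above; the pattern and the function name are our own illustration rather than a standard tool, and the second sample line is made up.

```python
import re

# Apache "combined" format: host, identity, user, [time], "request",
# status, size, "referer", "user-agent".
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def googlebot_posts(lines):
    """Yield parsed fields for POST requests whose user agent claims to be Googlebot."""
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # not in combined format; skip it
        fields = match.groupdict()
        if fields["method"] == "POST" and "Googlebot" in fields["agent"]:
            yield fields


sample = [
    # First excerpt from above.
    '66.249.71.47 - - [04/Sep/2011:04:53:52 +0000] "POST /act/site/clienterror HTTP/1.1" 200 36 '
    '"http://www.thumbtack.com/ma/malden/dog-walking/dog-walking-and-pet-care-services" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    # Made-up ordinary GET request, for contrast.
    '192.0.2.1 - - [04/Sep/2011:04:54:00 +0000] "GET / HTTP/1.1" 200 5120 "-" "ExampleBrowser/1.0"',
]

hits = list(googlebot_posts(sample))
```

A user-agent string is trivially spoofed, so a match here only means the request claims to come from Googlebot.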

We’ve verified the requests are coming from real Google crawler IP addresses:

$ dig -x 66.249.71.47 +short
crawl-66-249-71-47.googlebot.com.
$ dig crawl-66-249-71-47.googlebot.com. +short
66.249.71.47
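The same two-step check (reverse lookup, then forward lookup of the returned name) can be scripted. This is a sketch, not a hardened implementation: the function name is ours, and the trusted suffix list is an assumption based only on the hostname shown above. The lookup functions are passed in as parameters so the logic can be exercised without network access.

```python
import socket

# Assumption: genuine crawler hostnames end in this suffix, as in the
# dig output above. Adjust to whatever your own lookups show.
TRUSTED_SUFFIXES = (".googlebot.com",)


def is_verified_googlebot(ip,
                          reverse_lookup=socket.gethostbyaddr,
                          forward_lookup=socket.gethostbyname_ex):
    """Return True if ip reverse-resolves to a trusted hostname that resolves back to ip."""
    try:
        hostname = reverse_lookup(ip)[0]
        if not hostname.rstrip(".").endswith(TRUSTED_SUFFIXES):
            return False
        addresses = forward_lookup(hostname)[2]
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses.
        return False
    return ip in addresses
```

The forward lookup matters: anyone who controls reverse DNS for their own IP range can make an address claim a googlebot.com hostname, but they cannot make that hostname resolve back to their address.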

The source of the requests is our client-side JavaScript error tracking code, which installs a global JavaScript error handler and attempts to POST to our server when unhandled errors are detected on the client. The requests from Googlebot include traceback information, so it appears the code was genuinely executed and not simply parsed to extract links.

Now, this isn’t necessarily harmful behavior. In discussing request safety, RFC 2616 sec. 9.1.1 states:

The important distinction here is that the user did not request the side-effects, so therefore cannot be held accountable for them.

In this case, the JavaScript code makes an unprompted POST request upon page load, not resulting from any user action. One might say that the request fits the above definition and is therefore safe, regardless of the request method. We conclude simply that this is an interesting new feature of Googlebot and one that webmasters should be aware of.

Visualization candy: the making of a realtime geo-dashboard

Thumbtack service directory map

“The most efficient graphic constructions are those in which any question, whatever its type and level, can be answered in a single instant of perception, that is, in a single image.”
- Jacques Bertin, 1983, Semiology of Graphics

Since Thumbtack’s inception, we have aimed to service every city, town, and neighborhood across the country. Unlike other startups that begin in one city and build out to other cities, our goal has been to grow quickly by being available in every town and city from the start. We want to be the most comprehensive and local services directory, whether you’re in Los Angeles, California or Monowi, Nebraska*.

Scanning our access logs and database tables gave us a strong inkling that we were reaching people all around the country. But creating maps gave us definitive visual proof that we were achieving our goals on the national level. The map above shows the geographic breadth of service professionals who have listed their businesses on Thumbtack. We’re proud to say it looks much like a map of cities at night, and it also shows us where we can focus more of our attention.

More than a cumulative snapshot of where Thumbtack has been up to a given point in time, we wanted to see a more dynamic portrait of Thumbtack and its geography. To help us out, we built a realtime mapping toolkit called Rotary Maps.

Realtime visualization


Our implementation shows us how users are interacting with Thumbtack: page views, email clicks, form submissions, etc. Because all these actions are tracked as event documents in MongoDB, they can be easily queried ad-hoc.

MaxMind’s excellent GeoIP database helps us determine where events are happening, so latitude and longitude coordinates are embedded into each event document. Querying Mongo for the geographic coordinates of specific event types is therefore trivial. We then render visualizations with Rotary Maps. The Rotary Maps library merges RaphaelJS vector drawing with Google Maps. This makes it possible to use custom vector icons to show different event types while using AJAX calls to refresh our maps with the latest data.

Open source treats

If you want to implement your own Rotary Map, we’ve released the JavaScript source under an MIT license. Check out Rotary Maps on GitHub. This is an early release, but we hope it will be useful.

The following example shows the latest USGS earthquake data.

Looking at your data through a geographic lens is always interesting, and often insightful. We hope you’ll find Rotary Maps useful for your own visualizations, and happy geo-hacking!


* We do, in fact, have services available to Elsie Eiler, the sole inhabitant of the nation’s smallest city.

Three string functions every PHP project needs

At Thumbtack, we do most of our work in PHP and Python.  Our website is written in PHP, and much of our behind-the-scenes code is written in Python.  Because we are constantly working in both languages, we often run into features of one language that we wish we could have in the other.  In particular, there were 3 features of Python that we really wanted in PHP: string slicing, string startswith, and string endswith.  We recently wrote equivalent functions in PHP and added them to our source tree, and since then they have been used many times, saving time and increasing code readability.

Feel free to use any or all of these functions in your own project.
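For readers who don't write Python, here is a short reminder of the three built-in behaviors being ported. This snippet only shows the Python side; the sample string is arbitrary.

```python
text = "thumbtack.com"

# Slicing: optional start and end offsets; negative offsets count from the end.
first_five = text[:5]    # "thumb"
middle = text[5:9]       # "tack"
last_three = text[-3:]   # "com"

# Prefix and suffix tests return booleans.
has_prefix = text.startswith("thumb")   # True
has_suffix = text.endswith(".org")      # False
```

The appeal is readability: the intent of `startswith` is obvious at a glance, which is not true of the offset arithmetic it replaces.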

Building our own tracking engine with MongoDB

There are few things we love more than understanding how people are using Thumbtack. Whenever possible, we use direct interactions with our users to learn about their experience. We perform usability tests, in which members of our design team will sit down with a user, ask them to do something on Thumbtack, and watch how they accomplish it. We call users to ensure they’re having a good experience, and ask them for ways they think we can improve. And users who have trouble with a part of the site will contact us directly.

These interactions provide us with great information on a micro scale: the experience of one user using one particular part of our site. But, in order for us to decide where to focus our design attention, and where to run these time- and labor-intensive usability tests, we need good data on a macro scale about what is happening on Thumbtack.

But we found that there was no single place we could turn to that had all the data we needed. We use Google Analytics to track how users come into our site and move from page to page, but it (and similar tools) works mostly on page views, and defines its own concept of unique users and sessions. This makes it poorly suited for learning about:

  • interactions between users, where an action taken by one will influence the other,
  • long-running interactions that span different browser sessions, and possibly different computers with different tracking cookies, and
  • interactions that use email instead of the website, which Google Analytics is completely blind to.

We could partially reconstruct some of those interactions by looking at our database, but we couldn’t tie the reconstructed data back to the things that Google Analytics did do well: providing information about where users came from, how long they spent on certain tasks, and how they navigated the site.

The most important interactions on Thumbtack are the ones that Google Analytics is not a good fit for. So, we decided to build an analytics system that natively works with arbitrary events instead of page views, and lets us link together events that happen across sessions, or outside the browser entirely.

Events

No matter how complex it is, any interaction on Thumbtack can be decomposed into one or more discrete events: individual actions taken by a user, or by Thumbtack itself. Whenever an interesting event happens, we record information about it to a database.

For example, when we match a request for a service with a service provider who could fulfill it, we send the provider an email notification. When that happens, we create an event record with this data (rendered as YAML):

Each event record can contain complex metadata, making our event collection an analytics treasure trove. We can (and do) use these email/send events to monitor email volume, and to keep a log of the emails we send to each user. But the real power of these events lies in their ability to be correlated with other events that are part of the same interaction.

When the service provider that got the “new request” email clicks the link inside it to the new request, we create another event record:

By correlating the earlier email/send events with request/view events that have a matching email ID, we can start building what we call a pathway: a series of steps that a user goes through to complete a larger task. (If you’re familiar with the term funnel tracking, this is a broader concept: not all of our pathways have one single goal at the end.) When we correlate all the events that make up a pathway, we get a view of what happened on every step. We can see how long it takes people to complete each step, how they interact with each step, and if there are any steps where people appear to give up. And, because we use persistent database IDs to correlate events, we can do so perfectly, even if a user performs a step a month after starting the pathway, or from a different computer.
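As a rough illustration of the idea (not our production code), the Python sketch below groups events by a shared ID and measures the time between two steps. The field names `type`, `email_id`, and `time`, and both function names, are hypothetical stand-ins chosen for this example.

```python
from datetime import datetime


def build_pathways(events, key="email_id"):
    """Group events that share the same correlation ID, ordered by time."""
    pathways = {}
    for event in events:
        correlation_id = event.get(key)
        if correlation_id is None:
            continue  # event is not part of this kind of pathway
        pathways.setdefault(correlation_id, []).append(event)
    for steps in pathways.values():
        steps.sort(key=lambda step: step["time"])
    return pathways


def step_delay(steps, first_type, second_type):
    """Time between the first occurrence of two event types, or None if either is missing."""
    first_seen = {}
    for step in steps:
        first_seen.setdefault(step["type"], step["time"])
    if first_type in first_seen and second_type in first_seen:
        return first_seen[second_type] - first_seen[first_type]
    return None


# Hypothetical events: email 1 is opened a month later, email 2 never is.
events = [
    {"type": "email/send", "email_id": 1, "time": datetime(2011, 1, 1, 9, 0)},
    {"type": "email/send", "email_id": 2, "time": datetime(2011, 1, 1, 9, 5)},
    {"type": "request/view", "email_id": 1, "time": datetime(2011, 2, 1, 9, 0)},
]

pathways = build_pathways(events)
delay = step_delay(pathways[1], "email/send", "request/view")
abandoned = step_delay(pathways[2], "email/send", "request/view")
```

Because the grouping key is a database ID rather than a cookie or session, the month-long gap and any change of computer make no difference to the match.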

Recording interface

Thumbtack now tracks 120 different types of events, including page views, logins, profile updates, messages, searches, and website deploys. In order to do that much tracking manageably, we need a simple code interface for recording events.

The above request/view event is generated by this PHP statement:

The track_http_event function is a helper for recording events that happen in response to an HTTP request. It transparently adds all the common data we want to record about HTTP events, which is all the data you see under http, source, and identifiers. For events that happen in background processes, there is a basic track_event function that only records the event type, time, and additional data passed in as an array.

Event recording is completely ad-hoc: there is no central registry of event types, or of the fields that must be present on events. This makes it harder to see what all the event types are, and what data they carry; but in return, it’s extremely easy for any engineer to begin recording something interesting. All they need to do is add one function call. We deal with the ad-hocness by querying our database to find out what event types have been used, and using grep or ack tells us where in our codebase they get recorded.
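To show the shape of such an interface, here is a loose Python sketch. Our real helpers are PHP and write to MongoDB; everything here, including the argument order, the contents of the `http` block, and the list used as a stand-in for the database, is an assumption made for illustration.

```python
import time


def _build_event(event_type, data):
    """Assemble the fields every event carries: type, time, and caller-supplied data."""
    return {"type": event_type, "time": time.time(), "data": dict(data or {})}


def track_event(sink, event_type, data=None):
    """Record a bare event, e.g. from a background process."""
    event = _build_event(event_type, data)
    sink.append(event)
    return event


def track_http_event(sink, request, event_type, data=None):
    """Record an event and attach common request details (hypothetical fields)."""
    event = _build_event(event_type, data)
    event["http"] = {
        "method": request.get("method"),
        "url": request.get("url"),
        "user_agent": request.get("user_agent"),
    }
    sink.append(event)
    return event


def event_types(sink):
    """List the distinct event types recorded so far; no registry needed."""
    return sorted({event["type"] for event in sink})
```

Note that nothing validates `event_type` or `data`. That is the trade-off described above: adding a new kind of event is one function call, and the list of known types is discovered by querying what has been stored.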

Storage

Early on in the design process of our tracking system, it became apparent that a relational database would be a poor choice for storing events. With their rigid table schemas, they are the opposite of the flexible, ad-hoc system we wanted. We wanted a database that would let us dump arbitrary objects into it, and be able to query them easily to do analysis.

MongoDB fits that description nicely. For readers not familiar with it, Mongo is a document database that lets you store, retrieve, and modify arbitrarily-structured JSON objects. It’s similar to a traditional SQL database in that it works with discrete records that you can insert, query, update, and delete, but it doesn’t enforce a schema on the records you insert, and lets them be arbitrarily complex. And, importantly, it’s mature, and has corporate backing.

Events naturally fit into Mongo’s document model, since they are self-contained, structured objects. Not having to specify a schema for events is a big win, since different events that we track use wildly different sets of fields. We also often add or remove fields as our data needs change, and it’s extremely convenient to be able to simply make that change on the frontend without needing to perform an expensive ALTER TABLE, or worry about data migrations. And in contrast to other document databases like CouchDB, Mongo lets you find documents by writing a query (something our business people can do), instead of writing a pair of map-reduce functions (something they cannot).

Mongo also has one performance aspect that makes it very attractive to use for tracking. Most of the tracking we do occurs in our server-side controller code, after a page has been rendered, but before it is sent to the browser. It’s imperative that the tracking process doesn’t impact the performance of our website, so our tracking calls must return as quickly as possible.

Luckily, the MongoDB protocol gives client code full control over the trade-off between speed and guaranteed consistency. In the default mode of the PHP driver, inserts are simply fired off over TCP, and the insert call returns as soon as the message is received by the MongoDB host computer. It does not wait for any response from MongoDB itself indicating whether or not the insert actually worked. (Compare PostgreSQL, whose default behavior is to not return from an implicit or explicit commit until the modified data has actually been written to disk, which is very expensive.)

For tracking, this is exactly what we want. Adding tracking to our site had no measurable effect on page load times, and the safety we’re giving up in return is very small. Since the database isn’t doing any validation of our events, the only likely reason an insert would fail would be some sort of systemic problem affecting our MongoDB server, which would show up in our monitoring systems. (If MongoDB does go down entirely, our website tracking code simply logs a warning, and continues handling the incoming request.)

Conclusion

Overall, we’ve been very happy with our tracking system since launching it in November, 2010. It’s recorded 15GB of data, and over the past 30 days, over 200,000 events per day on average (2.4 events per second). It’s enabled us to report more useful metrics to ourselves, our users, and our investors; to decide where our time is best spent making improvements; to accurately measure the impact of design changes to our website; and even to find the causes of bugs.

Graph of events recorded per day since January 2011

But we’ve glossed over how we actually correlate related events, and perform useful analysis on them. In upcoming posts, we’ll describe the system we built to do event correlations in near-real time, and produce reports from the data we collect.

How database replication helps me sleep at night

Early Christmas morning, around 2 A.M., I was on vacation with my family in St. Thomas.  I had just fallen asleep when I was woken by a phone call.  It was my coworker Steve. “The site is down,” he said. I pulled out my laptop, checked the website, and confirmed. This is a spot everybody in operations has been in before, and I immediately had that sinking feeling in my stomach and assumed the worst: all the hard drives had spontaneously exploded and we had complete data loss. Of course we take regular backups every few hours, but rebuilding our production server and restoring it from backup would not be a fun task, especially on slow Atlantic island internet.

So first I needed to assess the damage, and attempted to SSH into the box.  Success! A beautiful sight when your website is down, a bash prompt. The machine was working, all data was intact, but the filesystem had become read-only, causing the database to crash and the website to break. The filesystem had become read-only because one of the drives in our RAID-1 had crashed. The fix was quite simple: we just kicked the filesystem back into read-write mode, and everything was back up in a few minutes. We rebuilt the RAID array in the background, and we were back up and running at full strength within a few days. However, I had trouble forgetting that sinking feeling, and when I got back to work on Monday, increasing the redundancy of our data was the top priority. My goals were the following:

  • Minimize the data loss from losing our database server
  • Be able to recover from losing our database server in under an hour

Option 1: Increase frequency of database backups

The first option we looked at was increasing the frequency of database backups. Previously we had taken backups every 6 hours, but a possible solution would be to increase the frequency of these backups to hourly. Pros:

  • In the worst case we lose one hour of data.
  • In case of DROP TABLE x accidents, we always have recent backups

Cons:

  • In order to restore from a failure, we would need to do a full download and restore of our database on a fresh database machine. In the best case this could take ~20 minutes, in the worst case it could take much longer.
  • If we want to keep multiple backups, we will need a lot of storage space
  • Full database backups are expensive. Every row of the database must be read into memory, not only causing a lot of disk I/O but also messing with our page caches.
  • We will use a lot of bandwidth to continually transfer our database backups off-site, or need to use expensive redundant storage from our hosting provider

Option 2: Use “file-based log shipping” database replication

At the time we had been using Postgres 8.3, which offered file-based log shipping replication. This means that you run one master database and multiple standby databases, and as the master database finishes writing its write-ahead log files, the log files are stored where all the standby servers can access them. The standby servers then load the files, read and replay the operations, and they contain a relatively up to date version of the master database. If the master database fails, you can promote one of your standby servers to the master server, and you are back up and running. Pros:

  • In the worst case we lose X MB of write data, where X is the size of our write-ahead log files
  • In case of master database failure, promotion of a standby machine can happen very quickly (<5 minutes)

Cons:

  • Since data is only replicated after a full write-ahead log file has been completed, in cases of low write traffic the standby machines could be many minutes to hours behind
  • The standby machine is a “warm standby,” meaning it cannot be queried against and can only be used if it is promoted to the master
  • You must have a shared location where logs are shipped so they can be read by standby machines, such as an NFS mount.
  • In case of DROP TABLE x accidents, there is no recovery

Option 3: Use “streaming replication” database replication

Postgres 9.0, the next major release of Postgres, included a new feature called streaming replication.  Like file-based log shipping, it works by having the standbys read the write-ahead log; but instead of applying the log entries in large batches from completed log files, they connect to the master and receive the entries as they happen. In addition, the capabilities of the standby machines were upgraded in this release, going from “warm standby” machines to “hot standby” machines. This new feature means that the standby machines can be queried and act as read-only replicas of the master database. In addition, because the standby machines are replaying transaction data from the master database, the standby machines are guaranteed to always have a consistent database state. Pros:

  • In the worst case we lose X seconds of data, where X is the replication delay between master and standby machines
  • Hot standby machine can be used as a read-only slave
  • In case of master database failure, promotion of standby machine can happen very quickly (<5 minutes)
  • In case of master database failure, fallback to read-only mode can happen instantly by our frontend machines using the standby machine

Cons:

  • In case of DROP TABLE x accidents, there is no recovery

Option 4: Best of all worlds

Thumbtack uses a combination of Option 1 and Option 3. We upgraded our database from Postgres 8.3 to Postgres 9.0 mainly so we could use streaming replication. Currently, we have our master database and one standby machine, which is used heavily as a read-only slave for things like analytics queries. In addition, we take multiple regular full backups of our database every few hours and store them off-site, so in case a bad query gets run on our database or there is a catastrophic datacenter issue, we always have fresh backups.

Of course, you cannot rely on anything unless you monitor it, and replication only works if it doesn’t break down. Thus, I wrote a daemon which queries the standby and master machines to determine the replication lag and exports that information to Munin.  This not only allows us to view the replication lag over time, but also lets us set alerts if the replication lag gets too high.  Here is a graph of the replication lag over the last week:

As you can see, the replication lag between the machines is no more than 5 seconds, which is more than acceptable for our needs. For those really paranoid about data loss, Postgres 9.1 is going to add a feature known as synchronous replication, in which a database commit will not return as a success until it is committed by both the master and the standby machines, thus effectively reducing the replication lag between servers to 0.  However, this comes at a performance cost, in that all of your commits must be shipped to two places instead of one; for our needs, the possibility of losing a few seconds of data is worth the higher performance of asynchronous replication.
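For anyone who wants to build a similar monitor, here is one possible approach sketched in Python. It is not necessarily how my daemon works, and it measures lag in bytes of write-ahead log rather than in seconds. It assumes you have already fetched the log positions as text, for example from pg_current_xlog_location() on the master and pg_last_xlog_replay_location() on the standby. The helper names are made up, and the 0xFF000000 constant reflects my understanding of how Postgres releases before 9.3 number their log files, so verify it against your version.

```python
# Assumption: before Postgres 9.3, each log ID spans 0xFF000000 bytes
# because the last 16 MB segment of every log ID is skipped.
BYTES_PER_LOG_ID = 0xFF000000


def wal_position(location, bytes_per_log_id=BYTES_PER_LOG_ID):
    """Convert a 'logid/offset' hex string such as '0/2000' to an absolute byte position."""
    log_id, offset = location.split("/")
    return int(log_id, 16) * bytes_per_log_id + int(offset, 16)


def replication_lag_bytes(master_location, standby_location):
    """Bytes of write-ahead log the standby has not yet replayed (never negative)."""
    lag = wal_position(master_location) - wal_position(standby_location)
    return max(0, lag)


def munin_value_line(field, value):
    """Format one reading in the 'field.value N' form a Munin plugin prints."""
    return "{}.value {}".format(field, value)
```

A plugin built this way would print something like `munin_value_line("lag", replication_lag_bytes(master, standby))` on each poll, and Munin's warning and critical thresholds can then be set on that field.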

As I learned from the Christmas day incident, machine failures are going to happen at the worst times, but thanks to streaming replication and regular backups I know that the next time a machine fails we are going to be well prepared to recover with minimal data loss, and that helps me sleep at night.