Googlebot makes POST requests via AJAX

Googlebot is constantly evolving to better capture the web’s content. Over the past few years we’ve seen Googlebot submit GET forms and execute JavaScript. But we’ve always taken it for granted that Googlebot would never execute a POST request, nor would any other well-behaved web crawler.

We were wrong about that. Recently, we started observing Googlebot making POST requests to thumbtack.com. As far as we can tell, such requests have not been openly observed before. These Apache access log excerpts show a few examples:

66.249.71.47 - - [04/Sep/2011:04:53:52 +0000] "POST /act/site/clienterror HTTP/1.1" 200 36 "http://www.thumbtack.com/ma/malden/dog-walking/dog-walking-and-pet-care-services" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

66.249.72.198 - - [25/Sep/2011:04:27:50 +0000] "POST /act/site/clienterror HTTP/1.1" 200 36 "http://www.thumbtack.com/ca/solana-beach/wedding-photographers/photography-cary-pennington-photography" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

66.249.72.207 - - [04/Oct/2011:09:53:08 +0000] "POST /act/site/clienterror HTTP/1.1" 200 36 "http://www.thumbtack.com/tx/san-antonio/painting/residential-commercial-construction-services" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

We’ve verified that the requests are coming from real Google crawler IP addresses:

$ dig -x 66.249.71.47 +short
crawl-66-249-71-47.googlebot.com.
$ dig crawl-66-249-71-47.googlebot.com. +short
66.249.71.47
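
The same two-step check (a reverse lookup on the IP, then a forward lookup on the returned hostname) can be automated. The following is a minimal sketch for Node.js, not the code we run in production. The function name isGooglebotIp and the injectable resolver parameter are assumptions made for illustration, and the hostname suffix check is based only on the googlebot.com name shown above.

```javascript
const dns = require('node:dns').promises;

// Returns true when `ip` reverse-resolves to a googlebot.com hostname AND that
// hostname resolves back to the same IP (IPv4 only in this sketch).
// `resolver` defaults to Node's dns.promises; it is injectable so the logic
// can be exercised without network access.
async function isGooglebotIp(ip, resolver = dns) {
  let hostnames;
  try {
    hostnames = await resolver.reverse(ip); // PTR lookup, like `dig -x <ip>`
  } catch (err) {
    return false; // no PTR record, or the lookup failed
  }

  for (const rawName of hostnames) {
    // Normalize: drop a trailing dot and compare case-insensitively.
    const name = rawName.replace(/\.$/, '').toLowerCase();
    if (!name.endsWith('.googlebot.com')) continue;

    try {
      const addresses = await resolver.resolve4(name); // A lookup, like `dig <name>`
      if (addresses.includes(ip)) return true;
    } catch (err) {
      // Forward lookup failed for this name; try the next one.
    }
  }
  return false;
}
```

The forward lookup matters because anyone who controls the reverse DNS zone for their own IP range can publish a PTR record that claims a googlebot.com name; only Google can make that name resolve back to the address.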

The source of the requests is our client-side JavaScript error tracking code, which installs a global JavaScript error handler and attempts to POST to our server when unhandled errors are detected on the client. The requests from Googlebot include traceback information, so it appears the code was genuinely executed and not simply parsed to extract links.
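
For readers unfamiliar with this pattern, here is a minimal sketch of such an error reporter. It is an illustration, not our actual tracking code: the endpoint path is the one visible in the access logs above, while the function names, the payload fields, and the JSON encoding are all assumptions.

```javascript
// Installs a global error handler on `target` (normally `window`) that reports
// unhandled errors through `send(url, body)`. The payload shape is hypothetical.
function installErrorReporter(target, send, endpoint = '/act/site/clienterror') {
  const previous = target.onerror;

  target.onerror = function (message, source, lineno, colno, error) {
    const report = {
      message: String(message),
      source: source || null,
      line: lineno || null,
      column: colno || null,
      // Traceback information, when the browser provides an Error object.
      stack: error && error.stack ? String(error.stack) : null,
    };

    try {
      send(endpoint, JSON.stringify(report));
    } catch (sendError) {
      // The reporter must never raise a new unhandled error itself.
    }

    // Preserve any handler that was installed before this one.
    if (typeof previous === 'function') {
      return previous.apply(this, arguments);
    }
    return false; // let the browser's default error handling continue
  };
}

// Browser-only transport (hypothetical): POST the JSON body with XMLHttpRequest.
function postWithXhr(url, body) {
  const xhr = new XMLHttpRequest();
  xhr.open('POST', url, true);
  xhr.setRequestHeader('Content-Type', 'application/json');
  xhr.send(body);
}

// In a page: installErrorReporter(window, postWithXhr);
```

Any client that executes the page's JavaScript and hits an unhandled error will trigger the POST, which is consistent with what the logs show for Googlebot.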

Now, this isn’t necessarily harmful behavior. In discussing request safety, RFC 2616 sec. 9.1.1 states:

“The important distinction here is that the user did not request the side-effects, so therefore cannot be held accountable for them.”

In this case, the JavaScript code makes an unprompted POST request upon page load, not resulting from any user action. One might say that the request fits the above definition and is therefore safe, regardless of the request method. We conclude simply that this is an interesting new feature of Googlebot and one that webmasters should be aware of.

  • http://twitter.com/wyred Erik Yeoh

    I’ve noticed this since Dec 2010 on an AJAX feedback form on one of my sites.

    I kept receiving blank feedback submissions, so I decided to track the IP address to see what was going on and found that it belonged to Google’s crawler.

    I couldn’t think of a way to stop this, and I didn’t want to implement a captcha either, so I just ignored it.

    • Jack

      Just out of curiosity, was the URL being POSTed to in the robots.txt?

      • Anonymous

        Jack, yes, it does seem that Googlebot respects robots.txt when issuing these AJAX requests. (In the case of the logs shown above, we had not blocked the affected URLs.)

      • http://twitter.com/wyred Erik Yeoh

        Nope, I did not use robots.txt for my site.

    • Ivan K.

      You could try validating that the feedback is not empty before saving it.

  • Foo

    You can’t use AJAX from other domains. Maybe Google crawls your links in JavaScript. Do you check whether the server variable HTTP_X_REQUESTED_WITH is set?

  • seokitty

    It is sometimes difficult to track the real Google crawler and its content-capture style.
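
The two server-side suggestions raised in the comments above (rejecting empty feedback before saving it, and checking the X-Requested-With header) can be sketched as a single guard function. This is a hedged illustration only: the function name shouldSaveFeedback and the message field are hypothetical, and the header is one that libraries such as jQuery add to same-origin AJAX requests rather than something every client sends.

```javascript
// Decides whether a feedback submission should be stored.
// `headers` is assumed to use lowercase header names (as Node's http module
// provides them); `fields` is the parsed form body. Both shapes are assumptions.
function shouldSaveFeedback(headers, fields) {
  // Reject blank submissions: missing, non-string, or whitespace-only text.
  const text = typeof fields.message === 'string' ? fields.message.trim() : '';
  if (text.length === 0) return false;

  // Require the header that AJAX libraries such as jQuery set on same-origin calls.
  const requestedWith = String(headers['x-requested-with'] || '').toLowerCase();
  return requestedWith === 'xmlhttprequest';
}
```

Note that a crawler executing the page's own JavaScript would send the same header as a real browser, so the header check alone is unlikely to filter it out; the emptiness check is the part that addresses the blank submissions described above.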