At Thumbtack we run a few Python services via the Gunicorn WSGI HTTP server. Since those services are mostly I/O bound, gevent is a great fit for them. However, we have encountered a few issues with this combination and would like to share the solutions we found.
Lesson 1: Avoid multiprocessing.Queue
The first issue surfaced as critical worker timeouts in our error log, along the lines of:
[CRITICAL] WORKER TIMEOUT (pid:151)
As the logs didn’t contain anything else that would help us, we first needed a way to reliably reproduce the issue. We started by running the application in an isolated environment and sending traffic to it with ab. After creating roughly 100 concurrent connections with ab, we were able to reproduce the error message. Then we started stripping features out of the application, from the top down: distributed tracing, authentication, pieces of routing logic, pieces of view controller logic. We kept removing pieces and could still reproduce the error, until we removed calls to our monitoring library. Digging deeper into that code, we found it relied on multiprocessing and multiprocessing.Queue, and found the following in the gevent 1.1 changelog:
Use of multiprocessing.Queue when thread has been monkey-patched will lead to a hang due to Queue’s internal use of a blocking pipe and threads. For the same reason, concurrent.futures.ProcessPoolExecutor, which internally uses a Queue, will hang.
The solution was to refactor the library, removing the use of multiprocessing and replacing multiprocessing.Queue with the standard library queue module. We also added alerting on the “critical timeout” message in our logs, via ElastAlert, so we’re on top of the issue as soon as possible if it happens again.
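A minimal sketch of the shape of that refactor (the names here are hypothetical, not our actual monitoring library): buffer events in a queue.Queue, whose locks and condition variables gevent can monkey-patch, instead of a multiprocessing.Queue, whose blocking pipe cannot yield to the event loop.

```python
import queue
import threading

# queue.Queue is safe under gevent monkey-patching; multiprocessing.Queue
# is not, because it blocks on an OS pipe that gevent cannot cooperate with.
metrics_queue = queue.Queue(maxsize=1000)
sent = []  # stand-in for the real monitoring backend


def record_metric(name, value):
    # Never block the request handler: drop the metric if the buffer is full.
    try:
        metrics_queue.put_nowait((name, value))
    except queue.Full:
        pass


def flush_worker():
    # Background consumer; a daemon thread here, a greenlet under gevent.
    while True:
        item = metrics_queue.get()
        sent.append(item)  # in reality, ship to the monitoring backend
        metrics_queue.task_done()


threading.Thread(target=flush_worker, daemon=True).start()
record_metric("requests", 1)
metrics_queue.join()  # wait until the buffered metric has been consumed
```

The key design point is that the producer never blocks: a full buffer drops metrics rather than stalling a request greenlet.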
Lesson 2: Look out for anything modifying global state, even third-party libraries
The next gotcha bit us while rolling out a new feature using Thrift RPC, running on Gunicorn 19.7.1 or earlier. We noticed an elevated number of 504 (Gateway Timeout) HTTP responses. The access logs also indicated that request time was below our timeout threshold, strongly suggesting an issue with the application itself. We pinpointed the issue to a commit that introduced Thrift RPC calls, but we weren’t able to reproduce it easily. We started testing in an isolated environment, but this time couldn’t reproduce the 504 error. However, during the process we did notice that connections were not being shut down properly, which raised the possibility that our reverse proxy layer (NGINX) was generating the 504 errors.
After some research, we noticed the official Python Thrift client sets a global default timeout (socket.setdefaulttimeout) and unsets it as soon as the call completes. Under gevent, however, that global can be picked up by another request (coroutine) in the meantime, resulting in unexpected behavior: for instance, connections being terminated before anything was sent to the client. We submitted a patch to the Apache Thrift project, which was approved and merged. The Gunicorn project also patched the issue on its side and released version 19.8.0 a few weeks after we identified it.
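A contrived sketch of why the global is dangerous (this is not the actual Thrift client code): socket.setdefaulttimeout mutates interpreter-wide state, so under gevent any other greenlet that creates a socket inside that window silently inherits the timeout. Scoping the timeout to the one socket that needs it avoids the race entirely.

```python
import socket

# The hazard: the default timeout is process-global state.
socket.setdefaulttimeout(0.5)   # greenlet A sets a short RPC timeout...
other_sock = socket.socket()    # ...greenlet B's unrelated socket inherits it
socket.setdefaulttimeout(None)  # A unsets it, but B is already affected

# The safe pattern: set the timeout on the specific socket instead.
rpc_sock = socket.socket()
rpc_sock.settimeout(0.5)        # affects only this socket
```

Per-socket settimeout is what the merged Thrift patch moves toward: no shared state, so concurrent coroutines cannot observe each other's timeouts.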
Gevent uses monkey patching so that the standard Python libraries can be used cooperatively. Importing some libraries before monkey patching happens can yield unexpected results. Gunicorn monkey patches early on automatically when the gevent worker is used, but you can still encounter errors, as we did. Hence, it’s important to have verbose logging and a robust monitoring infrastructure.
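When running outside Gunicorn (standalone scripts, background workers), the safest ordering is to patch before anything else is imported. A sketch, assuming gevent is installed:

```python
# Patch first, before any other imports: modules imported earlier may keep
# references to the original, blocking stdlib objects.
from gevent import monkey
monkey.patch_all()

import socket  # now resolves to gevent's cooperative implementation

# gevent can report which stdlib modules it has patched.
print(monkey.is_module_patched("socket"))  # True
```

If a third-party library imports socket or ssl at module load time, putting patch_all() at the very top of your entry point is the only reliable way to guarantee it sees the patched versions.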
Should you encounter such a problem, the best thing to do is look into historical logs and metrics and pinpoint when the first event happened. Should that not be possible, you can fall back to a process of elimination, but only if you know how to reproduce the issue.
If you are excited about getting to the bottom of problems like these and contributing back to open source projects, join us!