Running PostgreSQL on a single primary master node is simple and convenient. There is a single source of truth, one instance to handle all reads and writes, one target for all clients to connect to, and only a single configuration file to maintain. However, such a setup usually does not last forever. As traffic increases, so does the number of concurrent reads and writes, the read/write ratio may become too high, a fast and reliable recovery plan needs to exist, the list goes on…
No single approach solves all possible scaling challenges, but there are quite a few options for scaling PostgreSQL depending on the requirements. When the read/write ratio is high enough, there is fairly straightforward scaling strategy: setup secondary PostgreSQL nodes (replicas) that stream data from the primary node (master) and split SQL traffic by sending all writes (INSERT, DELETE, UPDATE, UPSERT) to the single master node and all reads (SELECT) to the replicas. There can be many replicas, so this strategy scales better with a higher read/write ratio. Replicas are also valuable to implement a disaster recovery plan as it’s possible to promote one to master in the event of a failure.
In 2014, Thumbtack was running PostgreSQL 9.1 on two servers: a basic master – slave setup leveraging PostgreSQL’s built-in streaming replication. Our infrastructure was comprised of a few dozen physical machines on SoftLayer running RHEL 5 and we were using HAproxy with Keepalived for load balancing. The future, already being planned for, would be powered by EC2 instances on AWS, running Debian 7 behind Elastic Load Balancers.
As traffic grew, we knew we would need to scale out PostgreSQL further. Thumbtack’s SQL traffic was (and still is) quite read-intensive, with less than 3% of all queries being executed on the master node. This was good news as it meant we could scale out by sending SELECT statements to a cluster of read-only replicas and leaving the master alone to process DML commands.
In order to properly implement this we would need:
- an arbitrary number of read-only replicas behind a load balancer;
- the load balancer itself could not be a single point of failure;
- a way of performing health checks on each server, executed from the load balancer, so that failed nodes would be taken in and out of rotation automatically;
- to support SoftLayer and AWS environments during the transition period.
Replication, high-availability, and load-balancing
We knew what we wanted the infrastructure to look like from a high-level perspective and had the tools available to implement almost all of it on both providers (Fig. 1).
One critical detail, however, was far from being a solved problem: health checks.
A basic ping on port 5432 was not enough. Performance and replication lag were (and still are!) very important factors to us — if a given replica is lagging behind by more than N seconds, we prefer not to use it until it recovers as it would otherwise lead to stale reads.
Custom health checks
Not having found an open source tool that implements powerful enough health-checks for PostgreSQL, we decided to write our own. These were the requirements:
- Work equally well on both environments — RHEL 5/HAproxy on Softlayer and Debian 7/ELBs on AWS;
- Check basic TCP connectivity, on an arbitrary port, with a configurable timeout;
- Check server availability by running a test query with a time limit — if a server is under load, it may be responding to TCP but not able to process a simple query (SELECT 1). We need to distinguish between these two scenarios, and potentially take different actions;
- Check replication lag (time elapsed since the last transaction was replayed);
- Support custom health checks in the form of SQL queries — extensible and future-proof;
- Low memory footprint — avoid “stealing” memory from PostgreSQL;
- Minimal list of external dependencies.
A web service, exposing a simple HTTP endpoint, would work in any environment and easily be able to test TCP connectivity. Simple queries and testing replication lag are just a special case of running arbitrary SQL queries as a health check, so we just focused on this one and implemented the others as a form of syntactic sugar.
One important decision for delivering a platform independent solution with low memory footprint and minimal dependencies was the choice of the programming language. We considered a few from Python (there was already a reasonably large Python code base at Thumbtack), to Go (we were taking our first steps with it), and even Rust (too immature at the time).
We ended up writing it in C. It was easy to meet all requirements with only one external dependency for implementing the web server, clearly no challenges running it on any of the Linux distributions we were maintaining, and arguably the implementation with the smallest memory footprint given the choices above.
The final result
We named the project pgDoctor and made it publicly available on our Github repository.
It uses microhttpd to implement a very simple web service that listens on port 8071, logs to the local7 syslog facility (configurable), and provides a reasonably rich set of configuration parameters. The behavior is quite simple: an HTTP GET request to :8071 returns 200 if all checks pass, 500 otherwise. All errors are logged.
pgDoctor has been running flawlessly on all our PostgreSQL replicas for roughly 3 years now, having gone through two major upgrades (9.1 –> 9.4 –> 9.6). As of now, there are 18 streaming replicas, all running pgDoctor alongside PostgreSQL, and distributed among 4 clusters. Each cluster supports different use cases and requires slightly different health checks.
PostgreSQL replicas are sometimes taken out of rotation. The most common reasons are temporary high replication lag or some transient issue with the underlying EC2 instance. As expected, they are added back to the cluster without any intervention once normality is restored and the health checks succeed.
Fig. 2 shows a diagram of (a downsized version of) our production environment:
- Three availability zones;
- One master node and two hot-standby instances on different availability zones;
- Three clusters of read-only replicas, streaming from the master, each with its own load balancer;
- Several clients, on all availability zones, reading from one or more clusters and writing to the master.
Does this sound interesting? There is a lot more to be done. Join Thumbtack and help us build, scale, and operate a high reliability service!