At Thumbtack we use InfluxDB to store monitoring metrics collected from all of our systems. It currently handles the ingestion of more than 200,000 data points per second, and with our current retention policies this adds up to 7.5 TB of data.
We prefer to operate on open-source software, which is one of the reasons we were drawn to InfluxDB. However, the open-source core does not provide high availability features.
As we grew, rolled out new features, and collected more and more metrics (the amount of which has almost doubled over the past 12 months), we started getting concerned with a single point of failure in our monitoring stack and decided to build a custom solution, tailored to our needs.
From a high level perspective, as per the diagram below, our monitoring stack is set up as follows:
- applications are instrumented to collect and deliver metrics to the storage layer, provided by InfluxDB;
- Grafana provides the dashboards for data visualization and exploration;
- Kapacitor streams data from InfluxDB to match patterns and process alerts.
High availability on InfluxDB
The open-source core of InfluxDB does not provide clustering/high availability functionalities. While clustering was not a hard requirement for us, we believed high availability to be necessary because even the best software eventually fails.
InfluxData has two commercial offerings available, InfluxEnterprise and InfluxCloud, that include proprietary, closed-source high availability and clustering functionality alongside other features and support from InfluxData. We decided not to pursue either option, both because of our open-source preference and because of a mismatch between the features offered and our specific needs.
Another option we considered was the open-source InfluxDB Relay. It can sustain the failure of one Relay or one InfluxDB node while still taking writes and serving queries, but the recovery process might require operator intervention. It has also not been updated in 2 years, and it is not sufficient for long periods of downtime: all data is buffered in RAM, buffered data is lost if a Relay node fails, and requests are dropped when the buffer is full. During prolonged outages the buffer may also degrade the health of the Relay instance itself by adding memory pressure. Lastly, a health checker would have to be added to this setup to ensure that nodes recovering from temporary failures do not respond to queries while the buffer is still being flushed; otherwise only partial data would be served, alerts might go off, and so on.
When starting an in-house project to support InfluxDB’s resiliency, our main goals could be summarized as:
- to implement a mechanism to replicate our monitoring data to multiple InfluxDB instances;
- to ensure that the storage layer of our monitoring stack remains available even when individual nodes are down;
- to ensure the new architecture would allow us to scale out by launching multiple clusters supporting different databases;
- to maintain full compatibility with existing clients, for both writing and reading data.
In order to deliver a truly highly available service, we needed a way of writing metrics to an arbitrary number of InfluxDB nodes and of distributing queries among all nodes. Provided we could build a tool to run reliable health checks on individual nodes, a standard load balancer would be enough to address the latter. For the former, we would have to build a mechanism to forward writes or replicate data.
The following table presents a summary of the techniques used to add high availability and failure recovery to InfluxDB.
| Scenario | Technique |
|---|---|
| High availability for writes | Replay metrics to multiple, independent nodes |
| Temporary failures | Buffer payloads |
| Permanent failures | Backup restore + buffer payloads |
| Traffic spikes | Global and per-database rate limiting |
The diagram below illustrates the final version of our monitoring stack’s storage layer, backed by InfluxDB, with the three major components we added:
- Bufferson, an in-house buffered asynchronous HTTP proxy;
- custom health checks, to enable automatic removal of InfluxDB nodes from rotation while recovering data;
- reverse proxies, backed by nginx, to limit the number of HTTP requests a client can make in a given period of time.
Bufferson provides simple proxying capabilities for asynchronous buffered HTTP processing, using a queue (currently Amazon Simple Queue Service only) for temporary, highly available storage. It also implements traffic shadowing, a core feature that enables us to forward HTTP requests to multiple InfluxDB nodes. It consists of two independent components:
- replay — tries to forward HTTP requests directly to each upstream node, putting failed requests in a queue (buffer);
- recover — continuously processes the queue and tries to deliver the buffered requests.
- Generic HTTP asynchronous buffered proxy, not coupled to InfluxDB in any way.
- Both components scale horizontally.
- All writes are idempotent and commutative, so multiple queues can be used in parallel to make sure the buffer does not become a single point of failure.
- Data is lost if and only if the backend queue drops it.
- Requests may be delayed, but will always be delivered at least once.
- The HTTP request is replayed exactly as it was generated; nothing is added or removed:
  - no extra layers;
  - no added concerns, namely around security with usernames and passwords;
  - all HTTP features, including basic authentication, are supported.
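Bufferson is not open source yet, so the sketch below is illustrative rather than its actual implementation: a minimal Python model of the replay/recover pair, with an in-memory queue standing in for SQS and a pluggable `forward` function standing in for the HTTP call.

```python
import queue

def replay(request, upstreams, forward, buffer):
    """Try to forward the request to every upstream InfluxDB node.

    A failed delivery is buffered together with the target hostname,
    so that recovery replays it only on the node that failed.
    """
    for host in upstreams:
        try:
            forward(host, request)
        except ConnectionError:
            buffer.put((host, request))

def recover(buffer, forward):
    """Drain the buffer once, re-queueing requests whose target is still down."""
    for _ in range(buffer.qsize()):
        host, request = buffer.get()
        try:
            forward(host, request)
        except ConnectionError:
            buffer.put((host, request))  # retried on the next pass: at-least-once delivery
```

In production the buffer is SQS rather than `queue.Queue` and `recover` runs continuously; because writes are idempotent and commutative, the occasional duplicate delivery this implies is harmless.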
While write requests are sent to Bufferson, queries go through a load balancer which then forwards them to InfluxDB instances. This provides a simple and elegant way of distributing load among the pool of servers and handling the failure of isolated nodes, but it does require each server to expose an endpoint that can be used to evaluate its health.
InfluxDB exposes /ping, which is helpful to verify the service is up and running, but that alone is not enough to confidently put an instance back in rotation. In fact, we need to ensure that while a node is recovering from temporary failures and buffered data is still being flushed, it does not process any queries.
We address this by running a local daemon on each InfluxDB instance that performs two checks and returns 200 OK if and only if both succeed:
- a call to InfluxDB’s /ping endpoint;
- Bufferson reports the node is not recovering data.
The health check runs locally on each node; the load balancer uses its result to put the node in or out of rotation for queries.
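The daemon's decision logic can be sketched as follows; `ping` stands in for the call to InfluxDB's /ping endpoint and `recovering` for the query against Bufferson's recovery status (the function names are illustrative, not a published API):

```python
def health_status(ping, recovering):
    """Compute the HTTP status for the local health endpoint.

    `ping` returns True when InfluxDB answers /ping; `recovering`
    returns True while Bufferson still has buffered requests pending
    for this node. Only a 200 puts the node in the load balancer's
    rotation for queries; a 503 keeps it out.
    """
    try:
        up = ping()
    except OSError:  # connection refused, timeout, etc.
        up = False
    return 200 if up and not recovering() else 503
```

Serving this from a tiny local HTTP daemon is enough for any standard load balancer's health probe.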
Some clients have quite spiky access patterns, and we wanted to ensure a reasonably smooth load distribution to avoid overloading InfluxDB.
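The rate limiting itself is handled by nginx's stock limit_req module. The post does not include our actual configuration, so the following is only a minimal sketch; the zone names, rates, and upstream are assumptions:

```nginx
# Two zones: a constant key gives a single global bucket, while keying
# on $arg_db limits each database separately (InfluxDB's /write and
# /query endpoints pass the database as the "db" query parameter).
limit_req_zone global  zone=influx_global:1m   rate=2000r/s;
limit_req_zone $arg_db zone=influx_per_db:10m  rate=500r/s;

server {
    listen 8086;

    location / {
        limit_req zone=influx_global burst=1000;
        limit_req zone=influx_per_db burst=200;
        proxy_pass http://influxdb;  # upstream block defined elsewhere
    }
}
```

Requests above the rate are delayed up to the burst size and rejected beyond it, which is what smooths out spiky clients.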
Every 10 minutes, a background job uses rsync to synchronize InfluxDB's data directory to a highly available network file system.
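As a sketch, such a job can be as simple as a cron entry; the paths below are illustrative (the actual mount point is specific to our infrastructure), though /var/lib/influxdb is InfluxDB's default data location:

```shell
# Every 10 minutes, mirror InfluxDB's data directory onto a mount of
# the highly available network file system, one subdirectory per host.
*/10 * * * *  rsync -a --delete /var/lib/influxdb/ /mnt/ha-backups/$(hostname)/
```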
Besides recovering from a major disaster, backups are used to bootstrap new nodes when expanding a cluster.
Temporary failures are handled by Bufferson and the health checker. When failing to forward requests to a given node, bufferson-replay will put those requests in the queue (each request includes the target hostname so that it is only replayed on the failed node); bufferson-recover continuously pulls items from that queue and will deliver them once the failed node is available again. The health checker ensures that node will not respond to any queries before all buffered data has been successfully delivered.
We have seen this happen multiple times, for brief periods (seconds) at a time. No human intervention has been necessary in the last 9 months of operating this system.
The process of adding a new node is reasonably straightforward:
- launch the instance (we assume some automated process will bootstrap it and handle configuration management);
- add it to Bufferson, which will immediately start to buffer requests for it;
- restore a backup;
- start InfluxDB (at this point Bufferson will begin delivering the requests that were buffered while the backup was being restored).
This process may take several hours, depending on the size of the data set, and can be optimized in a number of ways but that discussion is left out for the sake of brevity.
Conclusions and future work
We have implemented a highly available deployment of InfluxDB by leveraging an in-house HTTP proxy with buffering capabilities, standard load balancing, a custom health checker, and nginx’s rate limiting module. It has proven very effective at keeping the storage layer of our monitoring stack available during transient errors on the underlying infrastructure (disk drives, networking, etc.), prolonged outages of isolated nodes, and upgrades.
During a disaster recovery drill we purposely put one node offline for 6.5 hours causing Bufferson to queue more than 3 million HTTP requests, each containing up to 5000 data points. When that server was brought back online, it immediately started accepting write requests, and all buffered data points were successfully replayed. Once the recovery process completed, the load balancer put the node back online and it started to process queries. All of this happened with no impact on the service, the outage went unnoticed by clients, and no human intervention was required to bring the node back online.
While this approach does not fully solve the problem of unbounded horizontal scalability, it does provide linear scalability on reads by evenly distributing query load among all available nodes.
We have plans to open source Bufferson soon, so stay tuned!
Does working on this type of problem sound interesting? Are you interested in helping us scale this and other key components of our core infrastructure? We’re hiring!