When I joined Thumbtack in late 2015, we had continuous delivery infrastructure for monolith builds. As more engineers joined, we noticed that a significant amount of time went to deploying the latest build. Moreover, deploys were trending larger (so-called train deploys), which made rollbacks harder. It was a clear sign that we needed to invest in the deployment pipeline.
At Thumbtack, we use Gerrit for code review. As soon as a code review is submitted, Jenkins starts the build process. The source code is fetched from Gerrit’s Git repository and a series of unit tests is executed. If there’s a failure, the build process is aborted and the faulty build is discarded. If the unit tests pass, the next phase of the build is triggered: integration tests. Again, in case of failure, the build is discarded. Otherwise, the build is marked as safe to deploy and becomes available for deployment.
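The gating described above can be sketched in a few lines of Python. This is only an illustration of the fail-fast ordering, not the actual Jenkins configuration; `BuildStatus`, `run_build`, and the lambda stand-ins for the test suites are hypothetical names.

```python
from enum import Enum
from typing import Callable, Sequence, Tuple


class BuildStatus(Enum):
    DISCARDED = "discarded"
    SAFE_TO_DEPLOY = "safe to deploy"


def run_build(phases: Sequence[Tuple[str, Callable[[], bool]]]) -> BuildStatus:
    """Run phases in order; stop and discard the build at the first failure."""
    for name, phase in phases:
        if not phase():
            print(f"{name} failed, discarding build")
            return BuildStatus.DISCARDED
    return BuildStatus.SAFE_TO_DEPLOY


# Hypothetical stand-ins for the real unit and integration test suites.
status = run_build([
    ("unit tests", lambda: True),
    ("integration tests", lambda: True),
])
print(status.name)  # prints SAFE_TO_DEPLOY
```

Because the loop returns on the first failing phase, integration tests never run against a build whose unit tests failed.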
Continuous deployment scoping
We kept a few things in mind when building continuous deployment:
- engineers should be able to take full control over deployment (e.g. manually deploy a specific version to canary/production)
- an easy way to stop the current deployment, and an easy rollback process
- an automated gradual rollout process
- prevent commit pile-up (more than 3-4 commits being rolled out at once), since a smaller batch makes it easier to pinpoint the faulty commit
- prevent commit submission during restricted times (9pm to 9am on weekdays, all weekend, and whenever the office is closed), with an exception for critical bug fixes
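The restricted-time rule in the last item is easy to get subtly wrong, so here is a minimal Python sketch of it. It assumes timestamps are already in the office's local time and that office closures are passed in as a flag; `is_restricted` and `may_submit` are hypothetical names, not part of our tooling.

```python
from datetime import datetime, time

RESTRICTED_START = time(21, 0)  # 9pm
RESTRICTED_END = time(9, 0)     # 9am


def is_restricted(now: datetime, office_closed: bool = False) -> bool:
    """Return True if `now` (office local time) falls in a restricted period."""
    if office_closed:
        return True
    if now.weekday() >= 5:  # Saturday is 5, Sunday is 6
        return True
    current = now.time()
    # The weekday window wraps around midnight: 9pm-midnight or midnight-9am.
    return current >= RESTRICTED_START or current < RESTRICTED_END


def may_submit(now: datetime, critical_fix: bool = False,
               office_closed: bool = False) -> bool:
    """Critical bug fixes are allowed even during a restricted period."""
    return critical_fix or not is_restricted(now, office_closed)
```

For example, `may_submit(datetime(2017, 3, 1, 12, 0))` is `True` (a Wednesday at noon), while the same call for 9pm that day is `False` unless `critical_fix=True` is passed.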
Gerrit plugin to prevent Code Review submission
As mentioned above, Gerrit is used for code reviews and for hosting the Git repository. Historically, engineers would learn about the restricted period during the on-boarding process. To deploy code, one had to look at a specific Slack channel and check whether there was a “lock” on the monolith repository. The lock was put in place manually when we entered the restricted period (office off-hours) or when a rollback and revert was in progress. However, on multiple occasions we saw engineers submit changes while the monolith repository was “locked”.
We decided to build a Gerrit plugin that allows engineers to properly lock any repository. The Gerrit UI shows the current lock status, and Gerrit rejects submissions from anyone other than the owner of the lock. An engineer can lock and unlock the repository by running a script, which in turn runs a Gerrit SSH command.
Moreover, we added a Jenkins job that locks the monolith repository automatically during off-hours.
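To make the lock semantics concrete, here is a small in-memory model in Python. It is not the plugin's code and does not use Gerrit's plugin API; `RepoLocks` and its methods are hypothetical, and the rule that only the owner may release a lock is an assumption on top of what is described above.

```python
from typing import Dict


class LockError(Exception):
    """Raised when someone other than the lock owner tries to change a lock."""


class RepoLocks:
    def __init__(self) -> None:
        self._owners: Dict[str, str] = {}  # repository name -> lock owner

    def lock(self, repo: str, user: str) -> None:
        owner = self._owners.get(repo)
        if owner is not None and owner != user:
            raise LockError(f"{repo} is already locked by {owner}")
        self._owners[repo] = user

    def unlock(self, repo: str, user: str) -> None:
        owner = self._owners.get(repo)
        # Assumption: only the owner may release the lock.
        if owner is not None and owner != user:
            raise LockError(f"{repo} is locked by {owner}")
        self._owners.pop(repo, None)

    def can_submit(self, repo: str, user: str) -> bool:
        """Submissions pass if the repo is unlocked or the user owns the lock."""
        owner = self._owners.get(repo)
        return owner is None or owner == user


locks = RepoLocks()
locks.lock("monolith", "alice")
print(locks.can_submit("monolith", "bob"))    # prints False
print(locks.can_submit("monolith", "alice"))  # prints True
```

Letting the lock owner keep submitting is what allows the person handling a rollback to land the revert while everyone else is held back.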
Continuous deployment Jenkins job
After that, we introduced a new phase of the build pipeline. It picks up the latest successful build, sets canary traffic to 0% and deploys the build to the canary cluster. Once the build is deployed, we gradually shift production traffic to the canary cluster over ~10 minutes. At that point, 100% of traffic goes to the canary cluster and the production cluster gets the new build.
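As an illustration of the gradual shift, the Python sketch below computes a traffic schedule. The linear shape and the ten equal steps are assumptions made for the example; only the roughly 10-minute duration and the 0% to 100% range come from the description above, and `ramp_schedule` is a hypothetical name.

```python
from typing import List, Tuple


def ramp_schedule(duration_s: int = 600, steps: int = 10) -> List[Tuple[int, int]]:
    """Return (seconds since start, canary traffic %) pairs for a linear ramp."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return [(duration_s * i // steps, 100 * i // steps) for i in range(steps + 1)]


for offset_s, percent in ramp_schedule():
    print(f"t+{offset_s:>3}s: {percent}% of traffic to canary")
```

With the defaults this yields eleven points, starting at 0% at t+0s and ending at 100% at t+600s, with a 10-point increase every minute.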
Engineers who submitted code are notified via Slack, in the deploy channel, that their changes are going out. If there’s a flood or spike of errors, an alert is triggered and posted to the deploy Slack channel. If an engineer notices something is off with the latest code (e.g. via an automated alert, error logs, or metric dashboards), they can easily take the lock, which stops the deployment. At that point, it’s up to the engineer to decide what to do: usually move all traffic off the canary cluster, revert the faulty commit, and unlock once a safe build is available.
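The way a lock interrupts a rollout can be sketched as a loop that checks the lock before every traffic change. This is a simplified Python model, not the Jenkins job itself; `roll_out`, `set_canary_traffic`, and `is_locked` are hypothetical names, and the callbacks stand in for whatever actually adjusts traffic and queries Gerrit.

```python
from typing import Callable, Iterable


def roll_out(percentages: Iterable[int],
             set_canary_traffic: Callable[[int], None],
             is_locked: Callable[[], bool]) -> bool:
    """Apply each traffic step in order; stop as soon as the repo is locked.

    Returns True if every step was applied, False if a lock interrupted it.
    Traffic is left where it was, so the engineer decides what happens next.
    """
    for percent in percentages:
        if is_locked():
            return False
        set_canary_traffic(percent)
    return True


applied = []
# Simulate an engineer taking the lock after two steps have been applied.
finished = roll_out([0, 25, 50, 75, 100], applied.append,
                    is_locked=lambda: len(applied) >= 2)
print(finished, applied)  # prints False [0, 25]
```

The sketch deliberately does not move traffic back on its own, matching the flow above where the engineer chooses between draining the canary, reverting, or another fix.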
Continuous deployment flow
This system has been in production for over three months and we’re pretty happy with it. However, there’s always room for improvement:
- have triggered automated alerts stop the rollout process; the open question is how to deal with false positives, e.g. due to networking issues
- reduce the time between code submission and the start of the canary process; currently we build the monolith code sequentially, and it can take over 20 minutes from submitting code to it being rolled out to the canary cluster
If any of those problems sound interesting to you, make sure to check out our open positions.