Over the 4 years I’ve been at Thumbtack, our engineering infrastructure has changed a lot. We’ve completely transitioned our cloud provider from SoftLayer to Amazon Web Services (AWS) & Google Cloud Platform (GCP), built our data infrastructure from the ground up, made big steps in migrating to backend services, built a model serving infrastructure, built our own push notification delivery service, migrated > 90% of our iOS codebase to Swift, built two Android apps, and, oh, completely overhauled how our marketplace works.
While we work hard to create systems that are simple, reliable, and robust, things will still go wrong. We’ve broken features when removing a feature flag, corrupted (and then fixed) our scheduling database, forgotten to renew certificates, and been attacked by a botnet. Our engineering team has also grown 5X since we started this process, meaning lots more code getting checked in every day and lots more opportunities to make mistakes.
This is typically where teams use some sort of retrospective or postmortem process, and we're no exception. For people new to authoring an incident postmortem, the task may seem daunting, as it feels like an admission that something went wrong. However, a postmortem is not intended to place blame on anyone. It is instead a way for the people involved and the team to learn how to keep the incident from recurring, how to recover if it does happen again, and to plan future action items that mitigate the risks.
Over the last few years, we’ve authored about 150 postmortems and developed a good habit of regularly asking “does this incident need a postmortem?” after one occurs. I’m also happy to report that despite the team growth, the frequency of review meetings has remained constant! Presumably, this is due to the lessons learned through the postmortem process.
Similar to how we shared our structured hiring process, we’d like to share our blameless incident postmortem process. We’ve used the current iteration of this process with minor modifications for the past 3 years.
Note that this is a snapshot of our actual internal documentation so there are references to Slack, Jira, and Confluence that you can replace with whatever tools you’re using. Just like any process we’ve posted, this is a living process and we’re constantly working on improving it as we use it.
Writing a Postmortem
When Do You Write a Postmortem?
The simplest of guidelines is: anything bad you don’t want to happen again.
- Think about the impact and what we can learn from the postmortem:
  - What were the business costs ($)?
  - What was the impact on users, both internal and external?
  - What underlying problems were exposed that could lead to future issues?
- Common topics for postmortems:
  - Rolled-back deploys
  - Failures of production functionality
- What you don't want to write a postmortem for:
  - Things with very low impact (a minor bug that only affected one or two people)
  - Minor recurring issues that you've just found out about (e.g., a tracking server needs to be restarted every other day to release RAM)
  - Graphical glitches, minor UI bugs, and things that only affect minor browsers
If you are still unsure whether you should write a postmortem, ask the postmortems mailing list.
The overall timeline for a postmortem:
- During the incident: take notes
- Within 24 hours: write the postmortem
- Within 1 week: review and finalize the postmortem
During the Incident
No one is going to be or should be in the postmortem mindset while trying to quickly resolve an incident. It is highly recommended to take lots of notes during or directly after the incident and record them in a timeline as they occur. Pop open a text document and write notes of things you’ve tried and people you’ve talked to. They can be useful if only to prevent trying the same idea too many times. These notes are also critical in analyzing things like business impact, incident response time, etc. later.
Immediately after the Incident
Assign an owner to the postmortem immediately after the issue is fixed.
Postmortems are very time sensitive: the longer you take, the more you'll forget. A good postmortem is as valuable as resolving the incident itself because it helps prevent future issues.
Write it immediately after resolving the incident. If it is not possible to do so right afterwards (say it is 3:00 in the morning), finish it within 24 hours of resolving the incident.
Who Writes the Postmortem?
Who writes the postmortem is decided by a combination of who caused the incident and who fixed it. This is not about placing blame by making someone write it. Whoever was most deeply involved should write it, as they will write the most insightful postmortem.
Priority of Writing the Postmortem
Writing a postmortem is top priority: cancel meetings, delegate interviews to others, and put current work on hold.
Reviewing the Postmortem
After the postmortem is written (within 24 hours), it needs to be reviewed. It should be reviewed and finalized within 1 week.
- Send all drafts to the postmortems mailing list.
- Create a new wiki page for the postmortem and use the Confluence commenting workflow for receiving feedback and iterating.
- Link to the postmortem from the main postmortems wiki page, and make that page the parent of the new postmortem page.
- We have a standardized time for postmortem reviews, every Tuesday from 16:40 – 17:30 (on everyone’s calendar).
- The owner must link the postmortem on the postmortem reviews agenda (the document is on the calendar invite).
Finalizing & Following Up
Once the postmortem review meeting has completed, the postmortem owner should create Jira issues for each of the action items in the appropriate team projects and link to them in the postmortem document. Recent postmortems are checked regularly, and people are poked if there's no progress within a week. Remember that postmortem action items should take priority over everyday work, so make sure the work is actionable, achievable, and reasonably scoped.
Title
The title starts with the date in YYYY-MM-DD format, followed by a short description well-suited for an email subject.
2015-03-03 Bid creation failed on mobile
Owner(s) and Reviewers
List the owner(s) and reviewers of the postmortem to make it easy to find a knowledgeable person if something similar happens in the future. They are not the people to blame for the incident.
Summary
Write the summary for anybody at Thumbtack to read and come away with a general understanding, whereas the rest of the document may be technical in nature.
The summary is a simple paragraph that includes:
- Date of the incident
- Duration of the incident
- Affected services
- The business impact (revenue lost, bids missed, etc…)
- Root cause(s)
- Resolution (short term fixes, long term fixes, etc)
On 2015-03-02 at 10:00 PST some users pointed out that Prospect was down. The issue was mapped to a failure in the single-sign-on (SSO) service and traced to a recent change in the website code base, which the SSO service needs to stay in sync with. After several failed attempts to address the problem, a patch that removes the need to keep the two in sync was pushed at 15:00 PST.
Timeline
The timeline breaks down the series of events from the first alert until the issue was resolved. Each entry starts with the time of the event. If the timeline runs across more than a single day, use headings for the days to break up the timeline.
Curate the information by grouping and digesting to improve readability. A use-case is condensing a Slack conversation into a single timeline entry instead of listing each response as a separate event.
- 10:00 PST – It was mentioned on the eng Slack channel that Prospect was still inaccessible and the lack of metrics was blocking the SEO team. It was also pointed out that someone had already tried to restart it, but that did not solve the problem.
- 10:27 PST – The discussion was moved to the emergency Slack channel and the eng mailing list was notified of the issue. The error logs on frankie showed that there was an issue with the authentication mechanism, not Prospect itself.
- 10:53 PST – An attempt to roll back the SSO service to an earlier version (a new version had been published the Friday before) did not solve the problem.
Root Cause Analysis
The root cause analysis is a straightforward description of the core cause of the problem, not all the events leading up to it. Don't sugarcoat your analysis, but do avoid using people's names.
Frequently we don't actually know the root cause, and we tend to speculate about the reason. Do not report speculations, guesses, "I think", etc. as fact. If you don't know, simply state it. Your speculations and questions may still be valuable; in that case, state them clearly as open questions.
We suspect the root cause lies in how JSON encoding changed between PHP 5.90 and 5.89, because the problem was resolved when we rolled back the PHP version.
We do not know the root cause of the issue. Could it be the changes to JSON encoding in PHP 5.90 vs 5.89?
The PHP code base is used to generate users and update passwords. The hash of the passwords is stored in the database and then used by the SSO service to authenticate users. This means that both PHP and Python must be using the same versions of the same cryptographic algorithms at all times.
An update on the PHP version bumped the bcrypt algorithm from version “2a” to “2y”, making it incompatible with the available implementation on the Python side which resulted in “Invalid salt” errors and failed authentication attempts.
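To make the mismatch concrete, here is a minimal Python sketch of why the upgrade broke authentication. The hash value and the prefix check are illustrative placeholders, not our actual SSO code:

```python
# PHP's password_hash() began emitting the "2y" bcrypt variant after the
# upgrade, while the Python side only recognized the older "2a" variant.
stored_hash = "$2y$10$..."  # illustrative placeholder, not a real hash

SUPPORTED_PREFIXES = ("$2a$",)  # what the Python implementation understood

def can_verify(hash_str: str) -> bool:
    # A bcrypt implementation limited to "2a" rejects "2y" hashes with an
    # "Invalid salt" error before it ever compares passwords.
    return hash_str.startswith(SUPPORTED_PREFIXES)

print(can_verify(stored_hash))  # False -> every authentication attempt fails
```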
What was the business/customer impact?
There are several reasons we talk about the business and customer impact. Postmortems almost always have action items, and these action items need to be prioritized. By quantifying the impact, we can make better decisions about which work gets prioritized.
Additionally, there could be long-term impact that results in increased tickets to customer support. It is a good idea to be aware of how many users were affected.
Questions to consider answering:
- What percentage of Y didn't work, and for what percentage of the audience?
  - 100% of iOS Pro App users…
- Which features didn't work as a result?
  - …were unable to submit bids…
- How long was X down?
  - …for 1 hour.
- How much money did we lose, approximately?
  - …X bids failed to be created, which at an average of $Y revenue per bid means there was roughly $Z of lost revenue
- How many customer support tickets were generated? (How much does each ticket cost?)
  - …about 14 customer tickets were created…
- What were the less tangible impacts?
  - …there was a less tangible impact in the form of some loss of trust…
100% of iOS Pro App users were unable to submit bids for 1 hour. Usage of the app is still low, so fewer than X bids failed to be created, which at an average of $Y revenue per bid means there was roughly $Z of lost revenue. Additionally, about 14 customer tickets were created about the issue, and there was a less tangible impact in the form of some loss of trust among those (and the other affected) users in Thumbtack's reliability.
Where do these numbers come from?
The conversion numbers like revenue per bid are all available in the daily and monthly tracking company dashboards.
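Turning those dashboard numbers into a dollar estimate is simple arithmetic. A minimal sketch, where every value is a made-up placeholder:

```python
# Back-of-the-envelope impact estimate; all numbers below are hypothetical.
failed_bids = 120        # bids that failed during the outage (X)
revenue_per_bid = 4.50   # average revenue per bid, from the dashboards (Y)
support_tickets = 14     # support tickets created about the issue
cost_per_ticket = 8.00   # estimated handling cost per ticket

lost_revenue = failed_bids * revenue_per_bid  # the $Z in the summary
support_cost = support_tickets * cost_per_ticket

print(f"~${lost_revenue:,.2f} lost revenue, ~${support_cost:,.2f} support cost")
```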
What could be better?
Depending on the incident, you’ll have several ways you may want to author this section:
- Identifying the failed system or processes and how to improve them.
- Here you’ll add the fixes – you may break them up into short and long term fixes.
- Explain what we could have done to prevent the issue to begin with.
- Explore automated responses to prevent a similar event, or ways to be notified of one earlier.
One approach: first identify what could be better, then follow with fixes:
- We should have been able to catch the problem before release. It was not subtle and affected core functionality of the pro app.
- Even when the change got out, we should have had mechanisms to inform us of a problem immediately. Again, this was not a subtle problem, core functionality broke.
So how do we fix these?
1) Better Automated Testing
This breaking change was not caught with any existing automated testing. …
2) Better Monitoring
In addition to not catching this beforehand, there were no automated notifications…
Another approach: first explain the immediate fix, then follow with what to do to prevent it from happening in the future:
What was done to fix it?
Instead of depending on having the same version of the bcrypt algorithm, the Python method that validates a user’s password now launches a subprocess to use PHP’s password_hash function. This ensures that the same code is used at all times.
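A minimal sketch of that approach, assuming a hypothetical verify.php helper that reads a password and a stored hash from stdin and prints "1" if PHP's password_verify() accepts them (this is not our production code):

```python
import subprocess

def verify_password(password: str, stored_hash: str) -> bool:
    # Delegate the hash comparison to PHP so both code bases always run
    # the exact same bcrypt implementation. verify.php is hypothetical:
    # it reads two lines from stdin and prints "1" on a match.
    result = subprocess.run(
        ["php", "verify.php"],
        input=f"{password}\n{stored_hash}\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip() == "1"
```

Shelling out to PHP on every login trades some latency for the guarantee that the two stacks can never drift out of sync again.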
What could we have done to prevent it?
- Better testing on all services that depend on user creation and management.
- Better design of the SSO service, not making it mutually dependent with the PHP code base.
- Use an up-to-date OS so the packages we depend upon are not outdated
What went well?
Even in a downer of a document, we should still point out what went well during the incident. To help keep the document fact-based rather than blame-based, don't include people's names or refer to teams.
- What processes or systems worked?
- What other systems do we have in place that helped minimize the impact?
Monitoring is awesome! We could see that the tracking system failed to store events during this time, and once the bug with the tracking error reporter is fixed, we will be able to see why in the error logs. Here is what the failed events metric looked like (note the y-axes don't match).
Action Items
List the concrete next steps we need to take as a result of this incident, and link each step to its Jira issue. It is recommended to assign an owner for each task; the owner is responsible for the item getting done even if they aren't the one executing on it.
Making sure you have considered these 4 questions in your action items tends to be a good litmus test:
- Have we put the correct checks in place to make sure the specific issue will not happen again?
- Is there anything that would have made it easier to debug the issue or fix it faster?
- Is there anything we could have done to catch the issue sooner?
- Is there anything we can do to prevent issues like it?
This is a list of concrete next steps that we should take as a result of this incident. If you're assigned to an item, you don't necessarily need to be the one to do it, but you are responsible for it getting done. These items are in rough priority order.
- (jones) Add an alert for 422’s for the tophat bid create endpoint [TASK123]
- (jones) Add a monitoring rule for mobile bid creation and alert if it flat lines (this is really important on the impact versus effort graph). [TASK124]
- (tgnourse) Finish getting Tophat integration tests running continuously on Jenkins [TASK125]