Handling Postmortem Reports

Incidents with severe impact usually deserve a public postmortem report. The best way to restore trust and delight your users is with an authentic, detailed postmortem report.

While some organizations add postmortems to their status pages, we found that most publish them on their primary company blog.

We made it simple to link directly to these external postmortems from within an incident.

There is no standard format for a postmortem report. Just be authentic and describe what happened, what was impacted, and what steps were taken to prevent it from recurring.

We’ve compiled some resources to help you get started crafting postmortem reports that will delight your users. Below are a ton of examples of actual postmortems from some of the most popular services on the internet.

Example Postmortem Reports

Amazon Web Services EU-West Outage

Backblaze Data Center Outage

BBC Online Outage

Buffer Security Breach

CircleCI Database Performance Issue

CloudFlare Outage

DigitalOcean Power Issue

DNSimple DDoS Attack

Docker Security Vulnerability

Dropbox Outage

Etsy Site Outage

Facebook Downtime

Foursquare Outage

GitHub DNS Outage

Google Services Outage

Groove 15-Hour Outage

Heroku Application Outage

HipChat Service Issues

Joyent US-East-1 Outage

Keen IO Performance Issues

LastPass Data Center Outage

Linode Data Center Outage

Mailgun Delays

MaxCDN DNS Outage

Microsoft Azure Storage Service Interruption

Netflix Christmas Eve Outage

npm Outage

PagerDuty Downtime

PageKite DNS Outage

Pixafy Network Errors

puush Security Breach

Qualys Cloud Suite Outage

Shopify Asset Server Outage

Simple Performance Issues

Spotify European Outage

Stack Overflow Offline

Status.io S3 Outage

Twilio Billing Incident

Tropo Outage

Vero DDoS Attack

Zello Capacity Issues

And finally, here are some good reads if you’re interested in improving your postmortem skills:

How To Apologize For Server Outages

System Down! An Application Outage Survival Guide

Documenting an outage for a post-mortem review

Blameless PostMortems and a Just Culture

Postmortem of an Outage

The Three Ingredients of a Great Postmortem

How to Write Great Outage Post-Mortems

Screencast – How to write an Incident Report / Postmortem

Postmortem — S3 Outage

Several services on AWS became degraded and broke a ton of stuff on the net last night. We survived, but our static assets are hosted on S3 and served straight out of US-STANDARD — so there were some problems.

Original incident: http://status.status.io/pages/incident/51f6f2088643809b7200000d/55c8514ff3bc431d5c000643

What happened:

The first alarm came via PagerDuty from a Pingdom transaction check, which indicated that a test login failed.

We immediately realized that the static resources (images, scripts) hosted on Amazon S3 were sporadically failing to load. A quick manual test of the S3 connection confirmed it was broken.

The AWS status page still didn’t report any issues. Before digging any further, we scanned the interwebs and found a new thread on Hacker News, where someone had even referenced a planned maintenance that correlated with the timing.

Just as we were ready to contact AWS, the status page was updated to indicate that they were investigating the issue. S3 was fully operational within about four hours.

Impact:

  • Degraded user experience due to randomly missing images, scripts or styles
  • Slow page loads
  • Broken subscription modal (we were rightfully called out on this one)

Next steps:

Our static assets already reside on S3 in multiple regions. However, the bucket name is loaded from a config file, so changing it would require a code deployment, and deploying is not something we want to do while AWS is in an unstable state.

We learned that we must be able to change the location of our static assets with the flip of a switch in production. Ideally we would use DNS to handle this change automatically.
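To make the "flip of a switch" idea concrete, here is a minimal sketch of how an asset URL could be resolved from a runtime setting instead of a deployed config file. The origin hostnames, the ASSET_ORIGIN setting name, and both helper functions are hypothetical and only for illustration; this is not how Status.io actually implemented it.

```python
import os

# Hypothetical origins; the real bucket names are not given in this post.
ASSET_ORIGINS = {
    "primary": "https://assets-us.example.com",
    "secondary": "https://assets-eu.example.com",
}


def asset_base_url(settings=None):
    """Return the asset origin selected by the ASSET_ORIGIN setting.

    `settings` is any mapping; it defaults to the process environment,
    but a feature-flag store would work the same way.
    """
    settings = os.environ if settings is None else settings
    choice = settings.get("ASSET_ORIGIN", "primary")
    # Unknown values fall back to the primary origin so that a typo
    # in the switch cannot break every page.
    return ASSET_ORIGINS.get(choice, ASSET_ORIGINS["primary"])


def asset_url(path, settings=None):
    """Build the full URL for a static asset path."""
    return f"{asset_base_url(settings)}/{path.lstrip('/')}"
```

Because the origin is looked up each time a URL is built, changing the setting moves all asset links to the other region without shipping new code.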

In addition, we’re planning to expand our CDN providers and also support serving the assets from our own cache.
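Falling back across several CDN providers and a self-hosted cache could look roughly like the sketch below. The function name, the origin list, and the health probe are all assumptions made for illustration; a real probe would issue an HTTP request with a short timeout.

```python
from typing import Callable, Iterable, Optional


def first_healthy_origin(
    origins: Iterable[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first origin whose probe succeeds, or None if all fail.

    Origins are tried in the order given, so the preferred provider
    should come first and the self-hosted cache last.
    """
    for origin in origins:
        try:
            if is_healthy(origin):
                return origin
        except Exception:
            # A probe that raises is treated the same as an unhealthy origin.
            continue
    return None
```

Passing the probe in as a function keeps the selection logic easy to test without touching the network.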

Version 0.7.6

[Improvement] Added ability to modify the time for previous incident or maintenance status updates

[Improvement] Datepicker for scheduled maintenances only shows dates after the start date

[Bug] Fixed metrics timezone issue

[Bug] Fixed billing view to display current plan

[Bug] Fixed a bug that caused 404 errors for a missing incident or maintenance to be redirected to an undefined URL

Version 0.7.5

[Feature] Ability to backfill historic incidents and maintenances

[Feature] Preview HTML notification templates

[Feature] Ability to set default All Systems Operational text

[Improvement] Last updated timestamps are now included in the Status API

[Improvement] Improved custom SSL certificate settings

[Bug] Fixed a maintenance automation bug that was causing all components to be affected

[Bug] Fixed a bug causing errors from PagerDuty status updates

[Bug] Fixed an issue with the maintenance scheduling date pickers

[Bug] Fixed a bug that prevented the removal of an existing custom domain