Postmortem — S3 Outage

Several AWS services were degraded last night and broke a ton of stuff on the net. We survived, but our static assets are hosted on S3 and served straight out of US-STANDARD, so there were some problems.

[Image: Twitter status update]

Original incident: http://status.status.io/pages/incident/51f6f2088643809b7200000d/55c8514ff3bc431d5c000643

What happened:

The first alarm came via PagerDuty from a Pingdom transaction check, which indicated that a test login failed.

Immediately we realized that the static resources (images, scripts) hosted on Amazon S3 were sporadically failing to load. A quick manual test of the S3 connection confirmed it was broken.

The AWS status page still didn’t report any issues. Before digging any further, we scanned the interwebs and found a new thread on Hacker News. Someone there even referenced a planned maintenance that correlated with the timing.

[Image: Hacker News comment]

Just as we were ready to contact AWS, the status page was updated to indicate that they were investigating the issue. S3 was fully operational within about four hours.

[Image: AWS S3 status updates]

Impact:

  • Degraded user experience due to randomly missing images, scripts or styles
  • Slow page loads
  • Broken subscription modal (we were rightfully called out on this one)

[Image: user complaint]

Next steps:

Our static assets already reside on S3 in multiple regions. However, the bucket name is loaded from a config file, so changing it requires a code deployment, and deploying is not something we want to do while AWS is in an unstable state.

We learned that we must be able to change the location of our static assets with the flip of a switch in production. Ideally we would use DNS to handle this change automatically.
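As a rough illustration of such a switch, here is a minimal Python sketch that reads the active asset origin from a small JSON file at request time instead of from deploy-time config. Everything named here is an assumption made for the example: the origin URLs, the `origin` key, and the idea of a switch file are hypothetical and are not taken from our actual setup.

```python
import json

# Hypothetical origins; the real bucket names and regions are not given in this post.
ASSET_ORIGINS = {
    "us-standard": "https://assets-us.example.com",
    "eu-west": "https://assets-eu.example.com",
}
DEFAULT_ORIGIN = "us-standard"


def active_origin(switch_path):
    """Return the origin key named in the switch file, or the default."""
    try:
        with open(switch_path) as f:
            data = json.load(f)
    except (OSError, ValueError):
        # Missing, unreadable, or malformed switch file: use the default.
        return DEFAULT_ORIGIN
    if not isinstance(data, dict):
        return DEFAULT_ORIGIN
    name = data.get("origin", DEFAULT_ORIGIN)
    # Unknown origin names are ignored rather than trusted.
    return name if name in ASSET_ORIGINS else DEFAULT_ORIGIN


def asset_url(path, switch_path):
    """Build a full asset URL using whichever origin is currently active."""
    base = ASSET_ORIGINS[active_origin(switch_path)]
    return base + "/" + path.lstrip("/")
```

With this shape, pointing the assets at another region means rewriting one small file on the running hosts, with no deployment. The same lookup could just as well read from a database row or a feature-flag service.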

In addition, we’re planning to expand our CDN providers and also support serving the assets from our own cache.
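One way to picture serving from several providers is a preference-ordered list of origins with a health probe, where the first origin that answers wins. The Python sketch below is only an illustration under stated assumptions: the function names, the `/health.txt` probe path, and the example URLs are hypothetical, and a production setup would more likely do this at the DNS or CDN layer than in application code.

```python
import urllib.request
from typing import Callable, Iterable, Optional


def http_probe(origin: str, path: str = "/health.txt", timeout: float = 2.0) -> bool:
    """Return True if a HEAD request to origin + path gets a 2xx response.

    The probe path is a hypothetical small object that would have to exist
    on every origin.
    """
    req = urllib.request.Request(origin + path, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # Covers URLError, HTTPError, and socket timeouts.
        return False


def first_healthy(origins: Iterable[str],
                  probe: Callable[[str], bool] = http_probe) -> Optional[str]:
    """Return the first origin, in preference order, that passes the probe."""
    for origin in origins:
        try:
            if probe(origin):
                return origin
        except Exception:
            # A probe that blows up counts as unhealthy; try the next origin.
            continue
    return None
```

Passing the probe in as a parameter keeps the selection logic testable without touching the network, for example `first_healthy(origins, probe=lambda o: o.endswith("cdn.example.com"))`.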

Version 0.7.6

[Improvement] Added ability to modify the time for previous incident or maintenance status updates

[Improvement] Datepicker for scheduled maintenances only shows dates after the start date

[Bug] Fixed metrics timezone issue

[Bug] Fixed billing view to display current plan

[Bug] Fixed a bug that caused 404 errors for a missing incident or maintenance to be redirected to an undefined URL

Version 0.7.5

[Feature] Ability to backfill historic incidents and maintenances

[Feature] Preview HTML notification templates

[Feature] Ability to set default All Systems Operational text

[Improvement] Last updated timestamps are now included in the Status API

[Improvement] Improved custom SSL certificate settings

[Bug] Fixed a maintenance automation bug that was causing all components to be affected

[Bug] Fixed a bug causing errors from PagerDuty status updates

[Bug] Fixed an issue with the maintenance scheduling date pickers

[Bug] Fixed a bug that prevented the removal of an existing custom domain

Version 0.7.4

[Improvement] Custom email sending domain SPF and DKIM records are now checked automatically and the status is displayed near the custom email address settings

[Improvement] Ability to add a link to external postmortems

[Improvement] Added new variables for inserting the event URL into a notification message (incident_url, maintenance_url)

[Improvement] Maintenance automation interval set to 1 minute

[Improvement] Team member invites now use an invite code instead of email address validation

[Improvement] Standardized on lower-case email addresses for users and subscribers

[Improvement] Prevent blank names for components or containers

[Improvement] “Custom Message Subject” is now a static field when creating incidents or maintenances

[Bug] Fixed various minor CSS bugs

[Bug] Fixed SMS validation bug

[Bug] Location maps are no longer visible if no locations are linked to containers