The Great S3 Outage of July 2008

When S3 launched less than 2.5 years ago, it was clear that it was a game-changer as far as storage infrastructure was concerned. Today’s outage, however, really reveals just how dependent many webapps have become on the service.

I use S3 for personal storage, as well as for static resources like my tumblr music widget (which does ~500k hits per month). For roughly $2/mo, I get performance and reliability that are at least on par with what I could do on my own for several times the cost. I don’t carry a pager monitoring my personal web assets, and if it’s down for a few hours once or twice a year, I can live with that.

However, we also use S3 for business purposes, serving our video ad player and media assets, and there this outage is more of a concern. I’m sure we will have a big conversation tomorrow about mitigating the effects if this happens again.

Personally, I am of the opinion that cloud storage is the best choice for us, and that it will only increase in reliability and performance over time. CDNs are still prohibitively expensive for our needs, but we may need to bite the bullet at some point, as we get bigger and have our own SLAs to adhere to. Ultimately though, it comes down to having a disaster recovery plan, and being ready to act on it.

So what will we do to be prepared for next time?

  1. Handle any detectable errors gracefully. Unfortunately, our customer-facing endpoint (our ad player) is hosted directly on S3, and often has to be embedded directly, so there isn’t much we can do if the user can’t download it. It is, however, in a different bucket than our media assets.
  2. Drop DNS TTLs on our ad/asset hosts down to the smallest reasonable time (probably 5 minutes). Our DNS is distributed enough to handle the additional load without problem.
  3. Maintain a local copy of all S3 content, with matching taxonomy.
  4. Set up a basic static content server that can serve as a failover for the S3 buckets.
  5. Manually fail over our DNS to our backup servers as soon as a major outage is detected.
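
To make steps 3 and 5 a bit more concrete, here is a rough Python sketch, not what we actually run. It assumes "matching taxonomy" means the local mirror keeps the same bucket/key layout as S3, and that a "major outage" can be approximated as several failed checks in a row. The names `local_mirror_path`, `probe`, and `OutageDetector`, along with the mirror root, bucket name, and threshold, are all made up for illustration.

```python
import urllib.error
import urllib.request
from pathlib import PurePosixPath
from typing import Callable


def local_mirror_path(mirror_root: str, bucket: str, key: str) -> str:
    """Map an S3 bucket/key to a path under the local mirror.

    Assumption: the mirror reproduces the S3 layout as <root>/<bucket>/<key>,
    so the backup server can serve the same URL paths after a DNS switch.
    """
    # Strip any leading slash so the key cannot reset the path to the root.
    return str(PurePosixPath(mirror_root) / bucket / key.lstrip("/"))


def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET of `url` completes with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # HTTP errors, DNS failures, refused connections, and timeouts
        # all count as a failed check.
        return False


class OutageDetector:
    """Flags an outage after `threshold` consecutive failed checks."""

    def __init__(self, check: Callable[[], bool], threshold: int = 3):
        self.check = check          # any zero-argument callable returning bool
        self.threshold = threshold  # hypothetical default; tune to taste
        self.failures = 0

    def poll(self) -> bool:
        """Run one check; return True once the failure streak hits the threshold."""
        if self.check():
            self.failures = 0       # a single success resets the streak
        else:
            self.failures += 1
        return self.failures >= self.threshold
```

In practice something like `OutageDetector(lambda: probe(asset_url))` would be polled from cron or a small loop, with `asset_url` pointing at a known object in the media bucket. When `poll()` returns True, a human still makes the call and flips the DNS by hand. Requiring a streak of failures rather than a single one keeps a momentary blip from paging anyone.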

The goal isn’t for this setup to be more reliable than S3, but to reduce the risk of a complete outage (like today) by having redundant options available.