Coming Back Up is More Painful than Going Down

By Deane Barker on October 24, 2012

Lessons learned from the Amazon Web Services outage: This is a good post about what we should have learned from the recent AWS outage. Two of its lessons stand out.

The stress of failure will trigger a cascade of other failures. […] What started as a small issue affecting one Northern Virginia data center quickly spread, causing a chain reaction and outage that disrupted much of the Internet for several hours.

[…] Spikes matter. When a cloud fails, hundreds of customers are impacted. As they try to recover, they will be stressing the cloud provider’s infrastructure with a peak load that is guaranteed to cause even more problems.

That second one is the one people tend to overlook: the recovery itself creates a load spike.

Reddit recently replaced their “we’re down” screen, but the old one had a picture of the Reddit alien lying on the ground unconscious with another alien standing over it. The standing alien was holding a hammer, about to hit the prone alien with it. The hammer was labeled “F5,” the browser's refresh key.
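For what it's worth, one common way for a client to avoid being that hammer is to retry with exponential backoff plus random jitter, so that everyone doesn't come back at the same instant. The sketch below is only an illustration of that idea, not anything from the AWS post or from Reddit; the function name `backoff_delays` and its `base` and `cap` parameters are made up for this example.

```python
import random


def backoff_delays(attempts, base=1.0, cap=60.0, rng=None):
    """Return one randomized wait (in seconds) per retry attempt.

    Hypothetical helper for illustration: `base` and `cap` are assumed
    defaults, not values recommended by any particular provider.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        # Exponential ceiling: base, 2*base, 4*base, ... limited to `cap`.
        ceiling = min(cap, base * (2 ** attempt))
        # "Full jitter": pick a uniform wait in [0, ceiling] so that many
        # clients retrying at once spread out instead of arriving together.
        delays.append(rng.uniform(0, ceiling))
    return delays
```

A client would sleep for each of these delays between attempts instead of retrying immediately. The randomness is the point: without it, every client backs off on the same schedule and the spike just arrives in synchronized waves.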

Very few things ever come back cleanly.