Lessons learned from the Amazon Web Services outage: This is a good post about what we should have learned from the recent AWS outage. In particular, these two lessons are the most important.
The stress of failure will trigger a cascade of other failures. […] What started as a small issue affecting one Northern Virginia data center quickly spread, causing a chain reaction and outage that disrupted much of the Internet for several hours.
[…] Spikes matter. When a cloud fails, hundreds of customers are impacted. As they try to recover, they will be stressing the cloud provider’s infrastructure with a peak load that is guaranteed to cause even more problems.
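That second one is important: the recovery traffic is itself a load spike, arriving just when the service is least able to absorb it.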
Reddit recently replaced their “we’re down” screen, but the old one had a picture of the Reddit alien lying on the ground unconscious with another alien standing over it. The standing alien had a hammer and was about to hit the prone alien with it. The hammer was labeled “F5.”
Very few things ever come back cleanly.
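One common way for clients to avoid being that hammer is to retry with exponential backoff plus random jitter, so that many clients recovering at once spread their requests out instead of arriving in a synchronized spike. The following is only a minimal sketch of the idea, not anything taken from the linked post: the names `backoff_delay` and `call_with_retries`, the `fetch` callable, and the default values for `base`, `cap`, and `max_attempts` are all hypothetical.

```python
import random
import time


def backoff_delay(attempt, base=0.5, cap=60.0, rng=random):
    """Return a randomized wait in seconds before retry number `attempt` (0-based).

    The upper bound doubles with each attempt and is limited by `cap`;
    the actual wait is drawn uniformly from [0, bound] ("full jitter").
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def call_with_retries(fetch, max_attempts=6, base=0.5, cap=60.0,
                      sleep=time.sleep, rng=random):
    """Call `fetch()` until it succeeds or `max_attempts` is reached.

    `fetch` is a hypothetical zero-argument callable that raises on failure.
    Catching every Exception is a simplification for this sketch; real code
    would retry only on errors known to be transient.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception as exc:
            last_error = exc
            # Do not sleep after the final failed attempt.
            if attempt < max_attempts - 1:
                sleep(backoff_delay(attempt, base, cap, rng))
    raise last_error
```

The jitter matters as much as the backoff. If every client waits exactly 1, 2, 4, 8 seconds, they all come back at the same instants and the spike simply repeats; randomizing each wait smears the load across the interval.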