Some more on the AWS Outage

SitePoint had an interesting take in its newsletter called “Two Important Lessons from the AWS Failure.” 

The first involves communication:

“The lesson here is clear—when you have any kind of crisis, communication with those affected is extremely important. In emergency mode, it may not be possible to pick up the phone to talk to a client or customer, but updating your website or changing the voicemail message can have a major impact.”

Tekpub was able to communicate through Rob’s blog (and, since he rejoined Twitter, I assume he used that medium as well, though I don’t use Twitter myself, so that’s only a guess; I also wouldn’t be surprised if he sent an email blast to his customer base once he recognized the seriousness of the issue).  I’m assuming their customers knew about those communication channels; if not, I have little doubt that they will in the future.  Having been bitten by these sorts of things in the past, I can say that the lessons learned from an incident like this tend to stick.

The second point, which I talked about previously here, involves having a contingency plan:

“As I was hearing of some extremely large websites being completely down due to the AWS outage, I couldn't help wondering why they built their systems without any redundancy or backup plan. Cloud computing is a relatively young industry, and although Amazon Web Services has been very reliable, failures happen.”

and

“One of the biggest advantages of cloud computing is its rapid scalability. It is entirely possible to setup two completely separate cloud environments, one at AWS and one at Rackspace for instance, and simply have one be a backup ready to be scaled up to production when a failure occurs (either manually or automatically).”

This wouldn’t address any potential data loss (which is what is scary about the AWS outage: some data was simply unrecoverable.  Ouch.), but it would allow you to get back online quickly.  As always, though, cost is a factor.
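To make that concrete, here is a minimal sketch of the kind of health-check-and-failover loop such a setup needs. This is my own illustration, not anything SitePoint or Tekpub describe; the URLs and the switch_to_standby helper are hypothetical placeholders, and the real switchover call depends entirely on your DNS or load-balancer provider.

# Minimal failover sketch (hypothetical URLs and helpers, not any particular
# provider's API): poll the primary environment, and if it stays down,
# point traffic at a warm standby running in a second cloud.
import time
import urllib.request

PRIMARY_HEALTH_URL = "https://www.example.com/health"      # primary, e.g. on AWS
STANDBY_HEALTH_URL = "https://standby.example.com/health"  # standby, e.g. on Rackspace

def is_healthy(url, timeout=5):
    # True if the endpoint answers with HTTP 200 within the timeout.
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

def switch_to_standby():
    # Placeholder: update DNS or a load balancer to send traffic to the
    # standby. The real call depends entirely on your provider's API.
    print("Failing over to standby environment")

consecutive_failures = 0
while True:
    if is_healthy(PRIMARY_HEALTH_URL):
        consecutive_failures = 0
    else:
        consecutive_failures += 1
        # Require several failed checks in a row so a single blip
        # doesn't trigger a full failover.
        if consecutive_failures >= 3 and is_healthy(STANDBY_HEALTH_URL):
            switch_to_standby()
            break
    time.sleep(60)  # check once a minute

Whether the switch is manual or automatic is a judgment call; the important part is that the standby already exists and the decision logic is worked out before the outage, not during it.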

ArsTechnica (among many others) gives a good summary of what happened:

“One factor contributing to the problems was that when nodes could not find any further storage to replicate onto, they kept searching, over and over again. Though they were designed to search less frequently in the face of persistent problems, they were still far too aggressive. This resulted, effectively, in Amazon performing a denial-of-service attack against its own systems and services. The company says that it has adjusted its software to back off in this situation, in an attempt to prevent similar issues in the future. But the proof of the pudding is in the eating—the company won't know for certain if the problem is solved unless it suffers a similar failure in the future, and even if this particular problem is solved there may well be similar issues lying latent. Amazon's description of the 2008 downtime had a similar characteristic: far more network traffic than expected was generated as a result of an error, and this flood of traffic caused significant and unforeseen problems.

Such issues are the nature of the beast. Due to their scale, cloud systems must be designed to be in many ways self-monitoring and self-repairing. During normal circumstances, this is a good thing—an EBS disk might fail, but the node will automatically ensure that it's properly replicated onto a new system so that data integrity is not jeopardized—but the behavior when things go wrong can be hard to predict, and in this case, detrimental to the overall health and stability of the platform. Testing the correct handling of failures is notoriously difficult, but as this problem shows, it's absolutely essential to the reliable running of cloud systems.”
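The fix Amazon describes, backing off instead of retrying aggressively, is a standard pattern. Here is a minimal, generic sketch of exponential backoff with jitter; it is my own illustration of the idea, not Amazon's code, and find_replication_target is a hypothetical stand-in for the operation that kept failing.

# Generic exponential backoff with jitter: the "back off under persistent
# failure" behavior the article describes.
import random
import time

def find_replication_target():
    # Hypothetical placeholder for the call that kept failing during the outage.
    raise RuntimeError("no free storage available")

def retry_with_backoff(operation, base=1.0, cap=300.0, max_attempts=10):
    # Retry the operation, doubling the maximum wait after each failure and
    # adding jitter, so many failing nodes don't all retry in lockstep and
    # flood the system the way the EBS nodes did.
    for attempt in range(max_attempts):
        try:
            return operation()
        except RuntimeError:
            delay = min(cap, base * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
    raise RuntimeError("gave up after %d attempts" % max_attempts)

# Usage: retry_with_backoff(find_replication_target)

The jitter matters as much as the doubling: without it, thousands of nodes that failed at the same moment would all wake up and retry at the same moment, too.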

One of the biggest selling points of cloud computing is the promise of cheap and scalable hosting.  When I was part of the group that supported eCommerce sites for the NBA, NASCAR and NHL, our infrastructure easily ran six figures for hardware alone.  I don’t know about you, but I don’t have that kind of coin.  The tradeoff is that you have no control over when or how things fail.

For a comprehensive list of links related to the outage, check out HighScalability.

posted on Friday, April 29, 2011 6:44 PM