You're Probably to Blame (Too) if Amazon's Cloud Outage Knocked Your Site Offline

Everyblock is one of the sites that was knocked offline by Amazon's cloud outage last week. But refreshingly, they're not placing all the blame on Amazon. "Frankly, we screwed up," wrote Paul Smith, a member of the site's tech team, in a blog post Friday:

AWS explicitly advises that developers should design a site's architecture so that it is resilient to occasional failures and outages such as what occurred yesterday, and we did not follow that advice. . . . had we deployed our various servers across multiple AZs [availability zones] and taken into account the fact that individual servers and other services that AWS provide can and do go down from time to time, we would likely have remained available during this disruption.

Thank you, Paul Smith. Because yes, if you read Amazon's "AWS Web Hosting Best Practices" (PDF), it's pretty clear:

Availability Zones are distinct locations that are engineered to be insulated from failures in other Availability Zones and provide inexpensive, low latency network connectivity to other Availability Zones in the same Region. As can be seen in the AWS web hosting architecture, it is recommended to spread EC2 hosts across multiple Availability Zones since this provides for an easy solution to making your web application fault tolerant. Care should be taken to make sure that there are provisions for migrating single points of access across Availability Zones in the case of failure.
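To make that advice concrete, here is a minimal Python sketch of the idea. It is my own illustration, not Amazon's code or API: the host names, the zone names and both helper functions are assumptions chosen for the example. It spreads a handful of hosts round-robin across zones, then checks what is left when one zone goes dark.

```python
from collections import defaultdict
from itertools import cycle


def spread_across_zones(hosts, zones):
    """Assign hosts to zones round-robin so no single zone holds them all."""
    placement = defaultdict(list)
    for host, zone in zip(hosts, cycle(zones)):
        placement[zone].append(host)
    return dict(placement)


def survivors(placement, failed_zone):
    """Return the hosts still running after one zone goes offline."""
    return [
        host
        for zone, zone_hosts in placement.items()
        if zone != failed_zone
        for host in zone_hosts
    ]


# Hypothetical zone and host names, used only for illustration.
zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
hosts = ["web-1", "web-2", "web-3", "web-4"]

multi_az = spread_across_zones(hosts, zones)       # hosts in three zones
single_az = spread_across_zones(hosts, zones[:1])  # everything in one zone

print(survivors(multi_az, "us-east-1a"))   # ['web-2', 'web-3']
print(survivors(single_az, "us-east-1a"))  # []
```

The single-zone deployment loses every host in the failure; the multi-zone one keeps serving with reduced capacity. A real deployment would also need the "provisions for migrating single points of access" that Amazon mentions, such as a load balancer or DNS failover, which this sketch leaves out.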

I don't mean to let Amazon off the hook on this. A multi-day failure affecting multiple zones throughout its East region is a serious issue. They reported that "some multi AZ failovers are taking longer than expected" -- and customers who architected across multiple zones had a special right to be angry about that. On the other hand, customers were warned that if they want fault tolerance, they need to build in failover across more than one zone. If they didn't, and that zone then went offline and took a website down with it, whose fault is that?

"Just wanted to point out that @SimpleGeo remained up throughout EC2 outages as we're redundant across multiple AZs," Joe Stump tweeted during the outage, as Everyblock's Paul Smith pointed out.

The problem here isn't that "the cloud" can't be trusted. The issue is that properly deploying a cloud app isn't as easy as buying a few server instances. But then again, deploying a mission-critical app in your own data center isn't as simple as setting up a few servers either.

Cloud executive Ahmar Abbas (senior vp of cloud services at CSS Corp.) notes over at IT World that "Organizations that leverage native AWS capabilities, such as creating Amazon Machine Images (AMI) for all applications, utilizing snapshots and leveraging one or more of the other 4 geographically isolated AWS regions, can successfully weather these outages." Netflix -- an Amazon cloud customer that did not go down last week -- last December shared "5 Lessons We've Learned using AWS" including the issue of planning for failure.

"I knew to expect higher rates of individual instance failure in AWS," wrote John Ciancutti.

Your best bet is to build your systems to expect and accommodate failure at any level. . . . One of the first systems our engineers built in AWS is called the Chaos Monkey. The Chaos Monkey's job is to randomly kill instances and services within our architecture. If we aren't constantly testing our ability to succeed despite failure, then it isn't likely to work when it matters most in the event of an unexpected outage.
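For readers who want a feel for the approach, here is a toy Python sketch of the idea behind that kind of testing. It is not Netflix's Chaos Monkey and makes no claim about how theirs works: the fleet, the function names and the "at least three healthy instances" rule are all assumptions made up for the example. It randomly removes an instance from a simulated fleet and checks that enough capacity remains.

```python
import random


def chaos_monkey(instances, kill_count, rng):
    """Return the instances left after randomly terminating kill_count of them."""
    victims = set(rng.sample(sorted(instances), kill_count))
    return [instance for instance in instances if instance not in victims]


def service_is_up(instances, min_healthy=1):
    """Treat the service as up while at least min_healthy instances remain."""
    return len(instances) >= min_healthy


rng = random.Random(2011)  # seeded so a test run can be repeated
fleet = ["api-1", "api-2", "api-3", "api-4"]  # hypothetical instance names

for round_number in range(1, 4):
    remaining = chaos_monkey(fleet, kill_count=1, rng=rng)
    # Assumed requirement: the service needs three healthy instances.
    assert service_is_up(remaining, min_healthy=3), "lost too much capacity"
    print(f"round {round_number}: survived with {remaining}")
```

The point of running something like this constantly, rather than once, is the one Ciancutti makes: a failover path that is never exercised is one you can't count on.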

A big advantage of cloud computing is the ability to easily scale up more affordably. Yes, it can also be simpler if you want to quickly put up a non-critical application: an in-house demo of a project in the works, or even a part of a production application that doesn't really matter much if it goes offline for a while. But here are two lessons I'd take away from last week's outage. Moving to the cloud doesn't eliminate the need for skilled IT professionals to architect your application properly. And if you don't follow your provider's advice, expect to get burned.

Sharon Machlis is online managing editor at Computerworld. You can follow her on Twitter at @sharon000.

This story, "You're Probably to Blame (Too) if Amazon's Cloud Outage Knocked Your Site Offline" was originally published by Computerworld.


Copyright © 2011 IDG Communications, Inc.
