Do Customers Share Blame in Amazon Outages?
A number of high-profile Amazon Web Services (AWS) outages in June temporarily took down services run by customers such as Netflix and Pinterest. While Amazon bears the burden of blame for the outages, should customers also take some responsibility for failing to engage a backup host?
Thu, July 12, 2012
CIO — In June, Amazon Web Services (AWS) suffered two high-profile outages that left a number of customers, including clients like Instagram, Pinterest and Netflix, unable to provide services to their customers. But do AWS' clients share some of the responsibility?
"There's blame on both sides of the equation," says Jason Currill, founder and CEO of Ospero, a global Infrastructure as a Service (IaaS) company. "From the Amazon side, clearly I think there's an obvious issue with their redundancy power. After one outage, you'd think they'd have learned their lesson. Redundancy power is one of those elementary things that data centers are normally very, very good at."
Organizations Must Treat Cloud Providers as a Utility
But the clients affected by the outage also share blame, Currill says, because they failed customers using their services.
"If you're a corporation and you have a building, you have a diesel generator in the basement in case the electricity goes out," he says. "You have two telco lines coming in so if you lose one, you still have communications. Cloud is the same thing. It's a utility. Have two."
Generator Failures, Software Bug Cause Outage
In a detailed post-mortem released after the most recent outage on June 29, Amazon cited a series of power outages, generator failures and rebooting backlogs that led to a "significant impact to many customers."
The problems began with a large-scale electrical storm in northern Virginia, in what Amazon designates its U.S. East-1 Region. U.S. East-1 consists of more than 10 data centers structured into multiple availability zones. This structure is designed to prevent exactly the sort of problem that occurred on June 29; availability zones run on their own physically distinct, independent infrastructure.
Common points of failure like generators and cooling equipment are not shared across availability zones. In theory, even disasters like fires, tornados or flooding would only affect a single availability zone and service should remain uninterrupted by routing around that availability zone to the others.
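To make the "routing around" idea concrete, here is a minimal sketch in Python of how a client-side router might fall back to another availability zone when its preferred one is unhealthy. It is an illustration only, not Amazon's mechanism: the `Zone` class, the `route_request` function and the zone names are all hypothetical, and the `healthy` flag stands in for whatever health check a real deployment would run.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    name: str      # hypothetical zone label, e.g. "zone-a"
    healthy: bool  # assumed to come from some external health check


def route_request(zones, preferred):
    """Return the name of the zone that should serve a request.

    Uses the preferred zone when it is healthy; otherwise falls back to
    the first healthy zone in the list. Raises if no zone is healthy.
    """
    by_name = {zone.name: zone for zone in zones}
    primary = by_name.get(preferred)
    if primary is not None and primary.healthy:
        return primary.name
    for zone in zones:
        if zone.healthy:
            return zone.name
    raise RuntimeError("no healthy availability zone")


if __name__ == "__main__":
    # Hypothetical state: zone-a has lost power, the others are fine.
    zones = [Zone("zone-a", False), Zone("zone-b", True), Zone("zone-c", True)]
    print(route_request(zones, "zone-a"))
```

With the preferred zone marked unhealthy, the request goes to the next healthy zone in the list. The sketch only helps if the application actually runs in more than one zone, which is the point Currill makes about having two of everything.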
But on that Friday, when a large voltage spike occurred in the electrical switching equipment in two of the U.S. East-1 data centers, there was a problem bringing the generators online in one of the affected data centers.
The generators all started successfully, but each generator independently failed to provide stable voltage as it was brought into service. Since the generators weren't able to pick up the load, the data center's servers operated on their Uninterruptible Power Supply (UPS) units. The utility restored power a short time later. But then power went out again. Again the power failed to transfer to the backup generators and again the servers operated on their UPS power.