Amazon Promises to Improve Redundancy After Dublin Outage
Affected users will receive service credits for either 10 or 30 days
Mon, August 15, 2011
IDG News Service — Amazon Web Services (AWS) learned a lot of lessons from the outage that affected its Dublin data center, and will now work to improve power redundancy, load balancing and the way it communicates when something goes wrong with its cloud, the company said in a summary of the incident.
The post-mortem delved deeper into what caused the outage, which affected the availability of Amazon's EC2 (Elastic Compute Cloud), EBS (Elastic Block Store), RDS (Relational Database Service) and Amazon's network. The service disruption began Aug. 7, at 10:41 a.m., when Amazon's utility provider suffered a transformer failure. At first, a lightning strike was blamed, but the provider now believes that it was not the cause, and is continuing to investigate, according to Amazon.
Normally, when primary power is lost, the electrical load is seamlessly picked up by backup generators. Programmable Logic Controllers (PLCs) ensure that the electrical phase is synchronized between generators before their power is brought online. But in this case one of the PLCs did not complete its task, likely because of a large ground fault, which led to the failure of some of the generators as well, according to Amazon.
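As a rough illustration of the synchronization check described above, the Python sketch below lets generators come online only when their phase angles agree within a tolerance. It is not Amazon's PLC logic, which the summary does not describe; the function names and the 10-degree tolerance are assumptions made for the example.

```python
def phase_difference_deg(a_deg: float, b_deg: float) -> float:
    """Smallest absolute angle between two phase angles, in degrees (0 to 180)."""
    diff = abs(a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)


def can_bring_online(generator_phases_deg, tolerance_deg=10.0):
    """Return True only if every generator is within the tolerance of the first one.

    The tolerance value is an assumption for illustration, not a real setting.
    """
    reference = generator_phases_deg[0]
    return all(
        phase_difference_deg(reference, phase) <= tolerance_deg
        for phase in generator_phases_deg
    )
```

In this simplified model, a controller that never completes the check would leave the generators offline, which is broadly the failure the article describes.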
To prevent this from recurring, Amazon will add redundancy and more isolation for its PLCs so they are insulated from other failures, it said.
Amazon's cloud infrastructure is divided into regions and availability zones. Each region, such as the data center in Dublin (also called the EU West Region), consists of one or more Availability Zones, which are engineered to be insulated from failures in other zones in the same region. The thinking is that customers can use multiple zones to improve reliability, something that Amazon is working on simplifying.
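The reasoning behind using multiple zones can be sketched in a few lines of Python. The example below is a toy model, not an AWS API: the zone names, instance names and helper functions are all hypothetical. It spreads instances across zones in rotation and then shows which ones remain if a single zone fails.

```python
def place_round_robin(instance_ids, zones):
    """Assign instances to zones in rotation so that no single zone holds them all."""
    return {iid: zones[i % len(zones)] for i, iid in enumerate(instance_ids)}


def surviving_instances(placement, failed_zone):
    """Return the instances that are not located in the failed zone."""
    return sorted(iid for iid, zone in placement.items() if zone != failed_zone)


# Hypothetical zone and instance names, for illustration only.
zones = ["zone-a", "zone-b", "zone-c"]
placement = place_round_robin(["web-1", "web-2", "web-3", "web-4"], zones)
remaining = surviving_instances(placement, "zone-a")
```

With four instances across three zones, losing "zone-a" takes out two instances and leaves two running, whereas a single-zone deployment would have lost all four.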
At the time of the disruption, customers who had EC2 instances and EBS volumes independently operating in multiple EU West Region Availability Zones did not experience service interruption, according to Amazon. However, management servers became overloaded as a result of the outage, which had an impact on performance in the whole region.
To prevent this from recurring, Amazon will implement better load balancing, it said. Also, over the last few months, Amazon has been "developing further isolation of EC2 control plane components to eliminate possible latency or failure in one Availability Zone from impacting our ability to process calls to other Availability Zones," it wrote. The work is still ongoing, and will take several months to complete, according to Amazon.
The service that caused Amazon the biggest problem was EBS, which is used to store data for EC2 instances. The service replicates volume data across a set of nodes for durability and availability. Following the outage, the nodes started talking to each other to replicate changes. Amazon has spare capacity to allow for this, but the sheer amount of traffic proved too much this time.
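The way a burst of re-replication can exhaust spare capacity can be illustrated with a small Python sketch. This is a simplified model under stated assumptions, not a description of how EBS works internally: it treats spare capacity as free storage on each node and uses a first-fit placement, and the volume names, node names and sizes are hypothetical.

```python
def remirror(volume_sizes_gb, spare_capacity_gb):
    """Place each volume needing a new replica on the first node with enough room.

    volume_sizes_gb: dict mapping volume name to size in GB.
    spare_capacity_gb: dict mapping node name to free space in GB.
    Returns (placed, stuck): a dict of volume -> node, and a list of
    volumes that could not find space.
    """
    free = dict(spare_capacity_gb)  # copy so the caller's dict is not modified
    placed, stuck = {}, []
    for volume, size in volume_sizes_gb.items():
        target = next((node for node, gb in free.items() if gb >= size), None)
        if target is None:
            stuck.append(volume)
        else:
            free[target] -= size
            placed[volume] = target
    return placed, stuck


# Hypothetical numbers: three 100 GB volumes compete for 250 GB of spare space.
placed, stuck = remirror(
    {"vol-1": 100, "vol-2": 100, "vol-3": 100},
    {"node-a": 150, "node-b": 100},
)
```

Here the first two volumes find room and the third is left waiting, even though 50 GB remains free on one node. When many volumes need new replicas at once, demand can outrun spare capacity in a similar way.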