Lessons From Amazon Cloud Lightning Strike Outage

By Tony Bradley
Wed, August 10, 2011

PC World - A lightning strike in Dublin took out a power transformer. In and of itself, that isn't all that unusual or noteworthy, but this particular lightning strike also knocked out the backup power systems at Amazon's cloud data center, taking the service offline. Looking back, there are lessons to be learned, both for Amazon and for businesses that rely on cloud services.

We're talking about a massive Amazon data center. Data centers are built from the ground up with backups and failovers designed to address virtually any scenario and ensure the survivability and availability of the data center no matter what sort of catastrophe strikes. Amazon, of course, has redundant mechanisms in place, but obviously they didn't work in this case.

On its Service Health Dashboard site for the European EC2 cloud service, Amazon explains, "Normally, upon dropping the utility power provided by the transformer, electrical load would be seamlessly picked up by backup generators. The transient electric deviation caused by the explosion was large enough that it propagated to a portion of the phase control system that synchronizes the backup generator plant, disabling some of them. Power sources must be phase-synchronized before they can be brought online to load. Bringing these generators online required manual synchronization."

In a nutshell, the lightning strike was direct and powerful enough that it simultaneously took out both the transformer and the phase control system needed to bring the backup generators online. Amazon is in the process of restoring service and data for customers, a process that is taking longer than expected and has required Amazon to add server capacity to handle the load.

So, what are the lessons to be learned here? Well, Amazon should do a post-mortem once the service is fully recovered. First, Amazon should analyze the circumstances that led to both primary and backup power failing at the same time. It should determine the likelihood of such an event occurring again and what, if anything, can be done to avoid it. Perhaps the backup power should be on a different grid from the primary power, or maybe this is such a fluke incident that such an investment is cost-prohibitive.

Next, Amazon should review the recovery and restoration process. It should consider the hurdles and stumbling blocks it has encountered, such as needing additional server capacity to handle the load, and it should revise its incident response processes and procedures to make any future disaster recovery operations more effective and efficient.

Originally published on www.pcworld.com.