Amazon's Data Center Outage Reads Like a Thriller

When an Amazon Web Services data center lost power early Wednesday, the details about the event trickled out over Amazon's operations status board. Performance monitoring firms offer services designed to get ahead of such problems.

By Patrick Thibodeau
Fri, December 11, 2009

Computerworld -- When an Amazon Web Services data center lost power early Wednesday, the company wrote about the unfolding event with the brevity and tension of one of its bestselling potboilers.

Our anonymous author, whom we'll call Sysadmin, begins his story simply, without emotional complications or love interests.

"We are investigating connectivity issues for instances in the US-EAST-1 region," Sysadmin writes on Amazon's operations status board at 1:08 a.m. PT.

With one sentence, we're intrigued. Something's up with Amazon's data center in Northern Virginia, just a short drive from Washington: Tom Clancy country.

You can almost feel what's going on. Cloud-based services are crashing and there's a scramble for answers. Elsewhere, PC screens are refreshed as readers wait for an update from Sysadmin (Kindle edition not yet available). Some 18 minutes pass. Tension builds.

Sysadmin offers an update, referring to isolated "power issues."

Inside the data center a real, red-light-flashing drama unfolds.

At first, a "single component of the redundant power distribution system failed in this zone," Sysadmin would later write in a postscript for his audience. But while the data center staff worked on that component, there was a twist: "A second component, used to assure redundant power paths, failed as well."

Customers are losing connectivity.

Whether data center staff cheered when the problem was fixed remains a mystery. But as soon as the "defective power distribution units were bypassed, servers restarted and instances began to come online shortly thereafter," wrote Sysadmin.

Readers wouldn't get those details until later, when Sysadmin had more information and time. In those early minutes of the outage, only essential information got to anxious readers. At 1:51 a.m., Sysadmin wrote: "The underlying power issue has been addressed. Instances have begun to recover."

At 2:11 a.m., he writes again: a recovery is well under way.

All that's left are the reviews. That's where companies like Wellesley Mills, Mass.-based Apparent Networks Inc. come in.

In November, Apparent Networks launched its Cloud Performance Center, an online service that allows anyone to review -- in real time -- the performance of 16 cloud providers, including Amazon and Google. It covers such things as bandwidth capacity, latency and data loss, then scores them overall.

Jim Melvin, president of the privately held Apparent Networks, said his firm can continuously monitor network performance over WANs using technology it has extended to the cloud. The monitoring is done with a "very lightweight stream of packets" that continuously travels the network to monitor activity and cloud performance.
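To make the idea of probe-based monitoring concrete, here is a minimal Python sketch of how a batch of probe results might be boiled down to the latency and data-loss figures mentioned above. It is an illustration only, not Apparent Networks' method: the function name `summarize_probes`, the report fields and the sample numbers are all hypothetical, and the sketch assumes each probe yields either a round-trip time in milliseconds or no reply at all.

```python
from statistics import mean
from typing import Optional, Sequence


def summarize_probes(rtts_ms: Sequence[Optional[float]]) -> dict:
    """Summarize one batch of probe results (hypothetical helper).

    Each entry is a round-trip time in milliseconds, or None when the
    probe packet got no reply and is counted as lost.
    """
    if not rtts_ms:
        raise ValueError("need at least one probe result")

    # Keep only the probes that were answered.
    answered = [r for r in rtts_ms if r is not None]
    lost = len(rtts_ms) - len(answered)

    return {
        "sent": len(rtts_ms),
        "lost": lost,
        "loss_pct": 100.0 * lost / len(rtts_ms),
        # Latency figures are undefined if every probe was lost.
        "avg_latency_ms": mean(answered) if answered else None,
        "max_latency_ms": max(answered) if answered else None,
    }


# Made-up sample: five probes, one of which went unanswered.
sample = [20.0, 22.0, None, 18.0, 20.0]
report = summarize_probes(sample)
print(report)
```

A real monitoring service would send the probes itself and track these figures over time; the sketch covers only the arithmetic on results that have already been collected.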

Originally published on www.computerworld.com.