by Eric Knorr

Power Failures Contain Lessons for Data Center Continuity

Nov 15, 2003

Utility computing, a grand concept endorsed by every big tech company, is one of those rare futuristic ideas that enjoys an aura of inevitability. Once the big pipe reaches everywhere, why not pay for computing power as you need it rather than building your own high-maintenance data center? Makes sense to me.

At least it did, until the lights went out from Ohio to New York a few months ago. If all computing power were woven together into a massive grid and a disaster the scale of August’s blackout occurred, I’d want my old laptop back in a hurry.

Widely deployed utility computing is decades away, but some data centers are already starting to embrace utilitylike principles, preventing and fielding failures of individual systems as they attempt to keep the whole infrastructure intact. So I wondered: Are there any cautionary lessons to be learned from big blackouts that we can use to help avoid massive data center failure?

Live and Learn

To answer that question, I called up Ric Telford, IBM’s director of autonomic computing. First, some necessary nomenclature: “Autonomic computing” is an IBM phrase that refers to systems of any scale that self-configure, self-heal, self-optimize and self-protect. Grid computing, on the other hand, can be defined as a wide-area cluster (which can be autonomic or not). Both autonomic and grid computing are enablers of utility computing: If you can’t get scalable computing power over the wire reliably, you’re not going to pay for it.

At any rate, Telford indulged my metaphor and even expanded on it. “The analogy in computing systems is Web traffic,” he says. “You think about 9/11, you think about recent history when there were huge changes in the influx of traffic, and a lot of sites that prided themselves on being highly responsive to this kind of workload fluctuation failed.” And handling traffic and power spikes requires similar preventative measures.

“What the power grid needed was what we’re saying computing systems need: the ability to be both self-protecting to prevent failures from occurring in the first place and to be self-healing, which is the ability to correct those failures when they occur,” Telford says. According to Telford, self-protection in this context means that “you reach a point where a system can no longer optimize, the point where you’re at the edge of failure, where you can’t handle the spike. What can you do to prevent your whole system from going down?” Part of the answer, he says, is having an embedded routine that abandons normal attempts to optimize and devotes all resources just to keeping the system running. By contrast, the power companies’ self-protecting mechanism “was somewhat drastic: an automatic shutoff.”
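To make Telford's idea concrete, here is a minimal sketch of such an embedded routine, written in Python. It is an illustration only, not IBM's autonomic tooling: the class name `SelfProtectingServer`, the `survival_threshold` parameter and the priority convention are all assumptions invented for this example. Below the threshold the system accepts everything; at the edge of failure it stops trying to serve everyone and sheds the least important work instead of shutting off.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    priority: int  # hypothetical convention: lower number = more important


class SelfProtectingServer:
    """Illustrative only: switches from 'optimize' to 'survive' under a spike."""

    def __init__(self, capacity, survival_threshold=0.9):
        self.capacity = capacity
        # Assumed tuning knob: fraction of capacity treated as the edge of failure.
        self.survival_threshold = survival_threshold

    def mode(self, load):
        limit = self.survival_threshold * self.capacity
        return "survive" if load >= limit else "optimize"

    def admit(self, requests):
        """Return (accepted, shed) for one batch of incoming requests."""
        if self.mode(len(requests)) == "optimize":
            return list(requests), []
        # Survival mode: keep only the most important work, shed the rest.
        ranked = sorted(requests, key=lambda r: r.priority)
        keep = int(self.capacity * self.survival_threshold)
        return ranked[:keep], ranked[keep:]


server = SelfProtectingServer(capacity=10)
spike = [Request(f"req{i}", priority=i) for i in range(12)]
accepted, shed = server.admit(spike)
print(server.mode(len(spike)), len(accepted), len(shed))
```

The point of the design is the contrast Telford draws: the response to a spike is a degraded but running system, not an all-or-nothing cutoff.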

When failure does occur, self-healing should kick in. “That didn’t come across as something the power grid did very well either,” Telford says. “Once a particular node went down, what was the problem with it just being able to bring itself back up again?” Problematic systems, however, should be isolated. “If a failure occurs in a database, and an application is dependent on that database, there’s no sense in keeping that application running if you know the database is down. It’s just going to potentially ripple more failure,” says Telford.
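Telford's database example amounts to walking a dependency map. The Python sketch below is one hedged way to picture it; the function name `dependents_to_isolate` and the component names are hypothetical, and a real data center would draw this map from its own configuration data. Given which components depend on which, it finds everything that should be paused when one component fails, while leaving unrelated systems running.

```python
from collections import deque


def dependents_to_isolate(depends_on, failed):
    """Return every component that directly or transitively depends on `failed`."""
    # Invert the map: for each component, who depends on it?
    reverse = {}
    for component, deps in depends_on.items():
        for dep in deps:
            reverse.setdefault(dep, set()).add(component)

    isolated = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for dependent in reverse.get(current, ()):
            if dependent not in isolated:
                isolated.add(dependent)
                queue.append(dependent)
    isolated.discard(failed)  # the failed component itself is handled separately
    return isolated


# Hypothetical data center: names are made up for illustration.
depends_on = {
    "orders_app": ["orders_db"],
    "storefront": ["orders_app", "catalog_db"],
    "reporting": ["catalog_db"],
}
print(sorted(dependents_to_isolate(depends_on, "orders_db")))
```

In this made-up layout, a failure of `orders_db` pauses `orders_app` and `storefront` but leaves `reporting` alone, which is the kind of containment that keeps one failure from rippling into more.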

The analogy between a data center (or a vast future network of data centers powering utility-based computing) and the power grid is imperfect at best. But Telford agrees that there are lessons IT can learn from catastrophes on both fronts. “At every interconnect,” he says, “a CTO should be asking himself: What are we doing to ensure that failure doesn’t cascade across the whole system? Are the components of our systems self-protecting? It shouldn’t be: ‘Gee, it’s self-protecting, as long as you don’t do this.’”