The Amazon Outage in Perspective: Failure Is Inevitable, So Manage Risk
The most recent Amazon Web Services outage left customers (and rival cloud providers) blaming Amazon. Instead, CIO.com columnist Bernard Golden says, everyone needs to accept that cloud computing is not immune to failure. Fortunately, a key advantage of the cloud -- cheap, easy redundancy -- will help mitigate the risk of an outage.
An endless stream of tweets and blog posts has noted, described and bewailed last week’s Amazon Web Services outage. Some people characterized the outage as an indictment of public cloud computing in general. Others, some of whom work at other cloud providers, characterized it as indicative of AWS-specific shortcomings. Still others used the event as an opportunity to outline how users have to be sure to hammer home SLA penalty clauses during contract negotiations, just to ensure protection from outages.
Most of these responses reflect bias or the commenter’s own agenda and fail to draw the proper lessons from this outage. More crucially, they fail to offer really useful advice or recommendations, preferring to proffer outmoded or alternative solutions that do not provide risk mitigation strategies appropriate for the new world of IT.
The first thing to look at is what risk really is. Wikipedia calls it “the probable frequency and probable magnitude of future loss.” In other words, risk can be ascertained by how often a problem occurs and how much that problem is likely to cost. Naturally, one has to evaluate how valuable mitigation efforts to address a risk are, given the cost of mitigation. Spending $1 to protect oneself against a $1,000 loss would seem to make sense, while spending $1,000 to protect oneself against a $1 loss is foolish.
Amazon Outages Show That Failure Is An Option
The question for users is whether this outage presents a large enough loss that continuing to use AWS is no longer justified (i.e., is too risky) and that other solutions should be pursued. Certainly there are now applications running on AWS that represent millions or even tens of millions of dollars of annual revenue, so this question is quite germane.
In terms of this specific outage, Amazon posted an explanation that describes it as a combination of some planned maintenance, a failure to update some internal configuration files and a programmatic memory leak. The result was poor availability of Amazon’s Elastic Block Storage (EBS) service.
Interestingly, the last large AWS outage was also an EBS failure. Even more interestingly, it had an entirely different cause, although human error was the trigger then as well. In both cases, someone misconfigured an EBS resource, which triggered an unexpected condition, resulting in a service outage.
Most interesting of all, AWS says users shouldn’t be surprised by this occurrence. Amazon’s No. 1 design principle: “everything fails all the time; design your application for failure and it will never fail.”
Many people are outraged by this, feeling that a service provider should take responsibility for ensuring 100% (or at least “five nines”) of service availability. Amazon’s attitude, they imply, is irresponsible. The right solution, they say, is that users should look to a provider that is willing to take responsibility and provide a service that is truly reliable, made possible by use of so-called “enterprise-grade” hardware and software backstopped by ironclad change control.
There Is No “Right” Equipment, No Matter What Your SLA Says
There’s only one problem: the solution proposed by commenters is outmoded, inappropriate and unsustainable.
First, it assumes that availability can be increased by use of enterprise-grade equipment. The fact is, every type of equipment fails, often at inconvenient times. Believing that availability can magically improve by simply using the “right” equipment is doomed to failure.
Resource failure is an unfortunate reality. The real question is what user organizations should actually do to protect themselves from hardware failure. I view the “negotiate harder on the SLA” strategy as akin to “the beatings will continue until morale improves,” meaning that it makes the SLA-demander feel better but is unlikely to result in any actual improvement.
Many of the cloud providers commenting on the AWS outage propose this kind of solution. In my view, this demonstrates how poorly they understand this issue. Their hardware will fail, too. Those engaged in taunting a competitor when it experiences a service failure should remember that pride goes before a fall.
Second, ironclad change control processes are not actually going to reduce resource failure. This is because anything involving human interaction is subject to mistakes, which result in failure. It’s instructive to note that neither major AWS outage was the result of hardware failure. Both resulted from human error: specifically, human error that interacted with system design assumptions that failed to account for the type of error that occurred. And even organizations that are strongly ITIL-oriented experience human-caused problems.
Finally, the solutions proposed don’t account for the world of the future. Every company is going to experience a massive increase in IT scale; believing that just putting in place rigid enough processes, with enough checks and balances, will reduce failure just doesn’t recognize how inadequate that approach is for this new IT world. No IT organization (and no cloud provider) will be able to afford enough people (or enough enterprise-grade equipment) to pursue this type of solution.
Redundancy, Failover Have Been Best Practices For a Long Time
The true solution for resource failure has long been known: redundancy and failover. Instead of a single server, use two; if one goes down, it’s possible to switch over to the second to keep an application running. It’s just that, in the past, implementing redundancy was unaffordable except for a tiny percentage of truly mission-critical applications, given the cost of hardware and software.
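To make the two-server idea concrete, here is a minimal sketch in Python. It is an illustration only, not anything Amazon or any provider prescribes: the function names are hypothetical, each "replica" is just a callable standing in for a request to one server, and the availability arithmetic assumes the servers fail independently of one another. That assumption is exactly what breaks down in a service-wide outage.

```python
from typing import Callable, Optional, Sequence


def call_with_failover(replicas: Sequence[Callable[[], str]]) -> str:
    """Try each replica in order; return the first successful response.

    Each replica is a hypothetical callable standing in for a request to
    one server. A ConnectionError is treated as "that server is down".
    """
    last_error: Optional[Exception] = None
    for replica in replicas:
        try:
            return replica()
        except ConnectionError as exc:
            last_error = exc  # remember the failure, move on to the next copy
    raise RuntimeError("all replicas failed") from last_error


def combined_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant servers, each up `single` of the time.

    Assumes failures are independent, so the application is down only
    when every copy is down at once.
    """
    return 1 - (1 - single) ** copies
```

Under those assumptions, two servers that are each available 99% of the time give roughly 99.99% combined availability, because both must be down at the same moment for the application to fail. A failure that takes out both copies together gets no help from this arithmetic.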
The genius of cloud computing is that it offers the ability to address this redundancy easily and cheaply. Many users have designed their apps to be resilient in the face of individual resource failure and have protected themselves against it—unlike those who pursue the traditional solutions proffered by many commenters which will, inevitably, result in an outage when the enterprise-grade equipment fails.
The more troubling situation is the infrequent failures that have human error involved, which result in more widespread service failure. In other words, it’s not just one application’s resources being unavailable, but a service being out for a large number of applications.
It’s tempting to believe the problem is that Amazon just doesn’t have good process or smart enough people working for it and that, if those aspects were addressed by it (or another provider), then these infrequent failures wouldn’t occur.
This attitude is wrong. These corner case outages will continue, unfortunately. We are building a new model of computing—highly automated and vastly scaled, with rich functionality—and the industry is still learning how to operate and manage this new mode of computing. Inevitably, mistakes will occur. The mistakes are typically not simple errors but, rather, unusual conditions triggering unexpected events within the infrastructure. While cloud providers will do everything they can to prevent such situations, they will undoubtedly occur in the future.
In the End, It Comes Down To Risk
What is the solution for these infrequent yet widespread service outages? AWS recommends more extensive redundancy measures that span geographic regions. Given AWS scoping, that would protect against region-wide resource unavailability. There’s only one problem. Implementing more expansive redundancy is complex and expensive—far more so than the simpler measures associated with resource redundancy.
This brings us back to the topic of risk. Remember, risk is the probable frequency of a failure weighed against the probable magnitude of the loss it causes. You have to estimate how often you expect these rarer, larger-scale resource failures to occur, estimate what each would cost you, and compare that to the cost of preventing them via design and operations. In some sense, one is evaluating the cost of careful design and operation vs. the cost of a more general failure.
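The comparison can be written down as simple arithmetic. The sketch below is one hedged way to frame it, using the frequency-times-magnitude definition of risk quoted earlier. The function names, the `residual_fraction` parameter, and every dollar figure are hypothetical, chosen only to show the shape of the calculation.

```python
def expected_annual_loss(outages_per_year: float, loss_per_outage: float) -> float:
    """Risk as probable frequency multiplied by probable magnitude."""
    return outages_per_year * loss_per_outage


def mitigation_is_worthwhile(
    outages_per_year: float,
    loss_per_outage: float,
    mitigation_cost_per_year: float,
    residual_fraction: float = 0.0,
) -> bool:
    """Return True if the loss avoided exceeds what the mitigation costs.

    residual_fraction is the (assumed) share of the loss that remains
    even with the mitigation in place; 0.0 means it removes the loss
    entirely.
    """
    before = expected_annual_loss(outages_per_year, loss_per_outage)
    avoided = before * (1 - residual_fraction)
    return avoided > mitigation_cost_per_year


# Hypothetical figures: one region-wide outage every two years,
# costing $200,000 each time, is an expected loss of $100,000 a year.
print(mitigation_is_worthwhile(0.5, 200_000, 60_000))   # True
print(mitigation_is_worthwhile(0.5, 200_000, 150_000))  # False
```

With these made-up numbers, multi-region redundancy costing $60,000 a year pays for itself, while the same protection at $150,000 a year does not. The hard part in practice is not the multiplication but putting honest estimates on frequency and loss.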
Certainly the cost of the design and operation can be worked out, while many people prefer to avoid thinking of the cost of a more widespread failure that would take their application offline. However, as more large revenue applications move to AWS, failing to evaluate risk and implement appropriate failure-resistant measures will be imprudent.
Overall, it’s not as though the possibility of these outages is unknown, or that the appropriate mitigation techniques are hard to discover. You should expect that CSPs will suffer general resource outages and not blame the provider in the event of such an outage. Instead, you should recognize that you made a decision, perhaps without acknowledging the risk associated with it. Those who look at these outages and choose to do nothing more than damn the provider and demand perfection don’t recognize how dangerous a game they are playing.
Bernard Golden is the vice president of Enterprise Solutions for enStratus Networks, a cloud management software company. He is the author of three books on virtualization and cloud computing, including Virtualization for Dummies. Follow Bernard Golden on Twitter @bernardgolden.
Named by Wired.com as one of the 10 most influential people in cloud computing, Bernard Golden serves as vice president of strategy for ActiveState Software, an independent provider of CloudFoundry. He is the author of four books on virtualization and cloud computing, his most recent book being Amazon Web Services for Dummies. Learn more about him at www.bernardgolden.com.