The Amazon Outage in Perspective: Failure Is Inevitable, So Manage Risk
The most recent Amazon Web Services outage left customers (and rival cloud providers) blaming Amazon. Instead, CIO.com columnist Bernard Golden says, everyone needs to accept that cloud computing is not immune to failure. Fortunately, a key advantage of the cloud -- cheap, easy redundancy -- will help mitigate the risk of an outage.
Thu, November 01, 2012
CIO — An endless stream of tweets and blog posts have noted, described and bewailed last week's Amazon Web Services outage. Some people characterized the outage as an indictment of public cloud computing in general. Others, some of whom work at other cloud providers, characterized it as indicative of AWS-specific shortcomings. Still others used the event as an opportunity to outline how users have to be sure to hammer home SLA penalty clauses during contract negotiations, just to ensure protection from outages.
Most of these responses are reflective of bias or the commenter's own agenda and fail to draw the proper lessons from this outage. More crucially, they fail to offer really useful advice or recommendations, preferring to proffer outmoded or alternative solutions that do not provide appropriate risk mitigation strategies appropriate for the new world of IT.
The first thing to look at is what risk really is. Wikipedia calls it "the probable frequency and probable magnitude of future loss." In other words, risk can be ascertained by how often a problem occurs and how much that problem is likely to cost. Naturally, one has to evaluate how valuable mitigation efforts to address a risk are, given the cost of mitigation. Spending $1 to protect oneself against a $1,000 loss would seem to make sense, while spending $1,000 to protect oneself against a $1 loss is foolish.
Amazon Outages Show That Failure Is An Option
The question for users is whether this outage presents a large enough loss that continuing to use AWS is no longer justified (i.e., is too risky) and that other solutions should be pursued. Certainly there are now applications running on AWS that represent millions or even tens of millions of dollars of annual revenue, so this question is quite germane.
In terms of this specific outage, Amazon posted an explanation that describes it as a combination of some planned maintenance, a failure to update some internal configuration files and a programmatic memory leak. The result was poor availability of Amazon's Elastic Block Storage (EBS) service.
Interestingly, the last large AWS outage was also an EBS failure, although even more interestingly, it had an entirely different cause, though human error was the trigger for the previous outage as well. In both cases, someone misconfigured an EBS resource, which triggered an unexpected condition, resulting in a service outage.