by Bernard Golden

Cloud SLA: Another Point of View

Nov 16, 2009

Critics are piling on the gripes about cloud computing service level agreements. But let's discuss the assumption that enterprise data centers operate at far higher availability rates than cloud providers: Frankly, I'm unconvinced.

You’ve probably seen a hundred, or even a thousand, articles criticizing cloud computing Service Level Agreements (SLAs). A common example in those articles is the putatively low Amazon Web Services SLA. Typically, authors of these kinds of articles go on to cite recent outages by cloud providers, implying (or stating directly) that cloud computing falls woefully short of the true SLA requirements of enterprises, often described as “five nines,” i.e., 99.999% availability.

Left unsaid in these articles is the assumption that enterprise data centers operate at far higher availability rates than cloud providers. Frankly, I’m unconvinced of this. It seems that nearly every time I contact a large company’s support, the very nice call center representative apologizes for a delay caused by “the computer running slow this morning.” Plenty of people I interact with fume because their email is down. So the comparison that slights cloud providers in favor of internal data centers may be inaccurate. Its accuracy may, in fact, be unprovable, as many enterprises do not actually measure real-world SLA performance.


Something else that should be noted is the absence of “five nines” availability in the real world. If one looks at the Uptime Institute’s definition of the various tiers of data center robustness, the scale goes from Tier 1, with a single path of components and single points of failure, up to Tier 4, with multiple cooling and power paths, along with redundant components. However, even Tier 4 achieves only 99.995% availability, below the “five nines” standard.
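To make the gap between those two figures concrete, an availability percentage can be converted into the downtime it permits per year. The following is a minimal sketch, assuming a 365-day year; the function name is purely illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget implied by an availability percentage."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


# Compare the "five nines" target with the Tier 4 figure cited above.
for label, pct in [("five nines", 99.999), ("Tier 4", 99.995)]:
    print(f"{label} ({pct}%): {downtime_minutes_per_year(pct):.1f} minutes per year")
```

By this arithmetic, “five nines” allows roughly 5.3 minutes of downtime a year, while 99.995% allows roughly 26.3 minutes.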

Moreover, there is an apples-to-oranges element of comparison here as well. The cloud providers not only deliver infrastructure (the domain of the Uptime Institute’s definitions) but deliver software capability as well. In the case of Amazon, the software capability consists of hypervisors, storage management, and cloud management software. Google and Microsoft layer additional platform functionality on top of what Amazon provides.

If one were to compare internal data center SLAs that incorporate the software layer as well as the infrastructure, one could hazard a guess that the cloud providers would start to look a lot better.
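One way to see why the comparison might shift: when an application needs every layer of a stack to be up, the layers’ availabilities multiply. The sketch below assumes independent failures, and its software-layer numbers are invented placeholders rather than measurements; only the 99.995% facility figure comes from the discussion above.

```python
from math import prod


def stacked_availability(layer_availabilities):
    """Availability of a stack in which every layer must be up (independence assumed)."""
    return prod(layer_availabilities)


# Only the 99.995% facility figure comes from the text above;
# the software-layer numbers are hypothetical placeholders.
layers = {
    "facility (Tier 4)": 0.99995,
    "hypervisor (assumed)": 0.9995,
    "storage management (assumed)": 0.9995,
}

total = stacked_availability(layers.values())
print(f"end-to-end availability: {total:.4%}")
```

With these assumed inputs the end-to-end figure falls below 99.9%, even though the facility alone is rated at 99.995%. An infrastructure-only number, in other words, can flatter an internal data center.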

However, this mode of assessment is increasingly out of date, based as it is on a world of small amounts of expensive hardware running software designed to squeeze onto resource-constrained servers. That approach to system design is no longer leading edge. Why?

In Transaction Processing, written by Jim Gray and Andreas Reuter, the authors posit that hardware is far more reliable than software; consequently, the book spends no time on hardware strategies for raising uptime. Rather, the book discusses how to design software to be transaction-safe (hint: use a relational database that has transaction logging). Gray and Reuter assume a world of small amounts of hardware with applications spanning one or a few machines, but that approach is being left behind in today’s world of big data and massive applications.

Applications today span dozens, hundreds, or even thousands of machines. The scale of data centers is staggering. Microsoft’s new data center in Chicago contains over 400 thousand machines. If you read the articles and studies being published, Google and Microsoft can no longer rely on assuming the robustness of hardware. At these scales, hardware failure is a constant. Rather than treating hardware as a limited resource to be conserved, with reliability increased by purchasing expensive systems, these providers accept ongoing failure, purchase consumer-grade hardware, and increase robustness by significant redundancy—usually triple sets of equipment.
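A rough sketch of the arithmetic behind this shift follows. The 400,000-machine count is the figure cited above; the 3% annual failure rate and the 1% per-machine unavailability are hypothetical inputs chosen only for illustration, and the redundancy calculation assumes replicas fail independently.

```python
def expected_failures_per_day(machines: int, annual_failure_rate: float) -> float:
    """Expected number of machine failures per day, given a per-machine annual rate."""
    return machines * annual_failure_rate / 365


def all_replicas_down(per_machine_unavailability: float, replicas: int) -> float:
    """Probability that every replica is down at once, assuming independent failures."""
    return per_machine_unavailability ** replicas


# 400,000 machines is the figure cited above; the 3% annual failure rate
# and 1% per-machine unavailability are hypothetical inputs.
print(expected_failures_per_day(400_000, 0.03))
print(all_replicas_down(0.01, 3))
```

Under these assumptions, about 33 machines fail every day, yet the chance that all three copies of a given resource are down at the same moment is about one in a million. That is the sense in which cheap hardware plus triple redundancy can beat expensive hardware deployed singly or in pairs.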

So one fact in today’s systems is that the assumption of hardware robustness is no longer tenable. That means that system designers must incorporate individual hardware component failure into application design. The breezy dismissal of hardware as a factor in system reliability that Gray and Reuter maintained must be re-evaluated in light of scale.

The need for massive hardware redundancy calls for a different approach to application design as well. Cloud applications are designed to spread across large numbers of machines and be ready to incorporate multiple copies of the code and application data. The consistency approach advocated by Gray and Reuter is inadequate for this type of application design. With so many machines and so many copies of the data, placing a transactional database at the center of the system where consistent state resides creates a hot spot bottleneck and throttles the available performance of the application—not to mention the loss of application uptime as the overloaded database crashes in the face of the load.

The entire foundation of today’s evaluation systems for uptime must be rethought. The emphasis on increasing application robustness by eliminating single points of failure through pair redundancy of expensive resources is inadequate, not to mention economically non-viable in a world of super-scale. In fact, in one paper on the topic, Microsoft data center designers advocate considering the elimination of any pair redundancy of critical resources within the data center; instead, the paper advocates redundancy at the data center level, with “consumer-grade” data centers failing over to one another in case of resource unavailability.

In short, I’m unconvinced of the common criticism of cloud computing as providing inadequate SLAs. If one looks at traditional architecture applications, it’s unclear just what availability internal data centers actually provide, and criticizing cloud providers for low SLAs overlooks the fact that they take responsibility for software layers as well as hardware, while the common measures for internal data centers focus solely on infrastructure availability.

When one turns to the next generation of applications, though, the inappropriateness of traditional availability measures becomes really clear. In a world of big data and applications spanning hundreds or thousands of machines (or, indeed, spanning multiple data centers), trying to apply measures designed for a world of small apps running on expensive hardware seems pointless. Perhaps it’s time to rethink the concept of SLA.

As a final thought, it seems that a new generation of mainstream business applications is on the horizon (think Google architecture for ERP). Instead of creating islands of huge cost pools (enormously expensive software put onto expensive clustered hardware), companies will use external providers who write the apps to leverage massive cheap hardware redundancy with a software architecture designed for large pools of hardware resources, all paid for with a per-user, per-month fee. And don’t kid yourselves that companies won’t jump to this kind of offering if it comes in at 20% of the loaded cost of the on-premises version. I recently listened to a really interesting podcast about Workday, a new HR SaaS provider. According to the speaker, they’ve picked up several customers who balked at the cost of paying for an on-premises software upgrade. It’s amazing how willing people are to take on some additional risk to save a ton of money. The future is going to be interesting.

Bernard Golden is CEO of consulting firm HyperStratus, which specializes in virtualization, cloud computing and related issues. He is also the author of “Virtualization for Dummies,” the best-selling book on virtualization to date.

Follow Bernard Golden on Twitter @bernardgolden. Follow everything from CIO.com on Twitter @CIOonline.