Cloud SLA: Another Point of View

Critics are piling on the gripes about cloud computing service level agreements. But let's discuss the assumption that enterprise data centers operate at far higher availability rates than cloud providers: Frankly, I'm unconvinced.

By Bernard Golden
Mon, November 16, 2009

CIO - You've probably seen a hundred, or even a thousand, articles criticizing cloud computing Service Level Agreements (SLAs). A common example in those articles is the putatively low Amazon Web Services SLA. Typically, the authors of these articles go on to cite recent outages at cloud providers, implying (or stating directly) that cloud computing falls woefully short of the true SLA requirements of enterprises, often described as "five nines," i.e., 99.999% availability.

Left unsaid in these articles is the assumption that enterprise data centers operate at far higher availability rates than cloud providers. Frankly, I'm unconvinced of this. It seems that nearly every time I contact a large company's support line, the very nice call center representative apologizes for a delay caused by "the computer running slow this morning." Plenty of people I interact with fume because their email is down, and so on. So the comparison that slights cloud providers in favor of internal data centers may be inaccurate. It may even be unprovable either way, since many enterprises do not actually measure their real-world SLA performance.

Something else that should be noted is the absence of "five nines" availability in the real world. If one looks at the Uptime Institute's definition of the various tiers of data center robustness, the scale goes from Tier 1, with a single path of components and single points of failure, up to Tier 4, with multiple cooling and power paths along with redundant components. However, even Tier 4 achieves only 99.995% availability, below the "five nines" standard.
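To put that gap in concrete terms, here is a minimal sketch that converts an availability percentage into permitted downtime per year. The function name and the 365-day year are illustrative assumptions, not part of the Uptime Institute's definitions.

```python
# Minutes in a non-leap year (an assumption made to keep the arithmetic simple).
MINUTES_PER_YEAR = 365 * 24 * 60


def annual_downtime_minutes(availability_percent: float) -> float:
    """Return the minutes per year a system can be down at a given availability."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


for label, pct in [("Tier 4 (99.995%)", 99.995), ("five nines (99.999%)", 99.999)]:
    print(f"{label}: {annual_downtime_minutes(pct):.1f} minutes/year")
```

At 99.995%, the allowance works out to roughly 26 minutes of downtime a year, against about 5 minutes for five nines.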

Moreover, there is an apples-to-oranges element of comparison here as well. The cloud providers not only deliver infrastructure (the domain of the Uptime Institute's definitions) but deliver software capability as well. In the case of Amazon, the software capability consists of hypervisors, storage management, and cloud management software. Google and Microsoft layer additional platform functionality on top of what Amazon provides.

If one were to compare internal data center SLAs that incorporate the software layer as well as the infrastructure, one could hazard a guess that the cloud providers would start to look a lot better.
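A rough way to see why: when a service needs every layer working at once, the layer availabilities multiply, so the end-to-end figure is always lower than the facility figure alone. The sketch below assumes the layers fail independently, and the hypervisor and application numbers are made up purely for illustration.

```python
from math import prod


def stacked_availability(layer_availabilities):
    """Availability of a service that needs every layer up at the same time,
    assuming the layers fail independently of one another."""
    return prod(layer_availabilities)


# Hypothetical figures for illustration only: a Tier 4 facility (99.995%)
# plus invented numbers for a hypervisor layer and an application layer.
layers = [0.99995, 0.9995, 0.999]
print(f"End-to-end availability: {stacked_availability(layers):.4%}")
```

With these invented inputs the combined figure lands below 99.9%, even though the facility on its own is rated at 99.995%.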

However, this mode of assessment is increasingly out of date, based as it is on a world of small amounts of expensive hardware running software designed to squeeze onto resource-constrained servers. That approach to system design is no longer leading edge. Why?

In Transaction Processing, written by Jim Gray and Andreas Reuter, the authors posit that hardware is far more reliable than software; consequently, the book spends no time on hardware strategies for raising uptime. Rather, the book discusses how to design software to be transaction-safe (hint: use a relational database that has transaction logging). Gray and Reuter assume a world of small amounts of hardware with applications spanning one or a few machines, but that approach is being left behind in today's world of big data and massive applications.

Applications today span dozens, hundreds, or even thousands of machines. The scale of data centers is staggering. Microsoft's new data center in Chicago contains over 400 thousand machines. According to the articles and studies being published, Google and Microsoft can no longer assume that hardware is robust. At these scales, hardware failure is a constant. Rather than treating hardware as a limited resource to be conserved, with reliability increased by purchasing expensive systems, these providers accept ongoing failure, purchase consumer-grade hardware, and increase robustness through significant redundancy, usually triple sets of equipment.
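The arithmetic behind that bet can be sketched in a few lines. If any one of several copies can carry the load, the service is down only when all of them are down at once. The example assumes failures are independent (which real outages often are not), and the 99% figure for a single cheap machine is hypothetical.

```python
def redundant_availability(single_availability: float, copies: int) -> float:
    """Availability when any one of `copies` replicas can serve requests,
    assuming the replicas fail independently."""
    return 1 - (1 - single_availability) ** copies


# Hypothetical consumer-grade machine that is up 99% of the time.
for copies in (1, 2, 3):
    print(f"{copies} copies: {redundant_availability(0.99, copies):.6%}")
```

Under these assumptions, three copies of a 99% machine already exceed five nines, which helps explain why cheap hardware plus redundancy can beat a single expensive system.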

So one fact in today's systems is that the assumption of hardware robustness is no longer tenable. That means that system designers must incorporate individual hardware component failure into application design. The breezy dismissal of hardware as a factor in system reliability that Gray and Reuter maintained must be re-evaluated in light of scale.

The need for massive hardware redundancy calls for a different approach to application design as well. Cloud applications are designed to spread across large numbers of machines and be ready to incorporate multiple copies of the code and application data. The consistency approach advocated by Gray and Reuter is inadequate for this type of application design. With so many machines and so many copies of the data, placing a transactional database at the center of the system where consistent state resides creates a hot spot bottleneck and throttles the available performance of the application—not to mention the loss of application uptime as the overloaded database crashes in the face of the load.

The entire foundation of today's evaluation systems for uptime must be rethought. The emphasis on increasing application robustness by eliminating single points of failure through pair redundancy of expensive resources is inadequate, not to mention economically non-viable in a world of super-scale. In fact, in one paper on the topic, Microsoft data center designers advocate considering the elimination of any pair redundancy of critical resources within the data center; instead, the paper advocates redundancy at the data center level, with "consumer-grade" data centers failing over to one another in case of resource unavailability.

In short, I'm unconvinced of the common criticism of cloud computing as providing inadequate SLAs. If one looks at traditional architecture applications, it's unclear just what availability internal data centers actually provide, and criticizing cloud providers for low SLAs overlooks the fact that they take responsibility for software layers as well as hardware, while the common measures for internal data centers focus solely on infrastructure availability.

When one turns to the next generation of applications, though, the inappropriateness of traditional availability measures becomes really clear. In a world of big data and applications spanning hundreds or thousands of machines (or, indeed, spanning multiple data centers), trying to apply measures designed for a world of small apps running on expensive hardware seems pointless. Perhaps it's time to rethink the concept of SLA.

As a final thought, it seems that a new generation of mainstream business applications is on the horizon: think Google architecture for ERP. Instead of creating islands of huge cost pools (enormously expensive software put onto expensive clustered hardware), companies will use external providers who write the apps to leverage massive cheap hardware redundancy with a software architecture designed for large pools of hardware resources, all paid for with a per-user, per-month fee. And don't kid yourselves that companies won't jump to this kind of offering if it comes in at 20% of the loaded cost of the on-premises version. I recently listened to a really interesting podcast about Workday, a new HR SaaS provider. According to the speaker, they've picked up several customers who balked at the cost of paying for an on-premises software upgrade. It's amazing how willing people are to take on some additional risk to save a ton of money. The future is going to be interesting.

Bernard Golden is CEO of consulting firm HyperStratus, which specializes in virtualization, cloud computing and related issues. He is also the author of "Virtualization for Dummies," the best-selling book on virtualization to date.

Follow Bernard Golden on Twitter @bernardgolden. Follow everything from CIO.com on Twitter @CIOonline.
