I was looking through the program for an upcoming cloud computing conference and noted a number of sessions devoted to negotiating contracts and service level agreements (SLAs) with cloud providers. Reading the session descriptions, one cannot help but draw the conclusion that carefully crafting an SLA is fundamental to successfully using cloud computing.
The sessions described at length how they would help attendees with cloud computing topics like:
- Definitions of uptime, availability and performance
- Negotiation techniques in crafting an SLA
- What factors to include in an SLA: virtual machines availability, response times, network latency, etc.
- Negotiating penalties for SLA violation
Having sat through a number of discussions on the topic of SLAs, these session descriptions ineluctably brought to mind the following truth: SLAs are not about increasing availability; their purpose is to provide the basis for post-incident legal combat.
However, none of the sessions pointed this out. The session descriptions seem to suggest that clever SLA negotiation somehow ensures that one's applications will be immune to outages.
In fact, nothing could be further from the truth.
The reality is that all infrastructures will have outages of one sort or another. While careful assessment of a provider's capabilities may enable selection of a more robust provider and paying a higher fee may ensure faster response or communication with a dedicated response team, there is no immunity to outages. No matter how much time you spend crafting an SLA agreement, it will not ensure 100 percent uptime.
So why do people obsess about SLAs?
For one, it gives people a sense of control. Sitting in a room, insisting on special treatment, redlining a contract and replacing one set of language with another all make people feel like they're asserting their dominion. And that feels great. But don't imagine that you're going to fundamentally change the provider's contract. I learned this at an SLA presentation given by an attorney. After devoting 90 minutes to the minutiae of contracts, he concluded by saying, "Of course, you won't be able to change the standard contracts much, because they're written to reduce the provider's responsibility. What you're discussing is how much of a service credit you're going to receive."
People also obsess about SLAs because they provide a basis for post-outage haggling. Being hard-nosed up front may mean greater compensation later. But keep it in perspective: no matter how much you haggle, you're not going to be fully compensated for the business loss resulting from the outage.
To reiterate: SLA compensation is limited to a credit against the cost of the service—not the user's cost of the outage—and the cost of the service is often a tiny percent of the cost of the outage.
Here's an example that a former employee of one of the largest outsourcing companies shared with me: Their very large retail customer's website crashed on Black Friday. The application was down for six hours, resulting in a loss of $50 million in revenue. The outsourcer's compensation to the retailer? Six hour's service credit—approximately $300.
The morale of this story? Keep the whole SLA discussion in perspective. It's not going to make you whole if your application is unavailable.
The worst thing about investing too much energy into SLA haggling is that it may distract you from the far more important issue: how to ensure uptime. If you're on the Titanic, and it hits an iceberg and sinks, all of the time you spent negotiating the location and conditions of your deck chair isn't going to help your prospects one bit.
The most important issue is, how should you think about application outage and what are your options for improving uptime?
As a starting point, keep in mind Voltaire's observation: "Le mieux est l'ennemi du bien." Loosely translated, that means, Perfection is the enemy of good. Applied to cloud computing, this might be thought of as "Don't avoid adopting a cloud provider because it can't guarantee 99.999 percent uptime, when one's own data centers fall far short of acceptable uptime."
If adopting cloud computing improves uptime significantly, it's the right thing to do. If there are no actual statistics of the uptime availability of one's own computing environment, that's a telling sign that moving to a cloud provider is a step in the right direction. It may not be perfect, but it's way better than an environment that can't even track its own uptime. Believe me, there are many, many IT organizations with nothing more than earnest assurances about their uptime performance.
Here are some steps you can take to improve your application uptime:
1. Architect your application for resource failure. Perhaps the greatest single step you can take to improve your application's uptime is to architect it so that it can continue performing in the face of individual resource failure (e.g., server failure). Redundancy of application servers ensures the application will continue working even if a server outage kills a virtual machine. Likewise, having replicated database servers means an application won't grind to a halt if one server hangs. Using an application management framework that starts new instances to replace failed ones ensures redundant topologies will be maintained in the event of an outage.
2. Architect your topology for infrastructure failure. While judicious design can protect application availability in the event of an application hardware element failure, it can't help you if the application environment fails. If the entire data center that one's application runs in goes down, use of redundant application designs is futile. The answer in this case is to implement application geographic distribution so that even if a portion of one's application becomes unavailable due to a provider's large-scale outage, the application can continue to operate. This makes application design more complex, of course, but it provides a larger measure of downtime protection.
3. Architect your deployment for provider failure. Of course, it is possible for a cloud provider's complete infrastructure to go offline. Even though the circumstances under which this might occur are quite rare, it is within the realm of possibility. For example, the provider's entire network infrastructure could down, or the cloud provider might abruptly shut down. Far-fetched, perhaps, but both scenarios have happened with online services in the past. The solution is to extend your application's architecture across multiple providers. Despite what many vendors will proclaim, doing so is extremely challenging because the semantics of how cloud providers vary makes it difficult to design an application that can incorporate differing functionalities. Nevertheless, it is possible to implement this application architecture with sufficient planning and careful design.
What should be obvious from this discussion is that higher levels of uptime certainty require increased levels of technical complexity, which translates into increased levels of investment.
Deciding whether a given application requires this level of investment is an exercise in risk assessment. Certainly this should be an explicit exercise, in which the tradeoffs between business exposure, investment, and technical operations complexity are evaluated. It's no easy exercise, and there probably aren't any easy answers. However, doing this is much more likely to result in an acceptable outcome than an extended, though futile, SLA contract slugfest.
Bernard Golden is CEO of consulting firm HyperStratus, which specializes in virtualization, cloud computing and related issues. He is also the author of "Virtualization for Dummies," the best-selling book on virtualization to date.