Mitigating the Risk of Cloud Services Failure: How to Avoid Getting Amazon-ed
The two-day outage at one of Amazon's data centers has everyone questioning the reliability of infrastructure-as-a-service offerings. Here are seven tips for limiting your risk in the event of cloud services failure.
Mon, April 25, 2011
CIO — One of CIOs' biggest concerns about the infrastructure-as-a-service model has been the loss of control over assets and management that enterprises might experience upon moving into a multi-tenant environment. While analysts and early adopters of infrastructure-as-a-service offerings have argued that such apprehension is rooted more in fear than fact, Amazon's recent public data center debacle has given everyone good reason to question the reliability of the public cloud.
The two-day outage may not significantly slow the long-term growth of cloud computing, but it should give IT decision makers pause. Before rushing into any new cloud infrastructure deal, take the following seven steps to mitigate the risk of infrastructure-as-a-service failure.
1. Plan to fail. Develop detailed cloud breakdown scenarios and perform recovery run-throughs. "Put your risk-mitigation strategy firmly in place before moving into the cloud environment," says Phil Fersht, founder of outsourcing analyst firm HfS Research.
Heather McKelvey, vice president of engineering and operations for Mashery, an API management services provider, agrees. "A lot of people think 'get it up and running' and then we'll put in the design for failover," she says. "You can't do that. [Others] assume that a cloud will failover to another cloud—or one data center to another data center—but there are varying degrees of where problems can happen, and you need to architect and build for all levels of failure in your system, not just the high level."
2. Keep some expertise in house. Part of the allure of cloudsourcing is the notion that you no longer have to maintain internal knowledge of the technologies that support as-a-service solutions. However, captive know-how comes in handy when you need to prepare for and react to cloud problems. "I don't see CIOs having much option but to increase in-house knowledge of cloud underpinnings," Fersht says.
If you lack in-house capabilities, ask your provider for help, or consider hiring consultants to create a disaster recovery and business continuity plan. "Even a small investment in third-party risk oversight is worth the investment, if it helps negate a potential disaster in the event of a long outage," Fersht says.
3. Test that plan. Then test it again. "The cloud is the perfect place to test failures in a completely staged environment," says Donald Flood, vice president of engineering for Bizo, a business-to-business advertising network provider and Amazon Web Services (AMZN) customer. "You can easily create a staged environment that mirrors production and test your systems by killing running services and evaluating how your system performs under failure."
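As a rough illustration of the kind of drill Flood describes, the following Python sketch simulates a staged environment in memory, kills some of its running services and counts how many requests still succeed. It is a toy model under stated assumptions, not a description of Bizo's or Amazon's tooling: the names Service, Pool and run_failure_drill are hypothetical, and a real drill would terminate actual staged instances rather than flip a flag.

```python
import random


class Service:
    """Hypothetical stand-in for one running service in a staged environment."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        # A killed service refuses every request.
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


class Pool:
    """Sends each request to the first replica that responds."""

    def __init__(self, replicas):
        self.replicas = replicas

    def handle(self, request):
        for replica in self.replicas:
            try:
                return replica.handle(request)
            except ConnectionError:
                continue  # this replica is down, try the next one
        raise RuntimeError("all replicas are down")


def run_failure_drill(pool, requests, kill_count, seed=0):
    """Kill `kill_count` randomly chosen replicas, then replay the requests.

    Returns (requests that succeeded, total requests).
    """
    rng = random.Random(seed)  # seeded so the drill is repeatable
    for victim in rng.sample(pool.replicas, kill_count):
        victim.alive = False

    succeeded = 0
    for request in requests:
        try:
            pool.handle(request)
            succeeded += 1
        except RuntimeError:
            pass  # count as a failed request
    return succeeded, len(requests)


if __name__ == "__main__":
    staging = Pool([Service(f"web-{i}") for i in range(3)])
    ok, total = run_failure_drill(staging, ["req-1", "req-2", "req-3"], kill_count=2)
    print(f"{ok}/{total} requests served after killing 2 of 3 replicas")
```

With three replicas and two killed, every request is still served by the survivor. Raising kill_count to three shows the opposite case, where the drill reports zero successes and exposes the missing failover path.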