Thursday, April 21, is a day that Michael Downing, the CEO and CFO of social media start-up Tout, won’t soon forget. In the wee hours of the morning, Downing learned a harsh lesson: cloud computing is not bulletproof.
Tout, which had launched its real-time video status update service a week and a half earlier, was among the numerous customers taken down by Amazon’s EC2 outage. Not only was the main database, which houses critical account information, affected, but Downing also quickly learned that the company’s application server partner, Heroku, was an Amazon customer as well, and it was offline too. “The first 90 days is the critical time when you’re trying to establish your brand and you build momentum. That wasn’t possible when our systems were at a complete standstill,” Downing says.
Before this incident, Downing was proud that more than 90% of his applications were being hosted in the cloud so the company could get off the ground without the shackles of high infrastructure costs. “I’ve trusted and used cloud services for years and this technology is transformational for the start-up world,” he says.
That trust is now irrevocably broken, he says. While Heroku came back online relatively quickly, his database remained down for almost 48 hours. At some point, after little communication from Amazon about a fix, Downing and his team uploaded a three-day-old snapshot of the database to a server at another Amazon location — far from the ailing Virginia data center. “Although we permanently lost some data, we were at least able to get back online,” he says.
As much as a week after the incident began, Downing says, Amazon still hadn’t contacted him, other than through generic mass messages, to explain the outage, which we now know stemmed from a configuration error. “Part of the whole value proposition when you sign on for these services is there will be no one single point of failure and even if a whole node goes down, your systems won’t be tanked. This was a huge eye opener that proved that is definitely not the case,” he says.
As this story was being published, Amazon hadn’t responded to a request for comment.
Already Downing and his team have taken significant steps to ensure they are never caught like this again. “We’ve had a series of meetings here internally to review all points of failure in our cloud strategy. We’re digging deeper to find out where data is hosted and what the backup plans are for that data,” he says. He adds that he’ll be hosting the main database at an additional cloud service for redundancy — a cost he calls blatantly necessary in light of this situation.
Do the Drill-Down
Downing encourages his fellow executives, including CFOs, to do similar drill-downs on their services to avoid major outages, or at least understand the risk. It’s an exercise that Matt Johnson, acting CFO of HealthBridge Inc. — and also its CEO — is engaged in right now.
Although HealthBridge, which provides in-home care for 80 to 100 clients in North Texas, was not affected by the Amazon outage, Johnson says it did make him stop and think about the company’s cloud-based strategy. As at Tout, a large majority of HealthBridge’s applications are hosted in the cloud, on services such as Google, HubSpot, and Salesforce.com.
The company already keeps a physical copy of important patient documents in the home for the caregiver to reference. However, information such as that day’s plan of care is often updated online. “We rely heavily on these services to communicate and to care for our clients. A little bit of downtime to some is an inconvenience; for us it impedes our entire client service operation,” Johnson says.
Because it serves fixed-income seniors, HealthBridge works to contain costs and does not have an IT staff. Therefore, Johnson says he’s going to hire a CFO who can oversee cloud services. “We’ll now be looking for someone who can think through controls and processes surrounding cloud applications for not only compliance, but also disaster recovery,” he says.
‘A Bad Blind Spot’
He wants his future CFO to closely track the data created in the cloud, and the feature sets used there, to make recovery easier. Johnson had hoped eventually to eliminate paper copies as the company scales up. But he says this event has proven that a paper subsystem is needed as a failsafe.
“While the cloud still makes growth less expensive, it makes you thin on IT, and overly dependent on these providers,” he says. “That’s a bad blind spot.”
While Downing and Johnson are stunned by the Amazon downtime, Robert Band, president of a newly launched financial management firm in Miami called CFO & Co., believes that companies have unrealistic expectations for these new cloud services. “My feeling is the cloud industry is a toddler, and it’s going to stumble and bruise itself. They’ll learn from this outage and become far more agile,” he says.
As to whether he still supports cloud computing? “This technology is too compelling and the advantages are too great not to. Especially for small to midsize businesses that can’t afford to come up with out-of-pocket infrastructure investments every few years,” he says.
Going forward, he plans to follow this mantra: “The best you can do is choose your provider as carefully as possible, have access to IT in-house or as a consultant, and do everything you can to engineer redundancy into the system. In other words, plan for the worst and hope for the best.”