In June, Amazon Web Services (AWS) suffered two high-profile outages that left a number of customers—including clients like Instagram, Pinterest and Netflix—unable to provide services to their customers. But do AWS’ clients share some of the responsibility?
“There’s blame on both sides of the equation,” says Jason Currill, founder and CEO of Ospero, a global Infrastructure as a Service (IaaS) company. “From the Amazon side, clearly I think there’s an obvious issue with their redundancy power. After one outage, you’d think they’d have learned their lesson. Redundancy power is one of those elementary things that data centers are normally very, very good at.”
Organizations Must Treat Cloud Providers as a Utility
But the clients affected by the outage also share blame, Currill says, because they failed customers using their services.
“If you’re a corporation and you have a building, you have a diesel generator in the basement in case the electricity goes out,” he says. “You have two telco lines coming in so if you lose one, you still have communications. Cloud is the same thing. It’s a utility. Have two.”
Generator Failures, Software Bug Cause Outage
In a detailed post-mortem released after the most recent outage on June 29, Amazon cited a series of power outages, generator failures and rebooting backlogs that led to a “significant impact to many customers.”
The problems began with a large-scale electrical storm in northern Virginia, in what Amazon designates its U.S. East-1 Region. U.S. East-1 consists of more than 10 data centers structured into multiple availability zones. This structure is designed to prevent exactly the sort of problem that occurred on June 29; availability zones run on their own physically distinct, independent infrastructure.
Common points of failure like generators and cooling equipment are not shared across availability zones. In theory, even disasters like fires, tornados or flooding would only affect a single availability zone and service should remain uninterrupted by routing around that availability zone to the others.
But on that Friday, when a large voltage spike occurred in the electrical switching equipment in two of the U.S. East-1 data centers, there was a problem bringing the generators online in one of the affected data centers.
The generators all started successfully, but each generator independently failed to provide stable voltage as it was brought into service. Since the generators weren’t able to pick up the load, the data center’s servers operated on their Uninterruptible Power Supply (UPS) units. The utility restored power a short time later. But then power went out again. Again the power failed to transfer to the backup generators and again the servers operated on their UPS power.
About seven minutes later, the UPS systems were depleted and the servers began to fail. Within 10 minutes, onsite personnel managed to stabilize backup generator power and restart the UPS systems, and 10 minutes after that the facility had power to all its racks. But the damage had been done.
The lack of power brought down the Elastic Compute Cloud (EC2) and Elastic Block Store (EBS) services in the affected availability zone, preventing customers from creating new EC2 instances, creating EBS volumes or attaching volumes in any availability zone in the U.S. East-1 region between 8:04 and 9:10 p.m. PDT.
“The vast majority of these instances came back online between 11:15 p.m. PDT and just after midnight,” AWS said in its post-mortem. “Time for the completion of this recovery was extended by a bottleneck in our server booting process.”
The bottleneck was a result of a bug in Amazon’s Elastic Load Balancers (ELBs), a tool for high availability that allows Web traffic directed at a single IP address to be spread across many EC2 instances. When deployed across multiple availability zones, the ELB service maintains redundant ELBs in each availability zone the customer requests, so that the failure of a single machine or data center won’t take down the end point.
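To make the idea concrete, here is a minimal sketch of a load balancer that spreads requests across instances in several availability zones and routes around a zone that has failed. It is a toy illustration, not Amazon's ELB implementation: the `Instance` and `ZoneAwareBalancer` names, the round-robin policy and the zone labels are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    zone: str            # hypothetical availability zone label
    healthy: bool = True


class ZoneAwareBalancer:
    """Toy round-robin balancer that skips unhealthy instances."""

    def __init__(self, instances):
        self.instances = list(instances)
        self._cursor = 0

    def mark_zone_down(self, zone):
        # Simulate losing every instance in one availability zone.
        for inst in self.instances:
            if inst.zone == zone:
                inst.healthy = False

    def route(self):
        # Try each instance at most once, starting from the cursor.
        for _ in range(len(self.instances)):
            inst = self.instances[self._cursor]
            self._cursor = (self._cursor + 1) % len(self.instances)
            if inst.healthy:
                return inst
        raise RuntimeError("no healthy instances in any zone")


balancer = ZoneAwareBalancer([
    Instance("web-1", "zone-a"),
    Instance("web-2", "zone-b"),
    Instance("web-3", "zone-a"),
])
before = [balancer.route().name for _ in range(2)]
balancer.mark_zone_down("zone-a")
after = [balancer.route().name for _ in range(2)]
print(before)  # ['web-1', 'web-2']
print(after)   # ['web-2', 'web-2']
```

With instances in only one zone, the same failure would leave `route()` with nothing to return, which is the single point of failure the multi-zone design is meant to avoid.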
That Friday, when the availability zone went down, the ELB control plane began shifting traffic to account for the loss of load balancers in the affected availability zone. When power and systems came back online in the affected zone, a bug caused the ELB control plane to attempt to scale these ELBs to larger ELB instance sizes, resulting in a sudden flood of requests that created a backlog. The problem was exacerbated because customers began launching new EC2 instances to replace capacity lost in the downed availability zone. The ELB control plane managed requests for the U.S. East-1 region through a shared queue, causing it to fall behind in processing the requests.
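The backlog effect is easy to see in a small simulation. The sketch below models a single shared queue that can process a fixed number of requests per tick and tracks how deep the queue gets when a burst arrives. The function name, the capacity and the arrival figures are hypothetical numbers chosen for illustration; they are not taken from Amazon's post-mortem.

```python
def simulate_backlog(arrivals_per_tick, capacity_per_tick):
    """Return the queue depth after each tick for one shared FIFO queue."""
    backlog = 0
    depths = []
    for arrivals in arrivals_per_tick:
        # Requests that cannot be processed this tick carry over to the next.
        backlog = max(0, backlog + arrivals - capacity_per_tick)
        depths.append(backlog)
    return depths


# Hypothetical load: steady traffic versus a burst during recovery.
steady = [8, 9, 7, 8]
recovery_burst = [8, 40, 35, 12, 8, 8]

print(simulate_backlog(steady, capacity_per_tick=10))
# [0, 0, 0, 0]
print(simulate_backlog(recovery_burst, capacity_per_tick=10))
# [0, 30, 55, 57, 55, 53]
```

In this toy model the queue is still more than 50 requests deep three ticks after the burst ends, because the spare capacity at normal load is small. Every requester sharing that queue waits behind the burst, whichever zone it is in.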
Amazon has promised to resolve the bottleneck issue, but for some AWS customers, the outage was the straw that broke the proverbial camel’s back. For online dating site WhatsYourPrice.com, which went down during a period of prime activity for its users, AWS’ unpredictable data center issues were a sign it was time to leave EC2 for a Las Vegas-based hosting facility.
“Amazon’s failure has negatively affected our Web site’s reputation as a reliable online dating destination,” says Brandon Wade, founder and CEO of WhatsYourPrice.com, which was flooded with thousands of member complaints. “One hundred percent uptime is a required SLA for anyone providing cloud computing services. Amazon’s inability to provide such service levels is the main reason we have decided to quit using AWS EC2 altogether.”
Changing Hosts Is Not Enough, You Need a Redundant Cloud
But Ospero’s Currill says that changing hosts because the host has gone down doesn’t actually resolve the problem. Instead, companies seeking to leverage the cloud should take advantage of its ability to create geographically redundant links.
“Putting all your eggs in one basket is clearly a good and bad strategy, good because you get to be a big customer of a provider, you get economies of scale, better pricing, someone should pick the phone up when you call etc. etc.,” Currill says. “Bad because you give away some control, when AWS went down for many infrastructure teams at customer sites who may have engineered their application to be redundant inside their host, clearly what they didn’t take into account was the unthinkable: what happens if the host goes down?”
Relying on a single provider to manage your entire infrastructure without a disaster recovery/back-up strategy with another cloud provider is “commercial madness,” he says.
“Now I understand there is a cost element here, the cost of replicating some or your entire infrastructure to spin up when a disaster happens is expensive, isn’t it? Well, yes and no,” he says. “Yes, it’s going to add some level of cost, but what you get from that is control—you the system admin from Pintflixogram get the control that if your primary host goes down, you get to fire up another secondary host and maintain service.”
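As a rough sketch of what that control could look like in practice, the snippet below switches traffic to a secondary provider after the primary fails a set number of consecutive health checks. The `FailoverRouter` class, the provider labels and the threshold are assumptions made for illustration; a real deployment would also need data replication and DNS or traffic management, which are not shown here.

```python
class FailoverRouter:
    """Toy primary/secondary selector driven by health-check results."""

    def __init__(self, primary, secondary, failure_threshold=3):
        self.primary = primary
        self.secondary = secondary
        self.failure_threshold = failure_threshold
        self.active = primary
        self._failures = 0

    def record_health_check(self, primary_ok):
        """Record one health check of the primary; return the active host."""
        if primary_ok:
            # Simplification: fail back as soon as the primary recovers.
            self._failures = 0
            self.active = self.primary
        else:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self.active = self.secondary
        return self.active


# Hypothetical provider labels and check results.
router = FailoverRouter("primary-cloud", "secondary-cloud", failure_threshold=2)
history = [router.record_health_check(ok) for ok in (True, False, False, True)]
print(history)
# ['primary-cloud', 'primary-cloud', 'secondary-cloud', 'primary-cloud']
```

Requiring several consecutive failures before switching avoids flipping providers on a single dropped check. Failing back immediately is a simplification, and an operator might prefer to make that step manual.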
“Let’s remember,” Currill says, “AWS is not the only hosting company on the planet; perception dictates that sometimes people think they are, but there are plenty of regional outfits in the market that are not as cheap as AWS, but guess what—they don’t go down. And no, if you balance the reputational risk, the customer support calls you have to field, the tickets raised, the PR damage limitation exercise and finally the churn as your customer base leaves for your competitor, then no, it’s not expensive.”
Thor Olavsrud covers IT Security, Big Data, Open Source, Microsoft Tools and Servers for CIO.com.