How Long Will Big-Name Customers Like Netflix Put Up with Amazon Cloud Outages?
On Christmas Eve, as Netflix customers cuddled up to watch their favorite holiday movies and TV shows with friends and family, many ran into a problem: Netflix was down.
Thu, January 03, 2013
More precisely, Amazon Web Services' public cloud, which Netflix relies on to stream content to customers, experienced an outage in its U.S.-East region, the same region that has been the site of some of the biggest failures of the company's public cloud services.
Netflix is a poster-child customer for using AWS services at large scale. At Amazon's first-ever user conference in November, Netflix CEO Reed Hastings participated in a keynote Q&A with AWS CTO Werner Vogels, reiterating the value AWS provides to the company. Netflix cloud guru Adrian Cockcroft gave speeches to standing-room-only conference rooms at the show, advising customers on how to architect AWS clouds for fault tolerance and high availability.
While Cockcroft and Vogels have said many times that outages are inevitable, the timing of this most recent Christmas Eve crisis at Netflix and Amazon has some asking: How long will Netflix and other big-name customers put up with Amazon cloud outages?
In a post-mortem report, Amazon says an employee accidentally deleted data that controls Elastic Load Balancers (ELBs) in its cloud around 12:30 p.m. PT on Dec. 24. The maintenance process was thought to be running in a test environment, but it was in fact running against production. ELBs allow customers to spread workloads across multiple virtual machines, and without the deleted data, new ELB configurations could not be created. A first attempt to fix the problem by replacing the deleted data failed, and the data was not successfully restored until 5:40 a.m. Christmas morning. By 10:30 a.m. almost all issues had been resolved, but by AWS's estimate 6.8% of the company's running ELBs had been affected.
Netflix says the timing of the event was actually a good thing. In a blog post describing the incident, Cockcroft says Christmas Eve is traditionally a slow time for Netflix compared to Christmas Day. Some customers who stream Netflix to their televisions from gaming consoles and mobile devices had unavailable or spotty service for seven hours. Netflix is designed, Cockcroft says, to withstand the failure of a single Availability Zone within AWS's cloud, but not the failure of a service that spans multiple Availability Zones across an entire region, which is what happened with the ELBs. Netflix engineers are working on regional resiliency to prepare for the next outage, he says.

