Three Lessons From Netflix on How to Live in the Cloud
Netflix is a big company, and a big cloud user. With 38 million members across 40 countries, it streams a billion hours of content per month.
Wed, October 09, 2013
Almost all of Netflix's customer-facing services, such as a massive database that creates personalized content recommendations based on prior viewing history, run in Amazon Web Services' public cloud.
The company has a content-delivery platform named Open Connect that it manages with partnering ISPs to actually stream movies to users.
As one of the biggest cloud users in the world, the company has gleaned lessons from its operations. Below are three takeaways on how the company approaches the cloud, from Ariel Tseitlin, director of cloud solutions for Netflix, who spoke at the Massachusetts Technology Leadership's Cloud Summit on Tuesday.
One Netflix goal is to create the smallest level of abstraction possible for each application, to minimize the effect of any downtime or service failure in the cloud. If this is done successfully, it drastically reduces the "blast radius" of any cloud outage, says Tseitlin, who's responsible for building out the company's cloud and ensuring its reliability.
For example, if Netflix's personalization service goes down, then the company defaults to a more generic recommended movies list that will suggest the most popular titles, but not necessarily those personalized to the user. That minimizes the snowball effect of one service bringing down others.
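The fallback behavior described above can be sketched in a few lines of Python. This is only an illustration of the pattern, not Netflix's code: the function names, the `POPULAR_TITLES` list and the simulated failure are all assumptions made up for the example.

```python
from typing import Callable, List

# Hypothetical stand-in for a "most popular titles" list.
POPULAR_TITLES = ["Popular Title A", "Popular Title B", "Popular Title C"]


def recommendations(user_id: str,
                    personalized: Callable[[str], List[str]]) -> List[str]:
    """Return personalized picks, or the generic list if that service fails."""
    try:
        return personalized(user_id)
    except Exception:
        # Contain the failure: degrade to a generic list instead of
        # letting the error spread to the caller.
        return list(POPULAR_TITLES)


def healthy_service(user_id: str) -> List[str]:
    # Hypothetical personalization service that is working normally.
    return [f"Pick for {user_id}"]


def failing_service(user_id: str) -> List[str]:
    # Hypothetical personalization service that is down.
    raise ConnectionError("personalization service unavailable")
```

Calling `recommendations("u1", failing_service)` returns the popular list rather than raising, which is the sense in which the failure of one service stays contained.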
Build in redundancy
It's one thing to have functionality of applications and services deployed to the cloud at granular levels; it's another to scale it and ensure it works all the time. That's why Netflix has horizontally scaled its service across the globe. Each service is deployed to at least three Availability Zones (AZs), which are isolated locations within Amazon's cloud. AWS recommends deploying to at least two AZs for its service-level agreement (SLA) to kick in. Not only are Netflix services deployed to three AZs, but they are each scaled independently, so that if an AZ fails, load balancers migrate traffic to the healthy AZs.
In addition to scaling to multiple AZs, the entire Netflix service is deployed in two regions within Amazon's cloud, U.S. East and EU West, with data replicated asynchronously between them. The idea is that if an entire region in Amazon's cloud were to fail, the service would still be available.
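To make the AZ failover idea concrete, here is a minimal sketch of a load balancer that spreads requests only across zones currently marked healthy. The zone names, the health map and the round-robin policy are assumptions for illustration. It does not reflect how Amazon's load balancers are actually implemented.

```python
from itertools import cycle
from typing import Dict, List


def healthy_zones(zone_health: Dict[str, bool]) -> List[str]:
    """Return the names of zones marked healthy, in a stable order."""
    return sorted(zone for zone, ok in zone_health.items() if ok)


def route(requests: List[str], zone_health: Dict[str, bool]) -> Dict[str, str]:
    """Assign each request to a healthy zone, round-robin."""
    zones = healthy_zones(zone_health)
    if not zones:
        raise RuntimeError("no healthy zone available")
    # zip stops when the requests run out; cycle repeats the zones.
    return dict(zip(requests, cycle(zones)))


# Hypothetical state: one of three zones has failed.
health = {"az-a": True, "az-b": False, "az-c": True}
assignment = route(["r1", "r2", "r3", "r4"], health)
```

With `az-b` marked unhealthy, every request lands on `az-a` or `az-c`. Scaling each zone independently matters here because the surviving zones must absorb the failed zone's share of the traffic.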
Even with monitoring and alerts that cover the entire operations of Netflix, failures will still happen. That's why the company has built a platform for monitoring its service and fixing mistakes. The Simian Army is a series of open source tools, developed internally by Netflix, that test the fault tolerance of the company's operations. Chaos Monkey is one that randomly kills various services to test failure at the application layer. Chaos Gorilla is another that brings down an entire AZ to test for high availability. Chaos Kong is a service in development that Netflix hopes to use eventually to test an entire region shutting down. Tseitlin says that Netflix is so concerned with testing and monitoring that it jokingly refers to itself as a monitoring company that occasionally delivers movies.
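The idea behind Chaos Monkey can be illustrated with a toy version: pick a running instance at random, remove it, and check that the service still has enough capacity. This is a sketch of the concept only. The function names and the in-memory set of instances are hypothetical, and nothing here uses the real Chaos Monkey tool or its interface.

```python
import random
from typing import Set


def chaos_monkey(instances: Set[str], rng: random.Random) -> str:
    """Remove one randomly chosen instance from the set and return its name."""
    # Sort first so the choice is reproducible for a given seed.
    victim = rng.choice(sorted(instances))
    instances.remove(victim)
    return victim


def service_available(instances: Set[str], minimum: int = 1) -> bool:
    """The service counts as available while enough instances remain."""
    return len(instances) >= minimum


# Hypothetical fleet of three instances for one service.
fleet = {"i-1", "i-2", "i-3"}
killed = chaos_monkey(fleet, random.Random(0))
still_up = service_available(fleet, minimum=2)
```

Running this kind of test routinely, rather than waiting for a real outage, is what lets a team find out whether the redundancy it built actually works.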