I serve as the co-chairman of the SVForum Cloud and Virtualization SIG, based in Silicon Valley. Thanks to our location, we are able to call upon an array of innovative and interesting speakers and attract a sophisticated, knowledgeable audience.
Last week’s SIG meeting was one of the most interesting we’ve had in our more than three-year history. Its title was “Cloud Computing the Netflix Way,” and we had two Netflix guest speakers: Adrian Cockcroft, director of architecture for cloud systems (his presentation is here), and Jason Chan, Netflix’s security architect (his presentation is here).
If you’re not familiar with its technology infrastructure, Netflix has, over the past few years, migrated almost entirely from an on-premises data-center environment to a cloud-based setup located in the Amazon Web Services infrastructure.
These two presentations are an almost perfect complement to my recent CIO.com article titled “Cloud Computing Calls for Rebuilding Enterprise IT.” Learning about what Netflix has done is an excellent primer on what, in my opinion, most enterprises will go through in the future.
One thing to be addressed right at the start: Is Netflix a good model for enterprise IT? When Netflix is brought up as a model, I hear many people respond along the lines of, “Well, it only has a couple of applications. We have thousands.” Or, “It’s an online company and not really an enterprise,” the implication being that what Netflix did is not applicable to “our” situation and environment, so it’s interesting but not really germane.
Nothing could be further from the truth.
First, Netflix started its journey with a traditional enterprise environment and a traditional data-center infrastructure. It found that the infrastructure was too fragile for its needs (i.e., things stopped working all the time), and the traditional operations model didn’t respond fast enough to the needs of the business. Netflix changed its approach because it recognized that the future of its business required a different way of doing things.
Second, companies are starting to look more and more like Netflix in terms of offering online services as a core part of their business. Think of GM and its OnStar service. If you’ve taken a Virgin America flight and seen the future of in-cabin entertainment, do you think it’s not collecting and analyzing that data to tune its offerings to individual customers? What Netflix is doing, company after company is doing as well. So, for at least a portion of their applications, most companies are starting to resemble Netflix. And one thing is for sure — managing the new type of applications with the practices and processes associated with existing applications is a recipe for disaster.
Pulling the Plug on the Data Center
Why did Netflix decide to get out of the data center business? The short answer is that there were two reasons.
First, Netflix’s business is growing rapidly and experiences very uneven demand (highly skewed toward evenings, when, by some accounts, its video streaming service represents 29 percent of all Internet traffic). In this kind of environment, Netflix didn’t want to experience service interruptions due to its inability to build data centers fast enough.
Second, even though Netflix is a highly technical organization, it wasn’t as good as Amazon when it came to automating data-center operations. Rather than try to replicate that ability, Netflix chose to leverage a highly efficient, low-cost expert provider.
What’s interesting is how Netflix goes about creating and running its environment.
Like a lot of online companies, it has blurred the concept of “release to production.” In fact, as I interpret Cockcroft’s presentation, it has blurred the concept of release. Rather than a release being a static collection of bits that are moved from one lifecycle stage to another, an application is composed of many, many fine-grained services. Each release may be thought of as a collection of services at a given point in time.
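One way to picture a release as "a collection of services at a given point in time" is the small sketch below. It is my own illustration, not anything shown in Cockcroft's presentation; the function names, the service names and the version strings are all hypothetical.

```python
from datetime import datetime, timezone


def snapshot_release(deployed, at=None):
    """Freeze the currently deployed service versions into a 'release' record.

    `deployed` maps a service name to its version string (hypothetical data).
    The mapping is copied so later deployments do not alter the snapshot.
    """
    return {"at": at or datetime.now(timezone.utc), "services": dict(deployed)}


def changed_services(old, new):
    """Names of services added, removed, or re-versioned between two snapshots."""
    a, b = old["services"], new["services"]
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

Under this view nothing is ever "released" as a single bundle. Each service moves on its own schedule, and a release is simply whatever the snapshot shows at the moment you look.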
This implies that each service must deliver high availability and be failure-resistant. In some sense, the Netflix architecture represents the apotheosis of SOA, with all the associated “abilities” that such an architecture carries — reliability, manageability, etc. For example, given that underlying infrastructure is fragile, the services are implemented with redundancy, failover and automatic restart.
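To make the redundancy and failover idea concrete, here is a minimal sketch. It assumes each redundant copy of a service can be represented as a plain Python callable; `call_with_failover` and the replica functions are hypothetical names of mine, not Netflix code.

```python
def call_with_failover(replicas, request):
    """Try each redundant replica in order and return the first successful answer.

    `replicas` is a list of callables standing in for redundant service
    instances (an assumption made for illustration).
    """
    last_error = None
    for replica in replicas:
        try:
            return replica(request)
        except Exception as exc:  # this replica is down; fail over to the next
            last_error = exc
    raise RuntimeError(f"all {len(replicas)} replicas failed") from last_error
```

The caller never needs to know which copy answered, which is what lets the service as a whole stay available while individual instances come and go.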
Also, applications and automated monitoring constantly check the performance and latency of services. Applications are written to call the services asynchronously, so that if one fails, the application does not hang, but moves on with a small piece missing or with slightly stale cached data. The monitoring mechanism constantly watches service performance and, if it observes intolerable variances, it will initiate a set of specific automated steps. If the service performance problem persists, the system will raise alerts to ensure that human attention is directed to the problem.
This can be taken even further. Since the underlying infrastructure can be untrustworthy, Netflix spreads its processing across many different Amazon data centers and regions. This makes it more complex and more challenging to operate, but it safeguards Netflix from even large infrastructure outages. (Netflix was notably unaffected by last April’s AWS outage, when many “Web 2.0” companies found themselves offline as a result of their decision not to absorb the additional cost and complexity of distributing their applications more widely across Amazon’s infrastructure.)
Then, if your application is composed of many services that are failure-prone, and your application architecture is written to be failure-proof for services, it makes sense to deliberately shut down portions of your production environment to see if the application is truly robust. Netflix famously does this with what it calls its “chaos monkey,” in which different service environments are randomly taken offline to confirm that the Netflix environment can continue operating in the face of resource failure. One thing that came out of the presentations is that Netflix has many monkeys, not just one. They do different things, but they all focus on validating the robustness of the environment when confronted with resource failure.
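The idea behind the chaos monkey can be illustrated with a toy, in-memory simulation. To be clear, this is not Netflix's tool and it touches no real infrastructure; the fleet layout, instance IDs and function names are assumptions made purely for the example.

```python
import random


def unleash_monkey(fleet, rng):
    """Terminate one randomly chosen running instance.

    `fleet` maps a service name to a set of instance IDs (hypothetical data).
    Returns the (service, instance) that was taken offline, or None if the
    fleet has no running instances.
    """
    candidates = [(svc, inst) for svc, insts in fleet.items() for inst in sorted(insts)]
    if not candidates:
        return None
    svc, inst = rng.choice(candidates)
    fleet[svc].discard(inst)
    return svc, inst


def still_serving(fleet):
    """True when every service has at least one instance left running."""
    return all(fleet.values())
```

Running `unleash_monkey` against a fleet in which each service has two or more instances should leave `still_serving` true. A service running on a single instance fails the check as soon as the monkey picks it, which is exactly the kind of weakness the exercise is meant to expose before a real outage does.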
Of course, if the concepts of release to production — and release itself — are called into question, so too is the role of operations. Netflix does not have a separate operations group for its cloud infrastructure — every developer is responsible for putting his or her code into production and is called when something breaks. Cockcroft has caused a bit of a ruckus in the cloud community by calling this “NoOps,” in contrast to “DevOps,” which many operations-focused folks feel is the future of large-scale cloud computing applications.
To my mind, the notion of fine-grained service in continuous deployment puts to rest the concept of a separate operations group responsible for putting applications into production and keeping them running. I believe Cockcroft is somewhat overstating the situation, as there are people tracking the service monitoring and ensuring that any performance and latency issues get addressed. The larger point is that the new model of applications requires a radical rethinking of application architectures, differing ways of moving fine-grained services through their individual lifecycle, differing ways of monitoring an “application,” and differing ways of ensuring robustness. As I said last week, cloud computing requires rebuilding enterprise IT for a completely new operating model.
Well, if all that is changing, how is security handled in the Netflix environment? Jason Chan’s presentation was eye-opening, to say the least. Chan has a long history in security. Before joining Netflix, he led the security team at VMware, so he knows whereof he speaks.
I found his perspective on security quite unusual for a “typical” security person. He led off by stressing that risk is the appropriate arbiter of what security practices should be implemented. Then he discussed how Netflix goes about implementing security. In light of Cockcroft’s presentation, it seems appropriate that Netflix creates services to implement common security measures. Developers can self-service under this model, which keeps them productive while ensuring that what is implemented meets security requirements. And it should come as no surprise that there is a “security monkey” to validate security practices within Netflix services.
Chan went on to note that using a public cloud environment poses challenges to the traditional methods of implementing security, but that overall, Netflix does not feel it has compromised its security by using AWS. The specifics of how Netflix has achieved its security stance are contained in Chan’s presentation, and reviewing it is well worth the time.
Perhaps the most interesting thing about Netflix is how it approached the overall proposition of using a cloud computing environment. It didn’t focus on how to make the cloud support its established application architectures and IT processes. Instead, it evaluated its applications and operations to understand how the new environment would affect the compute infrastructure and redesigned the applications to address that. If your organization is looking to aggressively move into cloud computing and is willing to examine what is required to truly leverage a cloud environment, the Netflix story is a critical example to understand.
Bernard Golden is the vice president of Enterprise Solutions for enStratus Networks, a cloud management software company. He is the author of three books on virtualization and cloud computing, including Virtualization for Dummies. Follow Bernard Golden on Twitter @bernardgolden.