by Bernard Golden

Ensure Cloud Application Resilience the Netflix Way

Feb 19, 2013 | 7 mins
Cloud Computing, Cloud Management, Developer

These days, there's no such thing as a stable system, especially in the cloud. But most outages can be blamed on application architecture, not infrastructure. To combat this, do what Netflix does: Put your apps through a wringer that breaks them and fix your software before your customers suddenly can't use it.

One of the most heated topics in cloud computing today is the Service Level Agreement (SLA). From the highly charged discussion on the subject, you might expect that the primary factor affecting application availability is the willingness (or unwillingness) of a cloud service provider (CSP) to sign up to a rigorous SLA.

In fact, the application itself is the biggest factor affecting application availability. That’s what all the furor about cloud SLAs is really about: How available are my applications? That is what’s important to me, and an SLA is only a somewhat correlated means to that end. More application outages are caused by what’s going on in the application than are ever caused by infrastructure failure, and this is becoming even more true because of the increasingly complex nature of applications.

Unlike the simple client-server or even straightforward multi-tiered, single-machine-at-each-tier applications of the past 20 years, today’s applications are a complicated mélange of multi-tier, horizontally scaled instances (that is, virtual machines) containing aggregations of software packages, calling internal and external services, and operating in highly variable load conditions that cause application topologies to constantly shift as new instances join and leave the application. The old model of resilience, “If it’s not broke, don’t touch it,” just won’t work in this environment.

There’s No Such Thing as a Stable System

It is the nature of such applications that complex interactions between application components execute thousands of times per second. It’s likely that the same execution path for a user interaction may not occur for days at a time, given the state of the user’s session, the actions the user takes and the then-current topology of the application.

It might not, in fact, be wrong to say that the same execution path will never be followed a second time, given the shifting nature of the entire application. Compared to this environment, the CSP infrastructure is highly unlikely to be the only, or even primary, cause of application outages.

Obviously, the new model of application architectures and topologies means that the traditional solution associated with application resilience—install the application, then don’t change it for as long as possible—is no longer workable. In fact, in a fascinating podcast I recently listened to, Richard Cook, an expert in complex systems, claimed that there is no longer such a thing as a stable system. Between system changes, maintenance schedules, operations activities and user interactions, one cannot even apply a model of “stability.” The assumption that applications are simple collections of a limited number of components with well-understood execution paths and consistent performance characteristics is no longer tenable.

Just as obviously, the traditional way of assessing application resiliency, checking that the application is up and that no user is actively complaining, is no longer workable either.

Cloud Application Performance Management Can Only Do So Much

To address this problem, companies typically turn to a class of tools that offer application performance management, or APM. These tools mimic end user interaction to evaluate user experience, perform detailed monitoring of software components, and provide analytics across time to identify trends in application behavior and performance. This approach to resilience might be called “If it’s not broke, watch and get ready to fix it.”

This is all well and good. But it’s not enough. While understanding how the application is operating is helpful for managing typical use patterns, no APM can help you address problems that are going to arise because of the complexity and continuous change associated with today’s applications.

Simply put, it’s not enough to run the app, attach APM and expect things to go well—or even to expect the problems that will arise to be well-bounded. The application elements that will cause problems are unknown, the triggering events unpredictable. Therefore, the application problems one can expect to see in today’s environments require more than waiting and responding when a problem arises.

More important, the fixes one must make to address the problems that will be seen are unknowable in advance. This means outages may be lengthy as development organizations attempt to sort through layer upon layer of complexity to understand the problem and design a fix. Clearly, if this approach is inadequate for the new application model, something different is required—something that represents an approach that is appropriate for the new application type.

Netflix Uses Army of Monkeys to Make Apps Robust

The company that best represents this new application architecture and topology is undoubtedly Netflix. It runs a highly decentralized application composed of independent services that are aggregated to provide specific functionality. Each of the services operates separately, and the resulting application is unique for every user. Finally, the entire collection of services runs in Amazon Web Services.

The approach Netflix has taken to address the problems associated with these new applications reveals a new pattern of resilience, one we will see more in the future as companies move to this new application design orientation. Its approach might be summed up as “If it’s not broke, break it”—to be sure the application is robust in the face of unexpected failures.

Netflix began its resilience approach in a straightforward way. It developed a tool to unexpectedly shut down instances within the underlying AWS infrastructure to ensure its application is robust in the face of resource failure. It dubbed this tool Chaos Monkey. It has continued development of other tools to improve resilience, uses Monkey as a standard naming convention and refers to the collection of tools as the Simian Army.
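The idea behind Chaos Monkey can be illustrated with a minimal sketch. This is not Netflix's code: it runs against an in-memory list rather than AWS, and the names `Instance`, `unleash_chaos`, `degraded_services`, the `kill_probability` parameter, and the service names are all hypothetical, chosen only for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    instance_id: str
    service: str
    running: bool = True


def unleash_chaos(fleet, rng, kill_probability=0.5):
    """Shut down at most one randomly chosen running instance per service."""
    running_by_service = {}
    for inst in fleet:
        if inst.running:
            running_by_service.setdefault(inst.service, []).append(inst)

    terminated = []
    for instances in running_by_service.values():
        # Roll the dice once per service; a real tool would call the
        # cloud provider's API here instead of flipping a flag.
        if rng.random() < kill_probability:
            victim = rng.choice(instances)
            victim.running = False
            terminated.append(victim.instance_id)
    return terminated


def degraded_services(fleet):
    """Return the services that no longer have any running instance."""
    services = {inst.service for inst in fleet}
    alive = {inst.service for inst in fleet if inst.running}
    return sorted(services - alive)


if __name__ == "__main__":
    # Hypothetical fleet: two services with two instances each.
    fleet = [
        Instance("i-001", "catalog"),
        Instance("i-002", "catalog"),
        Instance("i-003", "playback"),
        Instance("i-004", "playback"),
    ]
    killed = unleash_chaos(fleet, random.Random(), kill_probability=1.0)
    print("terminated:", killed)
    print("services with no capacity left:", degraded_services(fleet))
```

In this toy model, a service that runs on a single instance shows up in `degraded_services` as soon as that instance is picked, which is exactly the kind of weakness the exercise is meant to expose before customers do.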

Janitor Monkey cleans up after the application by shutting down unneeded instances; Security Monkey finds instances with improper security settings and shuts them down; and Doctor Monkey tracks instance behavior and shuts down instances that have poor response time or show high resource use without corresponding useful activity.
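The Doctor Monkey rule described above (poor response time, or high resource use without useful activity) can be sketched as a simple predicate. This is only an illustration of the rule as stated here, not Netflix's implementation; the `HealthSample` fields and every threshold value are assumptions.

```python
from dataclasses import dataclass


@dataclass
class HealthSample:
    instance_id: str
    avg_response_ms: float
    cpu_percent: float
    requests_per_sec: float


def is_unhealthy(sample,
                 max_response_ms=500.0,      # assumed threshold
                 busy_cpu_percent=80.0,      # assumed threshold
                 min_requests_per_sec=1.0):  # assumed threshold
    """Flag an instance that is slow, or busy without serving traffic."""
    too_slow = sample.avg_response_ms > max_response_ms
    busy_but_idle = (sample.cpu_percent > busy_cpu_percent
                     and sample.requests_per_sec < min_requests_per_sec)
    return too_slow or busy_but_idle


def instances_to_shut_down(samples):
    """Return the IDs of instances the rule would shut down, in input order."""
    return [s.instance_id for s in samples if is_unhealthy(s)]


if __name__ == "__main__":
    samples = [
        HealthSample("i-ok", 120.0, 45.0, 30.0),
        HealthSample("i-slow", 900.0, 50.0, 25.0),
        HealthSample("i-spinning", 100.0, 95.0, 0.2),
    ]
    print(instances_to_shut_down(samples))
```

Keeping the rule as a pure function over collected samples makes it easy to tune the thresholds against historical data before letting it shut anything down.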

Netflix even has the Chaos Gorilla, which simulates an outage of an entire AWS availability zone. Netflix is currently AWS region-bound, but is actively exploring how to spread its application across regions to further improve resilience. One can be sure that there will be another tool created to validate resilience in the event of an entire region going down; perhaps it will be called Chaos King Kong.
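The question a Chaos Gorilla exercise answers can be approximated on paper before anything is actually shut down: if one availability zone disappeared, which services would have no instances left? The sketch below assumes a hypothetical list of (service, zone) placements; the function name and the zone and service labels are illustrative only.

```python
from collections import defaultdict


def services_lost_in_zone_outage(placements, failed_zone):
    """Return services with no surviving instance once failed_zone is gone.

    placements is a list of (service, zone) pairs, one per instance.
    """
    surviving = defaultdict(int)
    services = set()
    for service, zone in placements:
        services.add(service)
        if zone != failed_zone:
            surviving[service] += 1
    return sorted(s for s in services if surviving[s] == 0)


if __name__ == "__main__":
    # Hypothetical layout: "billing" runs in only one zone.
    placements = [
        ("api", "zone-a"),
        ("api", "zone-b"),
        ("billing", "zone-a"),
    ]
    for zone in ("zone-a", "zone-b"):
        print(zone, "->", services_lost_in_zone_outage(placements, zone))
```

A check like this only covers instance placement. Forcing the failure for real, as Netflix does, also surfaces problems a static check cannot see, such as timeouts and overloaded survivors.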

Fixing Apps In the Heat of the Moment Is a Bad Idea

The basic philosophy behind the Simian Army is that waiting to see how your application responds to problems, and then depending upon monitoring tools to do forensic analysis while the application is down, resolves the issue too late. For applications that a company depends on for service delivery and revenue generation, post-problem mitigation is unacceptable. It’s far better to force problem scenarios, evaluate how the application responds, and improve the components or operations necessary to maintain operation in the face of resource failure. Netflix tends to do this during low-load periods, when the problems it causes or the extra load it imposes will not overtax the system.

IT is moving toward a world where Netflix-like application design and operational topologies will become the norm. Traditional resilience approaches and mitigation strategies won’t work or won’t be acceptable in this world. In any event, attempting to address problems in the heat of a downtime situation is a poor approach, since stress inevitably degrades analytical capabilities and solution quality.

Netflix is making a major open source push by releasing its various application tools under a sharing-friendly license. If you’re looking to the future, evaluate what Netflix is doing and consider how it can be applied in your environment.

Bernard Golden is the vice president of Enterprise Solutions for enStratus Networks, a cloud management software company. He is the author of three books on virtualization and cloud computing, including Virtualization for Dummies. Follow Bernard Golden on Twitter @bernardgolden.
