Ensure Cloud Application Resilience the Netflix Way

These days, there's no such thing as a stable system, especially in the cloud. But most outages can be blamed on application architecture, not infrastructure. To combat this, do what Netflix does: Put your apps through a wringer that breaks them, and fix your software before your customers suddenly can't use it.

By Bernard Golden
Tue, February 19, 2013

CIO — One of the most heated topics in cloud computing today is the Service Level Agreement (SLA). From the highly charged discussion on the subject, you might expect that the primary factor affecting application availability is the willingness (or unwillingness) of a cloud service provider (CSP) to sign up to a rigorous SLA.

In fact, the application itself is the biggest factor affecting application availability. That's what all the furor about cloud SLAs is really about: How available are my applications? That's what's important to me, and an SLA is a somewhat correlated means to that end. More application outages are caused by what's going on in the application than are ever caused by infrastructure failure, and this is becoming even more true because of the increasingly complex nature of applications.

Unlike the simple client-server or even straightforward multi-tiered, single-machine-at-each-tier applications of the past 20 years, today's applications are a complicated mélange of multi-tier, horizontally scaled instances (that is, virtual machines) containing aggregations of software packages, calling internal and external services, and operating in highly variable load conditions that cause application topologies to constantly shift as new instances join and leave the application. The old model of resilience, "If it ain't broke, don't touch it," just won't work in this environment.

There's No Such Thing as a Stable System

In such applications, complex interactions between application components execute thousands of times per second. The same execution path for a user interaction may well not recur for days at a time, given the state of the user's session, the actions the user takes and the then-current topology of the application.

In fact, it may be fair to say that the same execution path will never be followed a second time, given the shifting nature of the entire application. In this environment, the CSP infrastructure is highly unlikely to be the only, or even the primary, cause of application outages.
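
To put rough numbers on that claim, here is a back-of-the-envelope sketch in Python. Everything in it is an assumption made for illustration: the tier sizes, the count of distinct session states, the request rate, and the function names `count_execution_paths` and `days_to_see_every_path` are all hypothetical, and the model simply treats a "path" as one instance chosen per tier combined with one session state.

```python
from math import prod


def count_execution_paths(instances_per_tier):
    """Count the instance-level routes a request can take when it
    touches exactly one instance in each tier (a simplifying assumption)."""
    return prod(instances_per_tier)


def days_to_see_every_path(instances_per_tier, session_states, requests_per_second):
    """Best-case number of days needed to exercise every (route, session state)
    combination once, assuming no request ever repeats a combination."""
    combinations = count_execution_paths(instances_per_tier) * session_states
    seconds = combinations / requests_per_second
    return seconds / 86_400  # seconds in a day


# Hypothetical topology: web, app, cache and database tiers.
tiers = [8, 20, 12, 6]
routes = count_execution_paths(tiers)
days = days_to_see_every_path(tiers, session_states=100_000, requests_per_second=1_000)
print(f"{routes} routes, about {days:.1f} days to exercise each combination once")
```

Even under these modest assumptions, and in the best case where no request repeats a combination, it would take roughly two weeks of steady traffic to exercise each combination once. Real traffic repeats popular paths, and the topology shifts in the meantime, so rare paths can go unexercised far longer.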

Obviously, the new model of application architectures and topologies means that the traditional approach to application resilience (install the application, then don't change it for as long as possible) is no longer workable. In a fascinating podcast I recently listened to, Richard Cook, an expert in complex systems, claimed that there is no longer such a thing as a stable system. Between system changes, maintenance schedules, operations activities and user interactions, one cannot even apply a model of "stability." The assumption that applications are simple collections of a limited number of components with well-understood execution paths and consistent performance characteristics is no longer tenable.
