Development and operations (DevOps) tools such as Puppet and Chef automate changes to configurations in systems. Some teams use these tools, and other frameworks, to actually automate the creation of the entire production Web server environment—sometimes in public services such as Amazon Web Services, sometimes in a local environment.
The problem with automating the rollouts is that code has bugs. A configuration change meant for QA—say, to direct users to a test environment—could be propagated to production, leaving users logged into an environment that looks real but will never actually ship products. (Don't laugh too hard: Last month this happened to one of my clients, a multibillion-dollar retail operation moving to customer self-service.)
The good news is that, as new risks emerge, so do new techniques to manage them. Here are a few things to think about in any cloud transition.
99 Problems—and Code Defects Are One
Amazon's Elastic Compute Cloud has occasional, unpredictable outages. Even without Amazon, if a company uses Chef or Puppet to automate system administration, those tools use code, and that code could have defects.
Here are a few possible problems with a cloud implementation:
- A feature is created in production but disabled by a configuration flag. A programmer enables it in the GUI, but the behavior remains "off."
- A private cloud manager designed to roll out new servers over time has a defect in the "reaper" process that turns off old instances.
- Mistakes in the merge process can put test configurations such as databases, server names and URLs into production.
- API issues, especially a third-party API that changes after the code "passes" the test environment.
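The merge-process risk above, test configurations leaking into production, is one that a simple automated check can catch before rollout. Here is a minimal sketch of such a guard in Python; the marker strings and the sample config keys are invented for illustration, not taken from any particular tool.

```python
# Hypothetical pre-deployment check: scan a flattened key/value config
# for values that look like they belong to a test environment.

FORBIDDEN_MARKERS = ("qa.", "test.", "staging.", "localhost")

def find_test_leaks(config: dict) -> list:
    """Return the (key, value) pairs whose value looks like a test setting."""
    leaks = []
    for key, value in config.items():
        text = str(value).lower()
        if any(marker in text for marker in FORBIDDEN_MARKERS):
            leaks.append((key, value))
    return leaks

prod_config = {
    "db_host": "db.qa.example.com",     # oops: a QA hostname slipped through
    "api_url": "https://api.example.com",
}
print(find_test_leaks(prod_config))     # flags the leaked QA database host
```

Run as the last gate before a push, a check like this turns the "users logged into a fake store" scenario into a failed build instead of a production incident.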
All these problems could appear first in production. In fact, they're likely to first appear in production, with no visible signs in the test environment. A week of phone calls, interviews and a trip to San Diego to discuss this in person at the Software Test Professionals Conference have led me to conclude that there are no easy answers.
A traditional test approach won't find these problems. Instead, the people I interviewed recommended two things: Either change the architecture to reduce risk or monitor, test and (quickly) fix issues in production.
For Better Software Stability, Change the Architecture
The problem, at least according to Adam Goucher, isn't that cloud computing introduces new risks. Rather, it's that it requires a different kind of thinking. Goucher should know; as a consultant for Selenium, he has presented full-day tutorials on the topic of cloud services, and recently authored Testing for the Cloud for the Pragmatic Bookshelf.
Goucher suggests a culture change: stop modifying configurations by hand and use automation tools throughout the pipeline. "Some people just aren't comfortable with continuous delivery, so they take a piecemeal approach to architecture, borrowing one tool or another," he says. "Build a private cloud with Chef, but do DB migrations by hand, [and] that's going to introduce instability you don't have to introduce."
To get there, Goucher suggests starting with something other than your production application. Your first cloud application might be a toy project, a sample that doesn't do anything. After that, the team might write an internal tool, perhaps something to automate part of the testing process. Building the internal tool bootstraps the framework for a cloud production environment without taking on production risk.
Once you're confident with the tools, Goucher recommends a strangler pattern. "Take a small piece of the application, perhaps just the REST API, or just one REST API service. Segment it from the rest of the code and implement an entire end-to-end cloud stack," he says. "Configure as code with Chef or Puppet; automate provisioning with [Amazon Elastic Compute Cloud] or OpenStack. Enable no-touch deployment and automate your database migrations. Do the entire conversion on a tiny sliver of functionality then extend it, not the other way around."
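The strangler pattern Goucher describes can be sketched as a front-end dispatcher that routes the carved-out slice to the new cloud stack and everything else to the legacy application. This is an illustrative sketch, assuming hypothetical handler functions and path prefixes, not Goucher's own code.

```python
# Hypothetical strangler-pattern router: one migrated slice at a time.

def legacy_app(path: str) -> str:
    """Stand-in for the existing monolith."""
    return f"legacy handled {path}"

def new_cloud_stack(path: str) -> str:
    """Stand-in for the fully automated end-to-end cloud stack."""
    return f"cloud stack handled {path}"

# Grow this tuple as each sliver of functionality is converted.
MIGRATED_PREFIXES = ("/api/orders",)

def dispatch(path: str) -> str:
    """Send migrated paths to the new stack; leave the rest alone."""
    if any(path.startswith(prefix) for prefix in MIGRATED_PREFIXES):
        return new_cloud_stack(path)
    return legacy_app(path)
```

The design choice is the point: migration risk is bounded by the prefix list, so a defect in the new stack can only affect the sliver you chose to convert.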
If a Server Falls In the Woods, But No One Hears It, It Makes No Sound
It's been six years since Ed Keyes, a test engineer at Google, proclaimed at a Google Tech Talk, "Production monitoring, sufficiently advanced, is indistinguishable from testing." The "sense and respond" model that Keyes advocates asks this question: If your own operations team notices and fixes a problem in production before most customers, do you really have a problem at all?
Noah Sussman had that idea in mind when he implemented continuous integration and deployment at Etsy. "A lot of the research on preventing defects comes from air travel and medical software work, where the cost to make a change in production is very expensive and the impact of an error is catastrophic," Sussman says. "A long release tail might make sense for shipping a physical CD, or even pushing a version to an app store. With a website, you don't need to do that; anyone who refreshes the page gets the latest version. This changes the risk profile for the web."
Case Study: Rapid Application Development the Etsy Way
To get to rapid deployment, Sussman continues, "You need to get comfortable pushing to production all the time. That means it has to be safe to make changes on a constant basis. We view a change that doesn't have the desired effect as a learning outcome."
That means changes might not increase revenue. A site might even go down for a few minutes, and revenue might decrease during that time, but the team learns something from the experience with lasting value for next time.
To catch errors fast, Sussman suggests monitoring hooks on every level of the application, perhaps using a tool such as StatsD. This requires open access to the codebase, where any programmer can commit on any level of the architecture at any time and, therefore, monitor every piece of system behavior.
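The idea behind a hook like this is that emitting a metric should be a one-liner, cheap enough to sprinkle through every layer. Below is a minimal sketch of a StatsD-style counter emitter, assuming a StatsD daemon listening on localhost port 8125 (the conventional default); the metric name is invented.

```python
# Minimal StatsD-style client sketch: counters sent fire-and-forget over UDP.

import socket

def format_metric(name: str, value: int, metric_type: str = "c") -> bytes:
    """Build a StatsD wire-format line, e.g. b'checkout.success:1|c'."""
    return f"{name}:{value}|{metric_type}".encode("ascii")

def emit(name: str, value: int = 1,
         host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send one counter increment over UDP; losing a packet is acceptable."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(format_metric(name, value), (host, port))
    finally:
        sock.close()

emit("checkout.success")  # one line of instrumentation at the call site
```

UDP is the deliberate choice here: instrumentation must never slow down or crash the code path it measures, so a dropped metric beats a blocked request.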
"You need to make every piece of application data open, encourage people to share data, to share graphs," he says. "The more deeply the team understands the application as a whole, the [fewer] defects you'll have."
Testing in Production: An Idea That Just Might Work
Karthik Ravindran, a director of product management at Microsoft who works on Visual Studio, believes the fix is to consider production just another part of the lifecycle and to keep testing. Microsoft's Azure service includes the Global Services Monitor, which lets programmers take automated tests, register them with Azure and run them in production all the time. When a service goes down, the operations team can be notified immediately. It takes Keyes' idea of merging testing and monitoring one step further.
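A synthetic-transaction monitor of the kind Ravindran describes can be reduced to a small loop: probe a production endpoint, and page someone when it fails. The sketch below assumes a hypothetical health-check URL and a caller-supplied notify hook; it illustrates the pattern, not Microsoft's Global Services Monitor itself.

```python
# Sketch of a synthetic production check: probe an endpoint, alert on failure.

import urllib.request

def check_endpoint(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:          # covers connection refused, timeouts, DNS errors
        return False

def run_monitor(url: str, notify) -> None:
    """Run one probe; call the alerting hook on failure.

    A scheduler (cron, a cloud monitoring service, etc.) would invoke
    this continuously against production.
    """
    if not check_endpoint(url):
        notify(f"synthetic check failed for {url}")
```

In use, `notify` would be whatever reaches the on-call rotation, such as email, a pager service, or a chat webhook.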
The last piece of the puzzle may be testing those pesky configurations: The data files that are (and should be) different between test and production. Some open source tools, including Puppet, do have the ability to write automated tests, but most companies keep those tests internal. As a result, there's no great body of open source examples.
A new generation of companies is emerging to provide configuration testing tools, though. ScriptRock, for example, provides a visual layer to design infrastructure, along with the tools to write tests against that infrastructure. You might write a test against a config file, then run that test to make sure the files that should not change do not change with a new rollout.
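That last idea, asserting that certain files stay frozen across a rollout, needs nothing exotic. Here is a minimal sketch in the same spirit, assuming you record a baseline hash of each protected file at release time; the file names and helper names are invented, and this is not ScriptRock's implementation.

```python
# Sketch of a config-stability test: hash protected files, fail on drift.

import hashlib

def file_fingerprint(path: str) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def check_frozen_files(baselines: dict) -> list:
    """Given {path: expected_digest}, return the paths that have drifted."""
    return [path for path, expected in baselines.items()
            if file_fingerprint(path) != expected]
```

Run after every deployment, a non-empty result means a rollout touched a file it had no business touching, exactly the class of defect the merge-process bullet earlier warned about.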
Between limiting risk with a strangler pattern, improving reaction and fix time, and testing in production with synthetic transactions, tools do exist to reduce risk in production. The challenge, as always, is to make good choices. Having a rich collection of options won't solve the problem—but it's a nice place to start.
Matthew Heusser is a consultant and writer based in West Michigan. You can follow Matt on Twitter @mheusser, contact him by email or visit the website of his company, Excelon Development. Follow everything from CIO.com on Twitter @CIOonline, Facebook, Google+ and LinkedIn.