by Barry Morris

Nonstop IT and the myth of zero-downtime

Opinion
May 15, 2017
Cloud Computing

Diamonds may be forever but they are non-trivial to produce; applications that are forever may be nearly as challenging to create.

A common idea in modern applications is that they “run forever”. That’s a fine objective, and a valuable one. But it is elusive, particularly for applications that were never cloud-native in the first place. And it raises quite fundamental issues at the data management layer of the stack. Diamonds may be forever but they are non-trivial to produce; applications that are forever may be nearly as challenging to create.

Downtime cost estimates generally produce arresting numbers, but for many CIOs they don’t really tell the story. Your downtime costs could be measured in hundreds of dollars per hour or millions of dollars per hour, and your downtime risks could be measured in billions of dollars. An automated trading system that is down during a market correction event could quickly represent a multibillion-dollar loss. But in truth the issue is no longer only the costs and risks of a crisis. Continuous availability has become the new norm: “All systems up, all the time.” Welcome to nonstop IT.

So we have become quite excited about redundant microservices that neatly avoid overload by “just adding nodes,” and that enable graceful node failure by seamlessly redirecting load to live nodes. Elastic microservices address both capacity adjustment and transparent failover at the application layer. Nice.
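As a rough illustration of the idea, the sketch below routes requests round-robin across whichever nodes are currently live, so a failed node is skipped and a newly added node starts taking load. The `NodePool` class and its method names are hypothetical, invented for this example; a real deployment would rely on a load balancer or service mesh with health checks rather than a hand-rolled class.

```python
from itertools import count


class NodePool:
    """Hypothetical in-memory pool that spreads requests over live nodes."""

    def __init__(self, nodes):
        self.nodes = list(nodes)   # every node known to the pool
        self.down = set()          # nodes currently marked as failed
        self._counter = count()    # monotonically increasing request counter

    def add_node(self, node):
        """Capacity adjustment: "just add a node"."""
        self.nodes.append(node)

    def mark_down(self, node):
        """Record a failure so later requests avoid this node."""
        self.down.add(node)

    def route(self):
        """Pick the next live node in round-robin order."""
        live = [n for n in self.nodes if n not in self.down]
        if not live:
            raise RuntimeError("no live nodes available")
        return live[next(self._counter) % len(live)]


pool = NodePool(["node-a", "node-b", "node-c"])
pool.mark_down("node-b")                 # simulate a node failure
print([pool.route() for _ in range(4)])  # node-b never appears
```

Note that this only works because each request can be served by any node. Nothing here addresses where the data lives.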

Nonstop IT is also how BlueGreen has gone from a classic Crayola color to a model for continuous deployment of applications, the Julia Child approach of one-I-prepared-earlier. Switching over between full application stacks at the web service level is an elegantly blunt solution for live application evolution.
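A minimal sketch of that switchover, under simplifying assumptions: two complete stacks exist side by side, a router points at one of them, and the cutover happens only if the idle stack passes a health check. `Stack`, `BlueGreenRouter`, and the health-check callables are hypothetical names for illustration; in practice the "router" is usually a load balancer or DNS entry.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Stack:
    """One full application stack (hypothetical stand-in for a deployment)."""
    name: str
    version: str
    healthy: Callable[[], bool]  # assumed health probe supplied by the caller


class BlueGreenRouter:
    """Sends all traffic to the live stack and can flip to the idle one."""

    def __init__(self, blue: Stack, green: Stack, live: str = "blue") -> None:
        self.stacks: Dict[str, Stack] = {"blue": blue, "green": green}
        self.live = live

    @property
    def idle(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def handle(self, request: str) -> str:
        stack = self.stacks[self.live]
        return f"{stack.name}/{stack.version} served {request}"

    def switch(self) -> bool:
        """Flip traffic to the idle stack, but only if it reports healthy."""
        if not self.stacks[self.idle].healthy():
            return False  # keep serving from the current live stack
        self.live = self.idle
        return True


router = BlueGreenRouter(
    blue=Stack("blue", "1.0", healthy=lambda: True),
    green=Stack("green", "1.1", healthy=lambda: True),
)
print(router.handle("GET /"))  # served by blue
router.switch()
print(router.handle("GET /"))  # served by green
```

The old stack stays in place after the flip, so rolling back is simply another call to `switch()`. The sketch assumes both stacks are stateless; it says nothing about a database shared between them.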

So for all greenfield applications that are stateless, the recipe for nonstop IT is in place. For everything else there is, of course, a much bigger problem. The world of databases, specifically, is considerably more challenging.

“Best practices” for zero-downtime with databases are “best” not because they are good ideas, but because of a paucity of alternatives. Planned database downtime and unplanned database downtime are the hardest challenges as we strive for the plenty-of-nines service-level agreement (SLA).
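To put "plenty of nines" in concrete terms, the short calculation below converts an availability target into a yearly downtime budget. It assumes a 365-day year and counts planned and unplanned downtime against the same budget; the function name is made up for this example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget, in minutes, for an availability of N nines."""
    unavailability = 10 ** -nines  # e.g. 3 nines (99.9%) -> 0.001
    return MINUTES_PER_YEAR * unavailability


for n in range(2, 6):
    print(f"{n} nines: {allowed_downtime_minutes(n):.1f} minutes per year")
```

At five nines the budget is only a little over five minutes a year, which a single planned database upgrade can easily exceed.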

High availability (HA) solutions for traditional relational database management systems (RDBMS) are complex and fragile. The RDBMS administrator is faced with a spectrum of alternatives, with a set of trade-offs for each. A typical answer to live upgrades is to use Oracle Data Guard to create an active standby and bravely leap from old to new. In the best case, such approaches help to reduce downtime, but they are not architected as zero-downtime solutions. And naturally, the costs in people and tools can be quite significant.

The dream is a database system with:

  • No single point of failure (SPOF)
  • Dynamic provisioning
  • Push-button backup and restore
  • Online maintenance (upgrades, schema changes, storage management, etc.)
  • Easy automation of administrative tasks
  • A Jobs-esque insane simplicity of usage

You might call that a “cloud-native database.”

If the nonstop IT imperative leads away from traditional single-server databases and toward cloud-native databases, then what realistic alternative directions are emerging for an organization building and migrating applications for elastic infrastructures? The answers really fall into three categories:

  1. Database as a service (DBaaS)
  2. NoSQL solutions
  3. Elastic SQL solutions

DBaaS offerings are pay-by-the-drink services available from cloud providers (IaaS) and third parties. The great advantage is that the challenges of managing the service are not your problem. You pay for a service with an SLA, and someone else takes care of how it is performed. However, there is typically much less functional and operational flexibility than is usually expected for enterprise applications. And there is much less control.

NoSQL solutions provide database capabilities that have advantages in terms of scale-out. Unlike traditional database systems, they are neither single-server solutions nor tightly clustered solutions. This enables them to provide resilience to failure and much better support for nonstop IT. On the downside, NoSQL solutions typically do not use the industry-standard query language (SQL), and many do not provide the full data guarantees (such as consistency and durability) that traditional databases provide and that applications generally expect.

Elastic SQL solutions aim to provide all of the advantages of a SQL RDBMS but with all the capabilities of a cloud-native database system. Applications interact with them in an identical manner to the way that they interact with a traditional database system, but under the covers the system is built as an elastic, distributed system. The disadvantage is that most elastic SQL solutions are relatively young: many are only just out of beta, and they offer varying degrees of SQL support. The trick here is to find one that has been in production use for a few years, so that the bugs have largely been worked out, and that has sufficient SQL support for you to migrate your existing applications easily.

Different answers are appropriate in different circumstances, of course. But if your strategic need is to deliver nonstop IT, then it may be time to review your plans on data management. An always-on application needs an always-on database system, and naturally enough that was never a design objective of the client-server RDBMS.