Whether you host your services in the cloud, consume cloud services or provide them, chances are you deal with vendors that make promises about uptime and availability.
Examining the nuances around uptime, downtime, service-level agreements and “9s”-style availability can give you a better understanding of what you are agreeing to when you purchase hardware and services.
The vocabulary of downtime
Remember that vendors deliberately choose the vocabulary and mechanisms they use to communicate downtime. Simply put, the goal is to minimize the financial liability they would incur in satisfying the performance covenants of their contracts.
You might think of downtime as the physical time that a service is down, but a more useful metric might be “effective downtime,” which adds to the outage itself the time it takes to return the affected hardware or software to production status. After all, there is little reason to consider a service back “up” if the power has come back on but an array needs an hour to check its integrity.
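As a minimal sketch of that idea, effective downtime can be modeled as the raw outage plus the recovery time. The function name `effective_downtime` and the incident times below are hypothetical, chosen only for illustration:

```python
from datetime import datetime, timedelta


def effective_downtime(outage_start, service_restored, recovery_time):
    """Raw outage duration plus the time needed to get back to production status."""
    raw_outage = service_restored - outage_start
    return raw_outage + recovery_time


# Hypothetical incident: power is out for 20 minutes, then the storage
# array spends an hour checking its integrity before it can serve requests.
start = datetime(2024, 1, 1, 2, 0)
restored = datetime(2024, 1, 1, 2, 20)
total = effective_downtime(start, restored, timedelta(hours=1))
print(total)  # 1:20:00
```

In this example a 20-minute power loss counts as 80 minutes of effective downtime.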
Clustering and other sorts of fault tolerance solutions mitigate this somewhat; some newer clustering solutions can even handle failing over a load to another host or client with no perceptible downtime. But not all systems are equipped to handle this.
Additionally, your availability is only as good as your weakest link. If you are running a cloud service and your servers are promised at five 9s and your storage array is promised at five 9s but your routers are only at four 9s, your whole solution is at best four 9s. If the service needs all three components at once, it will be slightly worse than that, because each component’s allowed downtime can fall at a different time.
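One rough way to check the weakest-link point is to multiply the component availabilities together. This sketch assumes the service needs every component at once and that the components fail independently, neither of which a real contract guarantees. The component names and the helper `serial_availability` are hypothetical:

```python
from math import prod

HOURS_PER_YEAR = 8760  # 365-day year


def serial_availability(availabilities):
    """Availability of a chain in which every component must be up.

    Assumes independent failures, which is a simplification.
    """
    return prod(availabilities)


components = {"servers": 0.99999, "storage": 0.99999, "routers": 0.9999}
combined = serial_availability(components.values())
downtime_minutes = (1 - combined) * HOURS_PER_YEAR * 60

print(f"combined availability: {combined:.7f}")
print(f"allowed downtime per year: {downtime_minutes:.1f} minutes")
```

Under these assumptions the combined figure lands just below four 9s, at roughly 63 minutes of allowed downtime per year compared with about 52.5 minutes for the routers alone.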
Finally, consider that your vendor’s definition of uptime may be measured as whether it can see that the service is up, not whether the service is available to you. Availability may be measured in a similar way. It is up to you to verify these definitions with your vendor and to get them in writing, so all parties are working from the same rulebook.
Understanding ‘the 9s’
“The 9s” is used to define availability percentages. That server has five 9s uptime, that service has four 9s uptime, my consumer DSL subscription has negative two 9s and so on. This refers to the 99-plus percent availability ranges that are common with these calculations.
There are 8,760 hours in a 365-day year. With this fact in mind, here is the math behind “the 9s” and how to convert an availability percentage into the amount of time your services might be in an “allowed down” state per your contract:
- Five 9s: 8,760 multiplied by .99999 equals 8759.9124 hours of uptime, or .0876 hours of allowed downtime per year, which translates to five minutes and 15 seconds per year.
- Four 9s: 8,760 multiplied by .9999 equals 8759.124 hours of uptime, or .876 hours of allowed downtime per year, which translates to about 52 minutes and 34 seconds per year.
- Three 9s: 8,760 multiplied by .999 equals 8751.24 hours of uptime, or 8.76 hours of allowed downtime per year, which translates to a little over eight hours and 45 minutes per year.
You can see how there’s a big difference between three 9s and five 9s – more than an entire business day’s worth of hours, which might even be used up in a single event such as a fiber cut. Again, these calculations are based on a default agreement of 24/7 availability at 99.X percent. There is more on this critical detail in the next section.
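The arithmetic above can be sketched in a few lines of Python. This assumes a 365-day year and 24/7 coverage, as the list does. The function names are hypothetical, and durations are rounded to the nearest second:

```python
HOURS_PER_YEAR = 8760  # 365-day year, 24/7 coverage


def allowed_downtime_seconds(nines, hours=HOURS_PER_YEAR):
    """Allowed downtime per year for an availability with the given number of 9s."""
    unavailability = 10 ** -nines  # e.g. 3 nines -> 0.001
    return hours * 3600 * unavailability


def format_duration(seconds):
    """Render a number of seconds as hours, minutes and seconds."""
    total = round(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"


for n in (3, 4, 5):
    print(f"{n} nines: {format_duration(allowed_downtime_seconds(n))}")
```

Passing a different `hours` value lets the same calculation run against a contract that covers fewer than 8,760 hours per year.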
Downtime and its interplay with SLA
Many companies have service-level agreements, or SLAs, with their service providers in order to provide both a level of protection against effective downtime and some sort of compensation for that effective downtime.
Most CIOs would probably agree that the SLA is mainly meant to deter downtime and ensure availability; no one really wants credit on their bills as a refund for service not provided. However, there are facets within your agreement that reveal exactly what protections you have and what your effective uptime is really computed against:
- What amounts of scheduled maintenance and downtime windows are allowed? Most service providers will include specific maintenance windows on a weekly or monthly basis to allow for patching, reconfiguration, system or router moves, or other sorts of off-hours regular administrative tasks.
They will also usually call out that these downtime windows, since they are regularly scheduled and planned for, do not count against the SLA for the purposes of compensation.
- Is your service-level agreement based on true 24/7 availability, or is it measured against some lesser standard of “full time”? A number of agreements only guarantee availability of 20/6, 12/5, or some other lesser version of a 168-hour week. It is much easier to get five 9s of uptime during business hours when you can go down as much as you want in the off hours.
Please understand: There is a reason a vendor would play this trick, and it is not just about an abundance of caution. It is a statement that a vendor does not believe it can deliver uptime, for the price you are paying and for the service you are buying, on an enterprise level. Think about that before you sign on the dotted line.
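To see how much a reduced coverage window can hide, here is a rough sketch of the worst case: the vendor meets its SLA inside the covered hours and is down for every hour outside them. The function name, the parameters and the assumption that scheduled maintenance hours are carved out of the covered window are all illustrative and not taken from any particular contract:

```python
HOURS_PER_WEEK = 168


def worst_case_true_availability(sla, hours_per_day, days_per_week,
                                 maintenance_hours_per_week=0.0):
    """Lowest 24/7 availability that still satisfies the SLA.

    Assumes the service may be down for all uncovered hours and that
    excluded maintenance windows fall inside the covered hours.
    """
    covered = hours_per_day * days_per_week - maintenance_hours_per_week
    return sla * covered / HOURS_PER_WEEK


full_time = worst_case_true_availability(0.99999, 24, 7)
business_hours = worst_case_true_availability(0.99999, 12, 5)

print(f"24/7 at five 9s: {full_time:.5%}")
print(f"12/5 at five 9s: {business_hours:.5%}")
```

Under these assumptions, a five 9s promise measured against a 12/5 window could be satisfied by a service that is reachable only about 36 percent of the calendar week.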