Nasdaq and Intermedia are among the latest firms to suffer lengthy -- and public -- service outages. Eventually, the same thing will happen to you. Here are four key lessons IT leaders can learn from others' mistakes.
Adding insult to injury, Nasdaq suffered a six-minute outage on Wednesday, Sept. 4. Though it involved the same system that was behind the larger outage, a Nasdaq statement says a “hardware memory failure in a back-end server” caused this one.
It also wasn’t a great return from the Labor Day holiday for Intermedia, one of the world’s largest providers of hosted Microsoft Exchange services. On Sept. 3, the day after a long weekend in the United States, the provider had a five-hour outage, rendering email messages inaccessible. (Full disclosure: My company hosts its email service with Intermedia.) On top of that, Intermedia’s telephone service was hosted in the same data centers that suffered the outage, rendering their help desk unreachable and making this outage much worse than it ordinarily would’ve been. It also took Intermedia hours to post messages on Twitter explaining the outage and its efforts to resolve it — and those messages pointed customers to a service status page hosted on a customer portal that no one could access because, you guessed it, the platform suffering the outage hosted it, too.
As a popular saying for politicians goes, “Don’t ever let a good crisis go to waste.” There are lessons IT leaders can learn from these companies’ very public problems. Here are four takeaways you would do well to heed.
1. Regularly Test for, and Plan for, Disasters
Disasters happen. People regularly argue that you should be more positive about your operations and your deployments. But you can be positive about this: Stuff will fail and systems will go down. It’s not a matter of if — it’s a matter of when. Understand what an outage is going to look like for you — and understand what needs to happen.
Much of this disaster planning depends on what type of service you provide. If you’re a CIO charged with maintaining email service for 100,000 employees, your disaster plan will look different from that of a technical team that serves 500,000 external customers. Understand how outages will affect different parts of your business.
Know what mitigation costs, as well as what backups cost and what standby systems cost. Investigate how cloud computing services such as Amazon Web Services and Windows Azure can make a tense outage situation a little more bearable, thanks to the ability to spin up services on demand, when you need them, and shut them down once your situation has eased.
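To make that cost comparison concrete, here is a minimal back-of-the-envelope sketch. Every figure and function name in it is a hypothetical placeholder, not a quoted price from any provider; the point is only to put the cost of an unmitigated outage, an always-on standby and on-demand capacity side by side.

```python
# All figures and names here are hypothetical; substitute your own estimates.

def expected_annual_outage_loss(outages_per_year, hours_per_outage, loss_per_hour):
    """Rough yearly cost of doing nothing: frequency x duration x hourly loss."""
    return outages_per_year * hours_per_outage * loss_per_hour


def annual_hot_standby_cost(monthly_cost):
    """Yearly cost of a standby environment that runs all the time."""
    return monthly_cost * 12


def annual_on_demand_cost(outages_per_year, hours_per_outage, hourly_rate, monthly_baseline):
    """Yearly cost of capacity spun up only during outages, plus a small
    always-on baseline (for example, stored images and replicated data)."""
    usage = outages_per_year * hours_per_outage * hourly_rate
    return usage + monthly_baseline * 12


if __name__ == "__main__":
    # Assumed scenario: two five-hour outages a year.
    print("Unmitigated loss:", expected_annual_outage_loss(2, 5, 10_000))
    print("Hot standby:", annual_hot_standby_cost(4_000))
    print("On demand:", annual_on_demand_cost(2, 5, 50, 500))
```

The model is deliberately crude: it ignores the time needed to spin services up and any loss that continues while you do. It is a starting point for the conversation, not a budget.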
Finally, put regular “mock failures” on your calendar. Walk through the scenario with everyone who would be involved should a given outage occur, and write down each person's responsibilities. Take the opportunity to engage all stakeholders without the pressure of a real outage. That way, your plan will be well-rehearsed when the inevitable does happen.
2. Isolate Your Communications From Your Service Platform
You might think eating your own dog food is a good policy. Putting your telephones, email, instant messaging and real-time communications right there in your super-fast data center, alongside the services you offer, seems to make sense.
Most of the time, it may work out well — but even a first-year junior systems administrator can see the issue with this setup. Once network connectivity is interrupted in that data center, for any reason, you’re toast. You can’t communicate. Your service is down. Customers get angry. Employees can’t work.
If you run an ecommerce site, you can’t complete orders or charge credit cards, and revenue evaporates. If customers can’t phone in an order either, though, you risk losing not only the order but the customer, too. The losses of an outage simply multiply in this scenario. In the example of the Intermedia outage, CEO Phil Koen notes that, “As our communication systems reside in the same datacenters, our ability to communicate with customers and partners was disrupted.”
That’s a quick way to watch your customers go elsewhere. That a company that prides itself on providing fault-tolerant hosted services made such a tremendous error, in both its service topology and its ability to handle an outage, boggles the mind. Don’t make this same mistake.
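One way to catch this mistake before an outage does is to write down where each communication channel is hosted and compare that list with the sites that host your service. The sketch below does just that. The function name, site names and channel names are all made up for illustration, and it assumes you can list hosting sites by hand.

```python
def channels_sharing_fate(platform_sites, channel_sites):
    """Return the communication channels hosted only at sites that also
    host the service platform.

    platform_sites: iterable of site names that host the service.
    channel_sites:  dict mapping a channel name to the sites that host it.
    A channel is flagged when every one of its sites is a platform site,
    meaning an outage that takes out the platform takes out the channel too.
    """
    platform = set(platform_sites)
    return sorted(
        name
        for name, sites in channel_sites.items()
        if set(sites) <= platform
    )


if __name__ == "__main__":
    # Hypothetical inventory for illustration only.
    platform = ["dc-east", "dc-west"]
    channels = {
        "phones": ["dc-east"],
        "email": ["dc-east", "dc-west"],
        "status page": ["external-host"],
        "chat": ["dc-east", "external-host"],
    }
    print(channels_sharing_fate(platform, channels))
```

Anything this check flags has no independent path to your customers. It does not prove the remaining channels are safe, since two differently named sites can still share a network provider or a power feed.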
3. Communicate, Communicate and Communicate
When in doubt, communicate some more. The temptation during an outage is to focus on fixing the problem with just about every resource you can muster to put on the task. Don’t forget there are other stakeholders in the issue, depending on whether your outage is internal, external or both.
If you run a service for customers, they expect, and deserve, to know what’s going on and to receive an estimated time to service restoration. (“Estimated time to service restoration,” by the way, means “half an hour” or “by noon,” not “shortly” or “as soon as possible.”) Meanwhile, if you experience an outage on an internal system, especially one that happens to be business-critical, you need to send updates to affected parties as soon as you understand that there’s an issue and then at regular, frequent intervals until the issue is resolved.
Communication can’t be an afterthought. It must be a high priority — second only to resolving the outage. Don’t make a bad situation worse by creating an information vacuum.
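If your team posts updates from a template, the rule about concrete estimates can be enforced rather than just remembered. The following is a small sketch of that idea; the function name, the message wording and the list of vague phrases are assumptions for illustration, not any particular status tool's behavior.

```python
# Phrases that do not count as an estimate (an illustrative, incomplete list).
VAGUE_ETAS = {"shortly", "soon", "as soon as possible", "asap"}


def build_status_update(summary, eta, next_update_minutes):
    """Format an outage update, refusing vague restoration estimates.

    summary:             one-sentence description of what is affected.
    eta:                 a concrete estimate such as "half an hour" or "by noon".
    next_update_minutes: when the next update will be sent.
    """
    cleaned = eta.strip().lower().rstrip(".")
    if not cleaned or cleaned in VAGUE_ETAS:
        raise ValueError(f"ETA {eta!r} is too vague; give a time or a duration")
    if next_update_minutes <= 0:
        raise ValueError("next_update_minutes must be positive")
    return (
        f"{summary} Estimated restoration: {eta.strip()}. "
        f"Next update in {next_update_minutes} minutes."
    )


if __name__ == "__main__":
    print(build_status_update("Email is unavailable.", "by noon", 30))
```

Committing to the time of the next update in every message is a design choice here: it keeps the regular cadence visible even when the estimate itself has to change.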
4. Run Off Your Backups Every Once in a While
One problem with having hot standby servers sitting around is that they rarely get an actual workout. It’s usually only under stress that those systems are used. Hardware problems and software quirks get magnified when largely unused systems suddenly take on a failover load.
Backup systems rarely share the same specifications as the primary systems that traditionally run a load. Many backup systems are more lightly equipped because they won’t be used very often. These decisions often have a way of coming back to haunt you, though.
One way around this is to regularly use your backup systems as production systems. Schedule times to move your regular load to your backup systems. Use them often enough that you’re confident in their ability to carry that load should something go wrong with your primary systems.
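A schedule like this is easy to let slip, so it helps to track when each backup system last carried production load. Here is a minimal sketch of such a check. The system names, the function name and the 90-day threshold are assumptions chosen for illustration; pick an interval that fits your own risk tolerance.

```python
from datetime import date


def overdue_failover_drills(last_drill, today, max_age_days=90):
    """Return backup systems that have not carried production load recently.

    last_drill:   dict mapping a system name to the date it last ran the
                  production load, or None if it never has.
    today:        the date to measure against.
    max_age_days: how long a system may go without a drill (assumed default).
    """
    overdue = []
    for system, drilled_on in last_drill.items():
        if drilled_on is None or (today - drilled_on).days > max_age_days:
            overdue.append(system)
    return sorted(overdue)


if __name__ == "__main__":
    # Hypothetical drill log for illustration only.
    log = {
        "mail-standby": date(2013, 8, 1),
        "db-standby": date(2013, 1, 15),
        "web-standby": None,
    }
    print(overdue_failover_drills(log, date(2013, 9, 10)))
```

A system that has never been drilled is treated as overdue, on the reasoning that an untested standby is the one most likely to surprise you.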
For a CIO, a system outage on your watch is one of the worst ways to earn your company publicity. With proper time and attention, though, you can at least make sure that, when outages attack, you’re prepared, confident and responsive so that a bad situation isn’t made worse.