After British Airways’ major IT systems failure, I asked members of the #CIOChat on Twitter what CIOs think could be done to better protect organizations from these types of disasters. Chat members started by confirming that it is very hard to quickly bring large-scale, high-volume systems with legacy applications and infrastructure back up. Isaac Sacolick, former CIO for Greenwich Associates and current CIO.com Contributor, said it is very hard to get “N+1 high availability, geographic redundancy, transaction level data replication plus practice failover and outage scenarios regularly.” With this in mind, CIOs made several concrete suggestions that should be valuable to any organization trying to avoid crippling IT outages.
Establishing disaster recovery and business continuity plans
CIOs say there is always a struggle when deciding whether to invest in improving legacy systems or in implementing new, innovative technology. It is a tough call, they say, and disaster recovery and business continuity plans often end up at the bottom of the budget priorities list. At the same time, they assert that IT cannot be the endpoint here. Business continuity means training the entire business to respond.
Surprisingly, CIOs say that not all organizations have backup and disaster recovery plans. For this reason, they believe that the starting point for many organizations is getting a disaster recovery and business continuity plan in place. CIOs consider it essential that the business have contingency plans for restoring service after a disruption. It is also important to have realistic Service Level Agreements (SLAs) that, ideally, form the basis of an overarching plan. And the plan itself should be based on the enterprise’s institutional risk stance and business continuity needs.
In terms of backup, one IT leader suggested that “anything less than a 100 percent service backup isn’t disaster recovery, it is disaster coping.” Importantly, CIOs suggest that the plan needs to recognize that every change to applications or infrastructure can change your disaster recovery (DR)/business continuity (BC) posture. In fact, the impact upon DR/BC should be factored into change management. At the same time, CIOs suggest that plans need to explicitly recognize that backup and restore is often slow but “gets geometrically slower when you have to restore Y to bring up X and Z for Y and so on.”
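The point about chained restores can be made concrete with a small sketch. The system names X, Y and Z come from the quote above; the restore durations and the `restore_plan` helper are hypothetical, made up purely for illustration. The sketch orders restores so that dependencies come first and shows how long each system waits before it is usable.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each system lists what must be restored first.
# X needs Y, and Y needs Z, as in the quote above.
depends_on = {
    "X": {"Y"},
    "Y": {"Z"},
    "Z": set(),
}

# Hypothetical restore durations in minutes (illustrative numbers only).
restore_minutes = {"X": 30, "Y": 45, "Z": 60}


def restore_plan(depends_on, restore_minutes):
    """Return a dependency-safe restore order and the minute each system is ready."""
    order = list(TopologicalSorter(depends_on).static_order())
    ready_at = {}
    for system in order:
        # A restore cannot start until every dependency is already back up.
        start = max((ready_at[d] for d in depends_on.get(system, ())), default=0)
        ready_at[system] = start + restore_minutes[system]
    return order, ready_at


order, ready_at = restore_plan(depends_on, restore_minutes)
print(order)     # dependencies first
print(ready_at)  # minutes until each system is usable
```

Under these assumed numbers, X takes only 30 minutes to restore on its own, yet it is not usable until the whole chain beneath it has been restored. Even a rough map like this makes the hidden cost of dependencies visible when setting recovery targets.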
Managing business continuity involves testing early and often
CIOs argue for continuous testing of plans. Peter Salvitti, CTO for Boston College, says that “there is no way around this: test, test, and more test! Hard work? Yeah. Needed? Yeah! Why would you risk IT?” CIOs believe practicing failover and outage scenarios regularly is required. You also need to test backups regularly; without that validation, the money invested in them may be wasted. CIOs say you need to simulate disasters and have employees take part in them at least once a year. They argue that the only way to know how well your organization will respond to an outage is to practice. And just as the President needs to practice a snap count for a nuclear disaster, you need to practice an IT disaster as if it were real. You need to test on a regular basis to validate that your process and recovery work. This cannot be a one-and-done exercise.
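One small piece of backup testing can be automated: checking that a restored file is byte-for-byte identical to the original. The following is a minimal sketch, not a full restore test. The function names `sha256_of` and `verify_restore` are hypothetical, and the sketch assumes that both the original and the restored copy are reachable as ordinary files.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(original, restored):
    """Return True only if the restored file matches the original exactly."""
    return sha256_of(original) == sha256_of(restored)
```

A check like this only proves that the data came back intact. It says nothing about whether the application starts, how long the restore took, or whether staff knew what to do, which is why CIOs also insist on full exercises.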
Learning from failure
CIOs say that you should learn from a crisis and adjust accordingly. At the same time, you need to realize that every catastrophe is different. Backup and failover are not enough. Your plans need to account for those mechanisms failing too, at the very moment you need them. Otherwise, you do not have a realistic plan.
Clearly, incidents like the ones that crippled British Airways, and Delta Air Lines before it, need to be planned for. CIOs need to plan for technology disasters, and they need to conduct tests regularly to minimize the chance of failure. Otherwise, organizations cannot build up the lessons they need to limit the impact a serious systems outage will have on their business and reputation.