For most of its existence, IT resilience has focused on uptime, making sure systems don’t go down, and if they do, bringing them back online as quickly as possible.
But that is only part of the equation in this modern digital era. Today IT resilience means much more.
Consider, for example, Brad Stone’s take on it. As CIO for Booz Allen Hamilton, Stone says he thinks of resilience in two dimensions: One is about enabling the business without interruption; the second is about having the ability to adjust, deal with change, and handle the unexpected.
Moreover, Stone says, resilience now means doing all that while continually delivering the experience users expect.
“Ten years ago, if there was an outage, they’d get past it. But users and business leaders today expect technology to always work and to be an amazing experience; the expectations are so much higher now because IT is such an enabler, it has taken on more importance,” he says. “Users might not demand perfection but their standards are really, really high.”
That in turn has prompted a more expansive approach to ensuring IT resilience today. Here experts and IT leaders offer seven best practices CIOs should take on to ensure they meet current expectations for resiliency.
1. Align to business needs
Ron Brown, director of business resilience for GuidePoint Security, an advisory and services firm, defines IT resilience as making sure technology is always available — even as he acknowledges that such perfection isn’t likely.
“You do have to plan for the fact that things will go out at some point,” he says.
CIOs can best prepare for that inevitability by being clear on what systems matter most to the business; that clarity lets IT know what to focus on first during any sort of outage, he says.
“The first thing you have to do without a doubt is be in alignment with the business, what they need and what they are willing to pay for [to get] what they expect,” Brown says, noting that a business impact analysis can help IT and business get this alignment. “And once you have that understanding of what the requirements are for the business, then it’s about how do you map out the services and capabilities you have and which apps are used by which groups so if something goes wrong you know where to put your priorities to get them back up.”
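The mapping Brown describes can be pictured as a small lookup table. The sketch below is illustrative only: the application names, business groups, and criticality tiers are hypothetical stand-ins for what a real business impact analysis would produce, and the `restore_order` helper is an assumption, not something Brown prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class App:
    name: str
    business_groups: tuple  # groups that depend on this app (hypothetical)
    criticality: int        # 1 = most critical, as agreed with the business


def restore_order(apps, affected):
    """Return the names of affected apps, most critical first."""
    down = [a for a in apps if a.name in affected]
    # Sort by tier, then by name so ties come out in a stable order.
    return [a.name for a in sorted(down, key=lambda a: (a.criticality, a.name))]


# Hypothetical inventory; real tiers would come from the business impact analysis.
apps = [
    App("payroll", ("finance",), 2),
    App("order-entry", ("sales", "support"), 1),
    App("wiki", ("engineering",), 3),
]

print(restore_order(apps, {"wiki", "order-entry", "payroll"}))
```

Keeping the business groups next to each application also tells responders whom to notify while restoration is under way.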
2. Break down silos
Richard Caralli, a former CISO now working as a senior advisor for Axio Global, a cyberrisk management company, says he sees resilience as “an emergent property that extends from managing operational risk.”
To do that well, IT operations and cybersecurity should be working with leaders overseeing business continuity/disaster recovery planning. That, however, doesn’t always happen, Caralli says.
“These activities tend to be siloed such that each discipline operates on different risk assumptions and scenarios, when in fact they must converge and work collaboratively,” he says.
For example, Caralli says an organization’s cybersecurity team may be focusing on creating a stellar defense-in-depth strategy to best ensure it can prevent intrusions, detect them if they happen, and respond when they do. But the team may not be as strong in planning for getting “back to normal operating conditions as quickly as possible with the least amount of consequences” if cybersecurity isn’t working closely with risk and IT, Caralli says.
“If they’re not all talking together, they might be planning or quantifying for different risks,” he adds. “They have to plan and run scenarios together. If you look at risk from an impact side and can envision what kind of consequences might occur, you can start to quantify the risk and you can then know where to spend the next dollar, whether to put it on the prevention side or to spend on practices that will reduce the impact.”
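One simple way to picture the "next dollar" decision Caralli describes is to compare how much expected loss each option removes per dollar spent. The sketch below assumes the common simplification that expected loss equals likelihood times impact. Every figure and option name in it is invented for illustration and does not come from Caralli or Axio.

```python
def expected_loss(probability, impact):
    """Expected annual loss: likelihood of the event times its cost."""
    return probability * impact


def reduction_per_dollar(baseline, after, cost):
    """Expected loss removed for each dollar spent on a control."""
    return (baseline - after) / cost


# Hypothetical scenario: a 10% annual chance of a $2M outage.
baseline = expected_loss(0.10, 2_000_000)

# Hypothetical options: (expected loss after the spend, cost of the spend).
options = {
    "prevention": (expected_loss(0.05, 2_000_000), 50_000),      # halves likelihood
    "impact reduction": (expected_loss(0.10, 500_000), 60_000),  # faster recovery
}

scores = {
    name: reduction_per_dollar(baseline, after, cost)
    for name, (after, cost) in options.items()
}
best = max(scores, key=scores.get)
print(best)
```

With these made-up numbers, spending on recovery removes more expected loss per dollar than spending on prevention. Different assumptions would reverse the answer, which is why the teams need to agree on the scenarios first.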
3. Mature your metrics
As IT resilience has evolved, Jorge Machado, a partner at management consulting firm McKinsey & Co., says CIOs should adjust the metrics they use to measure and manage operations to ensure they’re meeting the right objectives.
“Traditionally if we go back a decade it would be about uptime, availability of applications, and mean time to restore,” Machado says. “But nowadays, as apps become more microservices-oriented and we move away from monolith systems, we need to measure in a more nuanced way.”
He and his colleague Arun Gundurao, a McKinsey associate partner, suggest measurements focused on the ability to perform critical transactions, such as those measuring failures in customer interactions, application experience from the user perspective, or service level objectives.
“It’s what does the business care about around this application or this customer journey,” Gundurao says. “You want to measure what the business wants to measure.”
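As a rough illustration of a transaction-focused measure, the sketch below tracks a service level objective for one critical customer transaction and reports how much of its error budget remains. The 99.9% target and the request counts are hypothetical, and the function names are assumptions made for this example rather than anything McKinsey prescribes.

```python
def success_rate(successes, total):
    """Share of attempts at a critical transaction that succeeded."""
    return successes / total


def error_budget_remaining(slo_target, successes, total):
    """Failures still allowed under the objective, minus failures so far."""
    allowed_failures = (1 - slo_target) * total
    actual_failures = total - successes
    return allowed_failures - actual_failures


# Hypothetical month of checkout attempts measured against a 99.9% objective.
total, successes, slo_target = 100_000, 99_950, 0.999

rate = success_rate(successes, total)
remaining = error_budget_remaining(slo_target, successes, total)

print(f"checkout success rate: {rate:.3%}")
print(f"failures left in budget: {remaining:.0f}")
print("objective met" if rate >= slo_target else "objective missed")
```

A measure like this stays meaningful even when individual microservices report healthy uptime, because it counts what the customer actually experienced.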
4. Train, test, and practice
In Stone’s opinion, resilience means successfully handling unexpected circumstances. To do that, Stone makes sure his IT department is prepared. That means training, testing, and practicing with tabletop exercises and simulations.
“It’s running exercises, taking down a cluster and not telling [everyone] and seeing how people respond. It’s almost like a live-fire simulation. You have to do that carefully, at the right time, but it has to be part of your cadence,” he says. “You have to have those standard operating procedures, go through them and refine those. You have to be willing to make your staff uncomfortable, challenge them. It gives them some camaraderie because they know they can get through things.”
Stone says such exercises give CIOs and their managers an opportunity to build confidence in processes that work well and build muscle memory, as well as identify weaknesses — such as a lack of redundancy in workers trained in key technologies or a lack of backup procedures should a particular application fail.
5. Architect resiliency
IT advisors stress that it’s important to build resiliency into the architecture itself by, for example, distributing instances and payloads across geographical locations.
One way to ensure resilient systems is to “simplify what you do so you can do it really well to meet expectations,” Stone says, noting that such an approach also helps keep teams from getting overextended.
Mixing in automation for incident, problem, and change management also helps build resiliency, he adds.
Gundurao recommends adopting site reliability engineering (SRE), a set of principles and practices for infrastructure and operations aimed at creating scalable, reliable systems. SRE — and those trained in its principles — focuses on building IT not just to work well in blue skies but to work through stormy skies, Machado adds.
Andrew Long, global enterprise architecture lead at Accenture, sees large traditional organizations increasingly adopting the principles, technologies, and methods used by digital-native organizations to architect more resilient IT systems. “This has enabled the business to improve its resilience to disruptive business events, and therefore become more competitive,” he says.
To do so, IT leaders are emphasizing speed and agility, data centricity, and decentralization, as well as continuous integration and delivery, SRE, and microservices “to deliver the business capabilities the future organization requires … in a more modular and composable way,” Long says.
They are also shifting from traditional waterfall-based IT project delivery to “more product-centric IT delivery and operations, which tends to consider broader, more strategic requirements that support IT resilience,” he adds.
“Almost all organizations have some part of the IT estate in the cloud,” Long says, but the key is “to consider what unique cloud capabilities can be leveraged to increase the organization’s ability to become more agile and resilient.”
6. Stay vigilant
Organizational risks, business needs, and technology will all continue to evolve, and so should practices around IT resiliency, experts say.
“Engage with the business to understand where they see the risks of business disruption, the scale of the risk, and crucially, how they quantify this risk and therefore the potential value,” Long says. By having a clear understanding of the current state of your technology landscape, you can better understand how your organization can respond to this disruption, and where the critical risk areas reside.
“Confirm the specific interventions that need to be made to minimize the risk, and develop a roadmap to deliver change,” Long says, adding that the execution of this roadmap is possible only “if everyone is aligned on the business risk.”
7. Let business share in the accountability
The business side also has a role to play in IT resiliency, says Machado, so business unit leaders should have some accountability for it as well.
“I do think you have to have an accountability model, and we do think it should be shared with the business,” he explains, “so whoever builds the app should share responsibility for it. It should not just be the role of the CIO.”
Machado is not advocating for business units to take over IT operations and day-to-day management of apps and systems; rather, he says they should understand that their requirements and priorities can impact resiliency.
For example, if business unit leaders constantly prioritize time to market and speed to value creation, then they need to share accountability for whether and by how much that could affect resilience.