How to create a company culture that can weather failure

In technology, things go wrong all the time, sometimes catastrophically. But if you stop paying attention after you fix the immediate problem, you’re missing out on the benefit of learning from experience.

Your competitors might or might not share the details of incidents they’ve faced and fixed (formally or informally), but you can also watch organizations with a similar technology setup and risk profile in other industries. Security vendors often blog step-by-step analyses of incidents. Nather also recommends a Twitter account @badthingsdaily that comes up with scenarios regularly: “Your partner database just went down, a tornado just destroyed your backup data centers. You can take them and talk them through. You can even go through the exercise of building the tool or doing the scripting to be able to automate the detection so that’s one less thing your people have to worry about doing manually.”

These tabletop exercises can be more palatable than the ‘chaos monkey’ approach pioneered by Netflix to simulate failure by deliberately shutting down some systems. “For less mature organizations, actually breaking something is a real concern, which is why even talking through it without actually doing anything can be very useful.”
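As one way to picture the "doing the scripting" step Nather mentions, here is a minimal sketch of automated detection for the "partner database just went down" scenario. It only checks whether a TCP port accepts connections. The target name, host and port are hypothetical placeholders, and a real check would likely also need to verify that the service answers queries correctly.

```python
import socket


def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def unreachable(targets, timeout=2.0):
    """Return the names of targets that did not accept a connection.

    `targets` maps a label to a (host, port) pair, for example
    {"partner-db": ("db.partner.example", 5432)}  # hypothetical values
    """
    return [
        name
        for name, (host, port) in targets.items()
        if not is_reachable(host, port, timeout)
    ]
```

Run on a schedule, a script like this could page someone when the list is non-empty, so that spotting the outage no longer depends on a person noticing it.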

Do have processes that take into account that people get tired

Many incident reports include a phrase like “it was now three o’clock in the morning” followed by a decision that actually prolonged the problem, but Lambert points out that “being late at night doesn’t change the frequency of alerts.”

“Incidents caused by failures of machines and networks are not more frequent out of hours, but they are harder to respond to.” For one thing, during the day there are more people around to spot problems sooner. For another, unless you have dedicated support staff working shifts, “the person who has to deal with it has to get paged, they might be tired or distracted.”

When you look at what you can learn from an incident, look at what information is available to the people working on the problem and how quickly they can get it, so you can develop clear guidelines to avoid compounding the problem due to stress, confusion or fatigue.

“What can go wrong in high pressure situations is that people can essentially lose sight of the goal of fixing the problem,” warns Lambert. “You can also lose a lot of context and focus by having too many sources of information so we’ve learned to be very targeted about the information you pick.”

To avoid late-night confusion, Nather suggests that “it’s good to train until it becomes a reflex so you don’t have to think so hard about who you’re supposed to call; it comes more automatically.”

Don’t ignore technical debt

Technical debt can be the reason you fall prey to ransomware, or it can just make key processes slower and less efficient.

“Assess your assets for business criticality, level of non-compliance with security hygiene, cost to remediate, and risk to the business if the asset is compromised, and develop lower cost, lower risk mitigations while you work on the most complex infrastructure renovations,” advises Luta Security CEO Katie Moussouris. “Then develop a plan to keep the org healthy on an ongoing basis and make sure this plan itself is also reviewed for relevance and adjusted. Much of the technical debt that built up in the first place was due to an incorrect notion that whatever is working on the network shouldn’t be touched in case it breaks.”
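The four factors Moussouris lists can be turned into a rough ranking. The sketch below is illustrative only: the 1-to-5 scales, the field names and the scoring formula (exposure divided by remediation cost) are assumptions made for this example, not a method she prescribes.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    name: str
    criticality: int          # business criticality, 1 (low) to 5 (high)
    noncompliance: int        # distance from security hygiene, 1 to 5
    risk_if_compromised: int  # impact on the business, 1 to 5
    remediation_cost: int     # relative cost to fix, 1 (cheap) to 5 (costly)


def priority(asset):
    """Hypothetical score: higher exposure and lower cost rank first."""
    exposure = asset.criticality * asset.noncompliance * asset.risk_if_compromised
    return exposure / asset.remediation_cost


def rank(assets):
    """Return assets ordered from highest to lowest priority score."""
    return sorted(assets, key=priority, reverse=True)
```

Assets that score high on exposure but are costly to fix sink down this list, and those are the candidates for the cheaper interim mitigations she describes while the larger renovation is planned.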

Do use all your resources

There are plenty of templates for incident response, though fewer that cover how to learn from incidents. Etsy’s Morgue tracker is open source and the company has also published an excellent debriefing facilitation guide.

The learning review process is as much about communications as technology. “Business executive coaches who normally tackle lines of communications within the organization can address this area as well; not the technical aspects of where you need to pull information from, but what you do with it afterwards,” says Nather.

Do spread the word — inside and out

Part of making sure the knowledge you can gain from an incident is applied is passing on what you’ve learned.

“Make sure the resulting lessons are simply explained and made available for the entire organization to learn from,” says Hinchcliffe. “It's this last part that is frequently omitted and can doom organizations to proverbially relive IT history over and over again.”

You also want to share the lessons beyond your organization, suggests Nather. There are formal organizations like the Information Technology Information Sharing and Analysis Center, as well as similar organizations for financial services, oil and gas, healthcare, automotive, retail and legal, and plenty of informal routes for sharing intelligence. There’s value in supporting and formalizing those informal routes, Nather adds.

“If you have meeting space to offer for these folks to get together and talk, by virtue of being the leader who organizes it you immediately improve the standing of your own organization.”

Instead of treating failure as a threat to your reputation, sharing information says that you’re mature enough to cope with problems and learn from them — and that’s the culture you need to encourage.

Copyright © 2017 IDG Communications, Inc.
