Do you change processes after handling an incident, or do you just carry on and wait for the next problem? Instead of dealing with individual failures, think about creating a culture in your IT department that can not only handle problems but truly learn from them.
Cloud providers are routinely better at learning from failure than most enterprises — because they have to be. It’s critical that they are transparent about failures to keep the trust of their customers, but it also hits the bottom line if they take too long to solve problems. When AWS, Google, Azure or GitHub has a major outage, you’ll see regular updates, and once the problem has been fixed, a public incident response will cover what changes are being made to make sure the same thing doesn’t happen again.
For example, when an engineer at GitLab accidentally deleted the production database earlier this year (while trying to recover from load issues caused by spammers), the service was down for several hours. Worse still, nearly all of the backup tools GitLab was using turned out not to have been creating backups, and six hours of production data across some 5,000 projects was lost. The engineers documented what was happening in real time (on Twitter, YouTube and in a shared public document), followed by a blog post with the key details and a full post-mortem. This explained not just the sequence of what went wrong but also the misconfigurations and other complications that resulted in having no up-to-date backups, giving them a clear list of the ongoing changes that needed to be made.
Or consider Target’s data breach, which did a lot more damage to the company. After discovering at the peak of the 2013 Christmas shopping rush that hackers had installed malware in their credit card terminals, the retailer found that the details of around 40 million debit and credit cards had been stolen, as well as names, addresses and phone numbers for up to 70 million customers. The data breach cost the business over $100 million in settlements with banks, Visa and a federal class action suit, and Target CEO Gregg Steinhafel resigned in 2014.
Fast forward three years, though, and “Target has become a role model for other retailers,” Wendy Nather, former CISO and principal security strategist at security firm Duo, told CIO.com.
“They made a huge turnaround after their breach; they really built up their security program to the point where they really have a lot of transparency. They host security events. They were one of the organizations that helped found R-CISC, the Retail Cyber Intelligence Sharing Center. They really have led the charge to start exchanging threat intelligence amongst retailers.”
That puts Target in a much better place than if it had only fixed the immediate problems and then stopped. “Other organizations that have been breached have circled the wagons. Their attorneys didn’t let them say anything, they’re not learning from the breach, they’re not changing their spending on security and it’s very clear they will fall to the same kind of breach again later.”
The difference is as much culture as technology, Nather said. “It’s all in how they responded and made something positive come out of what was a terrible situation.”
If you want to turn problems into learning experiences, there are some key do’s and don’ts.
Do follow up
Adding this step to your playbook of what to do when things go wrong may seem obvious, but you have to follow up on an incident so you can learn from it.
“Schedule a formal review of the incident and identify next steps,” says Stephen Burgess, consultant at the Uptime Institute. He suggests having regular meetings designed to track incidents to a final resolution, to make sure the longer term changes actually happen.
“From the root cause should come any formalized lessons learned, which in turn must clearly identify whether there are any final corrective actions. Maintain scrutiny and open status of the failure incident until there is managerial confirmation that final corrective actions have been performed.” That might mean training, changing policies, processes and procedures, or making proactive repairs and infrastructure upgrades.
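As a minimal sketch of the "keep it open until confirmed" rule Burgess describes, the record below refuses to close an incident until a root cause is recorded and every corrective action has been both completed and confirmed by a manager. The class and field names are hypothetical, chosen only for illustration; they are not taken from any particular tracking tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CorrectiveAction:
    description: str
    done: bool = False
    confirmed_by: str = ""  # hypothetical field: manager who verified the work


@dataclass
class Incident:
    title: str
    root_cause: str = ""
    actions: List[CorrectiveAction] = field(default_factory=list)

    def open_items(self) -> List[str]:
        """List everything that still blocks closure of this incident."""
        items = []
        if not self.root_cause:
            items.append("root cause not recorded")
        for action in self.actions:
            if not action.done:
                items.append("not done: " + action.description)
            elif not action.confirmed_by:
                items.append("awaiting confirmation: " + action.description)
        return items

    def can_close(self) -> bool:
        # The incident stays open while any blocking item remains.
        return not self.open_items()
```

A regular review meeting could simply walk through `open_items()` for each incident that is still open.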
Sam Lambert, senior director of infrastructure at GitHub, suggests that IT could learn from other disciplines. “Other industries that build things and build things to last and want to learn from failures in things they build, carry out investigations as standard operating procedure. Look at flight investigations and how useful they’ve been for aviation safety.”
View failures as a chance to get ahead of similar potential problems, Lambert says. “If a failure case comes up and we recognize that failure case could be systemic in some other system, analyzing it gives us an opportunity to look at what may go wrong in the future.”
He points to several areas where GitHub has been able to go beyond fixing the immediate problem to improving their systems generally. “We’ve learned about cause and effect: one service going wrong can affect other services even when they’re not the cause of the problem. We’ve learned ways to build in safeguards and do checking in our development process. We’ve learned to respect the time necessary to make systems resilient the first time. We’ve also learned that some things can’t be prevented and you’ve just got to accept that and understand that you have to learn from them each time.”
Don’t play the blame game
Whether it’s external problems or an increasing willingness to try “more risky fare like fail-fast experimentation, open hackathons, and citizen developer programs,” CIOs are even more likely to face major IT failures, Dion Hinchcliffe, VP and principal analyst at Constellation Research, told CIO.com.
“The first step is to prepare for failures with solid contingency plans, but it’s also key to learn from failure through an honest and open, blame-free process.”
He admits that “this can be hard for IT for practical reasons — given the already maximized work schedules — as well as human ones: A hit to morale can occur when really digging into the root cause of failures and observing dysfunction.”
If the investigation focuses on assigning blame rather than understanding the systemic failures that led to the incident, you won’t make staff feel safe enough to share information, suggest solutions, warn you about possible issues or absorb the lessons of the incident.
To help avoid blame, Nather suggests “not looking backwards and rehashing it and saying ‘If only this had happened…’ It’s better to say, ‘If we assume this could happen again, how could we respond better this time?’” Not only does that remove the notion of finding fault, but it’s also more realistic. “Everyone would like to look at an incident and say, ‘We’ll never have that happen again,’ but you can’t really say that!”
Rather than assigning blame, Lambert recommends understanding the reasoning behind decisions. “Often, doing dumb stuff is about not having time to do good stuff. People make trade-offs that they’re not necessarily happy with but sometimes you just have to do that. Sit down with the person who made those trade-offs and ask them why. What were the pressures, what was the information they had that made these trade-offs make sense?”
Don’t call it a post mortem
Although the term “blameless post-mortem” is common — popularized by companies like Etsy, whose tracker for the process is called Morgue — Nather suggests picking a friendlier phrase. “If you call it a post-mortem that sounds so terribly morbid! The term we use is an after action report. We try to make it a very positive thing, rather than thinking of it as ‘having survived the battle we will now count our wounded and dead’.”
Don’t call it human error
When British Airways had to cancel all flights from Gatwick and Heathrow airports over a bank holiday weekend this May, it blamed the IT failure that stranded some 75,000 travellers on human error. A contractor appears to have turned the uninterruptible power supply off, and the power surge when it was turned back on damaged systems in its data center. BA promised an independent investigation, but its initial explanation raised questions over the design of both the power and backup systems.
By contrast, when an engineer mistyped a command that took down the AWS S3 service — and many other services that depended on it, like Quora and file sharing in Slack — for several hours, Amazon’s explanation avoided the phrase “human error” and concentrated on explaining the flaws in the tools and process that allowed the mistake to be made.
Lambert maintains that “human error doesn’t really exist. Providing that you hire good people who want to do the right thing, they will usually do the right thing. It’s rare that you can say a person discarded all the good information they had and just did what they wanted and that’s why we had this issue.”
The real problem is tools and processes that don’t prevent (or at least issue warnings about) the inevitable mistakes people make, or the lack of automation that means someone is typing in the first place.
“It’s a lazy approach to say people did the wrong thing,” says Lambert. “A better approach is to assume that everyone did the right thing with the information they had, so you need to take away the blame and look at what information they had at each stage and what was missing, and what additional tools or processes you need to get better next time.”
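To make the point about tooling concrete, here is a minimal sketch of a guardrail that sits in front of a destructive operation. It is purely illustrative and is not the tooling of any company mentioned in this article; the `Fleet` type, the 10 percent threshold and the retype-the-name confirmation are all assumptions. The idea is that a mistyped number is rejected or questioned by the tool instead of being executed.

```python
from dataclasses import dataclass


class GuardrailError(Exception):
    """Raised when a requested change is refused by a safety check."""


@dataclass
class Fleet:
    name: str
    total_servers: int
    min_servers: int  # hypothetical floor needed to keep the service healthy


def plan_removal(fleet: Fleet, count: int, confirm_name: str = "") -> int:
    """Validate a request to remove servers and return how many would remain."""
    if count <= 0:
        raise GuardrailError("count must be a positive number of servers")

    remaining = fleet.total_servers - count
    if remaining < fleet.min_servers:
        # Hard stop: never drop below the minimum safe capacity.
        raise GuardrailError(
            f"refusing: {remaining} servers left, minimum is {fleet.min_servers}"
        )

    # Assumed policy: removing more than 10% at once needs explicit confirmation,
    # given by retyping the fleet name.
    if count > fleet.total_servers * 0.10 and confirm_name != fleet.name:
        raise GuardrailError(
            f"removing {count} of {fleet.total_servers} servers is a large change; "
            f"retype the fleet name '{fleet.name}' to confirm"
        )
    return remaining
```

With a 100-server fleet and a floor of 60, removing 5 servers passes, removing 20 requires the operator to retype the fleet name, and removing 50 is refused outright.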
Do reward reporting
The tale of a developer fired for confessing to deleting the production database on day one at a new job may be apocryphal, but the account on Reddit was certainly plausible and led many to point out that the fault lay not with the new developer, but with the documentation that included the details of the production database in a training exercise.
In contrast, when a chemistry student at the University of Bristol in the UK accidentally made an explosive and reported it, even though the emergency services had to carry out a controlled detonation, the dean of the Faculty of Science, Timothy C. Gallagher, praised the student for acting responsibly. He pointed out “the value of investing in developing and fostering a culture in which colleagues recognise errors and misjudgements, and they are supported to report near misses.”
In the airline industry, the International Confidential Aviation Safety Systems Group collects confidential, anonymous reports of near misses, cabin fires, maintenance and air traffic control problems to encourage full disclosure of problems. Similarly, when the US Forest Service conducts Learning Reviews after serious fires, the results can be used only for preventing accidents, not for legal or disciplinary action.
You want your team to feel safe enough to report the problems that haven’t yet led to a failure.
“Whether formalized in a policy or not, the team must be well aware that mistakes are tolerated, but concealment and cover-up are not,” says Burgess. “Personnel must clearly understand they will never be penalized for volunteering any and all information regarding any failure.”
“Part of your responsibility as a CIO is to build these relationships,” explains Nather. “The system admins should be your eyes and ears. You want to have the culture where someone will come into your office and close the door and say, ‘There’s something I think you ought to know.’ If you can get that, you can build a resilient organization.”
Treating IT and security as being a business service rather than a point of control helps create that kind of culture. “If you take the attitude that you’re there to help everyone else with their business, that’s very different from sitting in an ivory tower and saying, ‘Ooh you did something wrong, you missed a spot’,” she says.
Do learn from others’ mistakes
Thinking about what you’d do differently the next time a problem occurs is useful, but you can also think about how you’d tackle problems you haven’t run into yet.
“What I see in very mature organizations is that they also try to learn from other people’s incidents,” says Nather. “Ask, ‘If that were to happen to us, what would it look like, how could we detect it and how could we respond to it?’”
Your competitors might or might not share the details of incidents they’ve faced and fixed (formally or informally), but you can also watch organizations with a similar technology setup and risk profile in other industries. Security vendors often blog step-by-step analyses of incidents. Nather also recommends a Twitter account @badthingsdaily that comes up with scenarios regularly: “Your partner database just went down, a tornado just destroyed your backup data centers. You can take them and talk them through. You can even go through the exercise of building the tool or doing the scripting to be able to automate the detection so that’s one less thing your people have to worry about doing manually.”
These tabletop exercises can be more palatable than the ‘chaos monkey’ approach pioneered by Netflix to simulate failure by deliberately shutting down some systems. “For less mature organizations, actually breaking something is a real concern, which is why even talking through it without actually doing anything can be very useful.”
Do have processes that take into account that people get tired
Many incident reports include a phrase like “it was now three o’clock in the morning” followed by a decision that actually prolonged the problem, but Lambert points out that “being late at night doesn’t change the frequency of alerts.”
“Incidents caused by failures of machines and networks are not more frequent out of hours, but they are harder to respond to.” For one thing, during the day there are more people around to spot problems sooner. For another, unless you have dedicated support staff working shifts, “the person who has to deal with it has to get paged, they might be tired or distracted.”
When you look at what you can learn from an incident, look at what information is available to the people working on the problem and how quickly they can get it, so you can develop clear guidelines to avoid compounding the problem due to stress, confusion or fatigue.
“What can go wrong in high pressure situations is that people can essentially lose sight of the goal of fixing the problem,” warns Lambert. “You can also lose a lot of context and focus by having too many sources of information so we’ve learned to be very targeted about the information you pick.”
To avoid late night confusion, Nather suggests that “it’s good to train until it becomes a reflex so you don’t have to think so hard about who you’re supposed to call; it comes more automatically.”
Don’t ignore technical debt
Technical debt can be the reason you fall prey to ransomware, or it can just make key processes slower and less efficient.
“Assess your assets for business criticality, level of non-compliance with security hygiene, cost to remediate, and risk to the business if the asset is compromised, and develop lower cost, lower risk mitigations while you work on the most complex infrastructure renovations,” advises Luta Security CEO Katie Moussouris. “Then develop a plan to keep the org healthy on an ongoing basis and make sure this plan itself is also reviewed for relevance and adjusted. Much of the technical debt that built up in the first place was due to an incorrect notion that whatever is working on the network shouldn’t be touched in case it breaks.”
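One way to picture the assessment Moussouris describes is a simple triage table. The sketch below is an assumption-laden illustration, not her method: the 1 to 5 scales, the multiplied exposure score and the ten person-day cutoff are all invented for the example. It ranks assets by exposure and marks expensive renovations for an interim mitigation first.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Asset:
    name: str
    criticality: int       # 1 (minor) to 5 (business critical)
    non_compliance: int    # 1 (meets hygiene baseline) to 5 (far from it)
    breach_risk: int       # 1 to 5: damage to the business if compromised
    remediation_days: int  # rough cost to fix properly, in person-days


def exposure(asset: Asset) -> int:
    # Assumed scoring: multiply the three 1-5 ratings, giving a range of 1 to 125.
    return asset.criticality * asset.non_compliance * asset.breach_risk


def triage(assets: List[Asset], cheap_fix_days: int = 10) -> List[Tuple[str, int, str]]:
    """Rank assets by exposure (highest first), cheaper fixes first on ties."""
    ranked = sorted(assets, key=lambda a: (-exposure(a), a.remediation_days))
    plan = []
    for asset in ranked:
        if asset.remediation_days <= cheap_fix_days:
            step = "remediate"
        else:
            # Expensive renovation: apply a lower-cost mitigation in the meantime.
            step = "mitigate, then renovate"
        plan.append((asset.name, exposure(asset), step))
    return plan
```

Rerunning the triage on a schedule, with updated ratings, is one way to keep the plan itself under review.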
Do use all your resources
There are plenty of templates for incident response, though fewer that cover how to learn from incidents. Etsy’s Morgue tracker is open source and the company has also published an excellent debriefing facilitation guide.
The learning review process is as much about communications as technology. “Business executive coaches who normally tackle lines of communications within the organization can address this area as well; not the technical aspects of where you need to pull information from, but what you do with it afterwards,” says Nather.
Do spread the word — inside and out
Part of making sure the knowledge you can gain from an incident is applied is passing on what you’ve learned.
“Make sure the resulting lessons are simply explained and made available for the entire organization to learn from,” says Hinchcliffe. “It’s this last part that is frequently omitted and can doom organizations to proverbially relive IT history over and over again.”
You also want to share the lessons beyond your organization, suggests Nather. There are formal organizations like the Information Technology Information Sharing and Analysis Center, as well as similar organizations for financial services, oil and gas, healthcare, automotive, retail and legal, and plenty of informal routes for sharing intelligence. There’s value in supporting and formalizing that, she adds.
“If you have meeting space to offer for these folks to get together and talk, by virtue of being the leader who organizes it you immediately improve the standing of your own organization.”
Instead of treating failure as a threat to your reputation, sharing information says that you’re mature enough to cope with problems and learn from them — and that’s the culture you need to encourage.