Do you change processes after handling an incident, or do you just carry on and wait for the next problem? Instead of dealing with individual failures, think about creating a culture in your IT department that can not only handle problems but truly learn from them.

Cloud providers are routinely better at learning from failure than most enterprises — because they have to be. It’s critical that they are transparent about failures to keep the trust of their customers, but it also hits the bottom line if they take too long to solve problems. When AWS, Google, Azure or GitHub has a major outage, you’ll see regular updates, and once the problem has been fixed, a public incident report will cover what changes are being made to make sure the same thing doesn’t happen again.

For example, when an engineer at GitLab accidentally deleted the production database earlier this year (while trying to recover from load issues caused by spammers), the service was down for several hours. Worse still, nearly all of the backup tools GitLab was using turned out not to have been creating backups, and six hours of production data across some 5,000 projects was lost. The engineers documented what was happening in real time (on Twitter, YouTube and in a shared public document), followed by a blog post with the key details and a full post-mortem. This explained not just the sequence of what went wrong but also the misconfigurations and other complications that resulted in having no up-to-date backups, giving them a clear list of the ongoing changes that needed to be made.

Or consider Target’s data breach, which did a lot more damage to the company. After discovering at the peak of the 2013 Christmas shopping rush that hackers had installed malware in its credit card terminals, the retailer found that the details of around 40 million debit and credit cards had been stolen, as well as names, addresses and phone numbers for up to 70 million customers.
The data breach cost the business over $100 million in settlements with banks, Visa and a federal class action suit, and Target CEO Gregg Steinhafel resigned in 2014.

Fast forward three years, though, and “Target has become a role model for other retailers,” Wendy Nather, former CISO and principal security strategist at security firm Duo, told CIO.com.

“They made a huge turnaround after their breach; they really built up their security program to the point where they really have a lot of transparency. They host security events. They were one of the organizations that helped found R-CISC, the Retail Cyber Intelligence Sharing Center. They really have led the charge to start exchanging threat intelligence amongst retailers.”

That puts Target in a much better place than if it had only fixed the immediate problems and then stopped. “Other organizations that have been breached have circled the wagons. Their attorneys didn’t let them say anything, they’re not learning from the breach, they’re not changing their spending on security and it’s very clear they will fall to the same kind of breach again later.”

The difference is as much culture as technology, Nather said. “It’s all in how they responded and made something positive come out of what was a terrible situation.”

If you want to turn problems into learning experiences, there are some key do’s and don’ts.

Do follow up

Adding this step to your playbook of what to do when things go wrong may seem obvious, but you have to follow up on an incident so you can learn from it.

“Schedule a formal review of the incident and identify next steps,” says Stephen Burgess, consultant at the Uptime Institute.
He suggests having regular meetings designed to track incidents to a final resolution, to make sure the longer-term changes actually happen.

“From the root cause should come any formalized lessons learned, which in turn must clearly identify whether there are any final corrective actions. Maintain scrutiny and open status of the failure incident until there is managerial confirmation that final corrective actions have been performed.” That might mean training, changing policies, processes and procedures, or making proactive repairs and infrastructure upgrades.

Sam Lambert, senior director of infrastructure at GitHub, suggests that IT could learn from other disciplines. “Other industries that build things — and build things to last, and want to learn from failures in the things they build — carry out investigations as standard operating procedure. Look at flight investigations and how useful they’ve been for aviation safety.”

View failures as a chance to get ahead of similar potential problems, Lambert says. “If a failure case comes up and we recognize that failure case could be systemic in some other system, analyzing it gives us an opportunity to look at what may go wrong in the future.”

He points to several areas where GitHub has been able to go beyond fixing the immediate problem to improving its systems generally. “We’ve learned about cause and effect: one service going wrong can affect other services even when they’re not the cause of the problem. We’ve learned ways to build in safeguards and do checking in our development process. We’ve learned to respect the time necessary to make systems resilient the first time.
We’ve also learned that some things can’t be prevented and you’ve just got to accept that and understand that you have to learn from them each time.”

Don’t play the blame game

Whether it’s external problems or an increasing willingness to try “more risky fare like fail-fast experimentation, open hackathons, and citizen developer programs,” CIOs are ever more likely to face major IT failures, Dion Hinchcliffe, VP and principal analyst at Constellation Research, told CIO.com.

“The first step is to prepare for failures with solid contingency plans, but it’s also key to learn from failure through an honest and open, blame-free process.”

He admits that “this can be hard for IT for practical reasons — given the already maximized work schedules — as well as human ones: A hit to morale can occur when really digging into the root cause of failures and observing dysfunction.”

If the investigation focuses on assigning blame rather than understanding the systemic failures that led to the incident, you won’t make staff feel safe enough to share information, suggest solutions, warn you about possible issues or absorb the lessons of the incident.

To help avoid blame, Nather suggests “not looking backwards and rehashing it and saying, ‘If only this had happened…’ It’s better to say, ‘If we assume this could happen again, how could we respond better this time?’” Not only does that remove the notion of finding fault, but it’s also more realistic. “Everyone would like to look at an incident and say, ‘We’ll never have that happen again,’ but you can’t really say that!”

Rather than assigning blame, Lambert recommends understanding the reasoning behind decisions. “Often, doing dumb stuff is about not having time to do good stuff. People make trade-offs that they’re not necessarily happy with, but sometimes you just have to do that.
Sit down with the person who made those trade-offs and ask them why. What were the pressures, and what was the information they had that made these trade-offs make sense?”

Don’t call it a post-mortem

Although the term “blameless post-mortem” is common — popularized by companies like Etsy, whose tracker for the process is called Morgue — Nather suggests picking a friendlier phrase. “If you call it a post-mortem, that sounds so terribly morbid! The term we use is an after action report. We try to make it a very positive thing, rather than thinking of it as ‘having survived the battle we will now count our wounded and dead’.”

Don’t call it human error

When British Airways had to cancel all flights from Gatwick and Heathrow airports over a bank holiday weekend this May, it blamed the IT failure that stranded some 75,000 travelers on human error. A contractor appears to have turned off the uninterruptible power supply, and the power surge when it was turned back on damaged systems in the airline’s data center. BA promised an independent investigation, but its initial explanation raised questions over the design of both the power and backup systems.

By contrast, when an engineer mistyped a command that took down the AWS S3 service — and many other services that depended on it, like Quora and file sharing in Slack — for several hours, Amazon’s explanation avoided the phrase “human error” and concentrated on explaining the flaws in the tools and processes that allowed the mistake to be made.

Lambert maintains that “human error doesn’t really exist. Provided that you hire good people who want to do the right thing, they will usually do the right thing.
It’s rare that you can say a person discarded all the good information they had and just did what they wanted, and that’s why we had this issue.”

The real problem is tools and processes that don’t prevent (or at least issue warnings about) the inevitable mistakes people make, or the lack of automation that means someone is typing commands in the first place.

“It’s a lazy approach to say people did the wrong thing,” says Lambert. “A better approach is to assume that everyone did the right thing with the information they had, so you need to take away the blame and look at what information they had at each stage and what was missing, and what additional tools or processes you need to get better next time.”

Do reward reporting

The tale of a developer fired for confessing to deleting the production database on day one at a new job may be apocryphal, but the account on Reddit was certainly plausible, and it led many to point out that the fault lay not with the new developer but with the documentation that included the details of the production database in a training exercise.

In contrast, when a chemistry student at the University of Bristol in the UK accidentally made an explosive and reported it, even though the emergency services had to carry out a controlled detonation, the dean of the Faculty of Science, Timothy C. Gallagher, praised the student for acting responsibly. He pointed out “the value of investing in developing and fostering a culture in which colleagues recognise errors and misjudgements, and they are supported to report near misses.”

In the airline industry, the International Confidential Aviation Safety Systems Group collects confidential, anonymous reports of near misses, cabin fires, and maintenance and air traffic control problems to encourage full disclosure of problems.
Similarly, when the US Forest Service conducts Learning Reviews after serious fires, the results can be used only for preventing accidents, not for legal or disciplinary action.

You want your team to feel safe enough to report the problems that haven’t yet led to a failure.

“Whether formalized in a policy or not, the team must be well aware that mistakes are tolerated, but concealment and cover-up are not,” says Burgess. “Personnel must clearly understand they will never be penalized for volunteering any and all information regarding any failure.”

“Part of your responsibility as a CIO is to build these relationships,” explains Nather. “The system admins should be your eyes and ears. You want to have the culture where someone will come into your office and close the door and say, ‘There’s something I think you ought to know.’ If you can get that, you can build a resilient organization.”

Treating IT and security as a business service rather than a point of control helps create that kind of culture. “If you take the attitude that you’re there to help everyone else with their business, that’s very different from sitting in an ivory tower and saying, ‘Ooh, you did something wrong, you missed a spot’,” she says.

Do learn from others’ mistakes

Thinking about what you’d do differently the next time a problem occurs is useful, but you can also think about how you’d tackle problems you haven’t run into yet.

“What I see in very mature organizations is that they also try to learn from other people’s incidents,” says Nather.
“Ask, ‘If that were to happen to us, what would it look like, how could we detect it and how could we respond to it?’”

Your competitors might or might not share the details of incidents they’ve faced and fixed (formally or informally), but you can also watch organizations with a similar technology setup and risk profile in other industries. Security vendors often blog step-by-step analyses of incidents. Nather also recommends the Twitter account @badthingsdaily, which regularly posts hypothetical scenarios: “Your partner database just went down; a tornado just destroyed your backup data centers. You can take them and talk them through. You can even go through the exercise of building the tool or doing the scripting to be able to automate the detection, so that’s one less thing your people have to worry about doing manually.”

These tabletop exercises can be more palatable than the ‘chaos monkey’ approach pioneered by Netflix, which simulates failure by deliberately shutting down some systems. “For less mature organizations, actually breaking something is a real concern, which is why even talking through it without actually doing anything can be very useful.”

Do have processes that take into account that people get tired

Many incident reports include a phrase like “it was now three o’clock in the morning” followed by a decision that actually prolonged the problem, but Lambert points out that “being late at night doesn’t change the frequency of alerts.”

“Incidents caused by failures of machines and networks are not more frequent out of hours, but they are harder to respond to.” For one thing, during the day there are more people around to spot problems sooner.
For another, unless you have dedicated support staff working shifts, “the person who has to deal with it has to get paged, and they might be tired or distracted.”

When you look at what you can learn from an incident, look at what information is available to the people working on the problem and how quickly they can get it, so you can develop clear guidelines to avoid compounding the problem through stress, confusion or fatigue.

“What can go wrong in high-pressure situations is that people can essentially lose sight of the goal of fixing the problem,” warns Lambert. “You can also lose a lot of context and focus by having too many sources of information, so we’ve learned to be very targeted about the information we pick.”

To avoid late-night confusion, Nather suggests that “it’s good to train until it becomes a reflex so you don’t have to think so hard about who you’re supposed to call; it comes more automatically.”

Don’t ignore technical debt

Technical debt can be the reason you fall prey to ransomware, or it can just make key processes slower and less efficient.

“Assess your assets for business criticality, level of non-compliance with security hygiene, cost to remediate, and risk to the business if the asset is compromised, and develop lower-cost, lower-risk mitigations while you work on the most complex infrastructure renovations,” advises Luta Security CEO Katie Moussouris. “Then develop a plan to keep the org healthy on an ongoing basis and make sure this plan itself is also reviewed for relevance and adjusted. Much of the technical debt that built up in the first place was due to an incorrect notion that whatever is working on the network shouldn’t be touched in case it breaks.”

Do use all your resources

There are plenty of templates for incident response, though fewer that cover how to learn from incidents.
Etsy’s Morgue tracker is open source, and the company has also published an excellent debriefing facilitation guide.

The learning review process is as much about communications as technology. “Business executive coaches who normally tackle lines of communications within the organization can address this area as well; not the technical aspects of where you need to pull information from, but what you do with it afterwards,” says Nather.

Do spread the word — inside and out

Part of making sure the knowledge you gain from an incident is applied is passing on what you’ve learned.

“Make sure the resulting lessons are simply explained and made available for the entire organization to learn from,” says Hinchcliffe. “It’s this last part that is frequently omitted and can doom organizations to proverbially relive IT history over and over again.”

You also want to share the lessons beyond your organization, suggests Nather. There are formal organizations like the Information Technology Information Sharing and Analysis Center, as well as similar organizations for financial services, oil and gas, healthcare, automotive, retail and legal, and plenty of informal routes for sharing intelligence. There’s value in supporting and formalizing that, she adds.

“If you have meeting space to offer for these folks to get together and talk, by virtue of being the leader who organizes it you immediately improve the standing of your own organization.”

Instead of treating failure as a threat to your reputation, sharing information says that you’re mature enough to cope with problems and learn from them — and that’s the culture you need to encourage.