10 Secrets to Troubleshooting Technology Problems
Years of experience have revealed some key approaches to resolving problems like network outages.
Mon, October 04, 2010
Computerworld — I recently joined my team in troubleshooting a complex infrastructure problem affecting the private cloud that hosts our community electronic health records system. The incident put me in mind of the things I have learned from such experiences over the years.
1. Once the problem is identified, ascertain the scope. Call the users and ask them what they are experiencing. Test the application or infrastructure yourself. Do not trust the monitoring tools if they indicate all is well but the users are complaining.
2. If the scope of the outage is large and the root cause is unknown, raise alarm bells early. It's far better to make an early all-hands intervention with occasional false alarms than to intervene too late and have an extended outage because of a slow response.
3. Bring visibility to the process by having hourly updates, frequent bridge calls and multiple eyes on the problem. Sometimes technical people become so focused, they do not have a sense of time passing or insight into what they do not know. A multidisciplinary approach with predetermined progress reports prevents working in isolation and the pursuit of solutions that are unlikely to succeed.
4. Although frequent progress reports are important, you must allow the technical people to do their work. Senior management feels a great deal of pressure to resolve the situation. However, if 90% of the incident response effort is spent informing senior management and managing hovering stakeholders, then the heads-down work to resolve the problem cannot get done.
5. Remember Occam's razor: The simplest explanation is usually the correct one. In our recent incident, all the evidence pointed to a malfunctioning firewall component. But all vendor testing and diagnostics indicated the firewall was functioning perfectly. Some hypothesized that we had a very specific denial-of-service attack. Others suggested a failure of Windows networking components within the operating systems of the servers. Others thought we had an unusual virus attack. We tested the simplest explanation by removing the firewall from the network, and everything came back up instantly. It's generally true that complex problems can be explained by a single simple failure.
6. It's very important to set deadlines in the response plan to avoid the "just one more hour and we'll solve it" problem. This is especially true if the outage is the result of a planned infrastructure change. Set a backout deadline and stick to it. This is similar to when I climb or hike; I set a time to turn around. Summiting is optional, but returning to the car is mandatory. Setting milestones for changes in course and sticking to your plan regardless of emotion is key.
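The backout deadline in item 6 can be written down as a rule that depends only on the clock. The following is a minimal sketch in Python, not a description of the tooling we used; the function names, the 22:00 start time and the two-hour window are all hypothetical values chosen for illustration.

```python
from datetime import datetime, timedelta


def backout_deadline(change_start: datetime, window: timedelta) -> datetime:
    """Return the fixed time at which the change must be rolled back."""
    return change_start + window


def next_action(now: datetime, deadline: datetime, resolved: bool) -> str:
    """Decide the next step from the clock, not from how close a fix feels."""
    if resolved:
        return "close incident"
    if now >= deadline:
        return "back out"
    return "keep working"


# Hypothetical example: a change begun at 22:00 with a two-hour backout window.
start = datetime(2010, 10, 4, 22, 0)
deadline = backout_deadline(start, timedelta(hours=2))

print(next_action(datetime(2010, 10, 4, 23, 30), deadline, resolved=False))  # keep working
print(next_action(datetime(2010, 10, 5, 0, 0), deadline, resolved=False))    # back out
```

The deadline is computed once, before the work begins, and the decision function has no input for how promising the current theory looks. That is the point of setting a turnaround time in advance.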


