Incident Response: How To Keep Tech Problems from Becoming People Problems

BrandPost By Sidharth Suri
May 18, 2017
Cloud Computing

In your rush to resolve incidents, don’t forget the most important element: communication.

When one of your IT services is on fire, there is no time to waste, especially if that fire is blocking your users from getting stuff done. Rapid resolution tends to eclipse all else during an incident, often causing your team to ignore or forget pieces of the incident response process, like keeping people in the loop.

It’s one of those little problems that compounds into a big one if not handled correctly. Pretty soon, you’re stuck in an endless loop of shoulder-taps and email threads, trying to explain to the CEO why things went wrong. While there’s no shortage of tools to help your team detect, alert, swarm on, and resolve incidents, even the best tools can’t replace clear communication to internal and external stakeholders.

And let’s be real: The stakes can be high, very high. Reputation, customer attrition, and time spent on damage control are all on the line, just to name a few.

Luckily, downtime doesn’t have to turn into a customer service nightmare. Informed users are happy users. But first you need to know who to communicate with, how to reach them, and how to do it with the least friction and fewest resources possible.

Communication during times like this is like ripples from a rock tossed into a pond. The circles closest to the incident get the biggest, most frequent, and most immediate feedback. This is your core on-call team, AKA the folks who need to identify and fix the problem. It’s a small circle, but the ripples (communication) need to be big, immediate, and frequent. As you move further from the core circle (to adjacent IT teams, managers, the organization as a whole, end users, and the general public), the audience gets bigger, but the ripples get smaller and less frequent.

While every organization is different, in general it helps to think of these audiences as 5 distinct groups that need to be communicated with:

  1. Core on-call team: The first to know something is wrong, almost immediately upon impact (usually from monitoring and alerting tools).
  2. Front-line support team: Those who will be directly answering questions and giving customers updates during the incident. It’s an incredibly important role, so this team must get the right information to pass along to end users.
  3. Managers and executive team: The core team needs to communicate with this group so they know what’s going on, the potential impact on the following two groups, and hopefully an estimate of how long it could last.
  4. General employee population: Employees need to be kept informed as services they rely on go down and come back up. Proactively communicating with these users means fewer “what’s the status of this” questions, fewer duplicate IT support tickets, and more focus on fixing the problem at hand.
  5. External customers: If the incident affects external customers, send out a communication that explains the problem and when they can expect a fix, or at least commits to an update at a regular interval. For issues that are still affecting your customers’ ability to use your product, we recommend never going more than one hour without sending an update. You should also always indicate when to expect the next update. If the incident is severe enough, especially one involving security or data loss, you will want to expedite external comms and pull in the other necessary teams (legal, HR, security, etc.).
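The one-hour rule for external updates can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the sample message, and the "UTC" label are hypothetical and are not part of xMatters, StatusPage, or any other tool.

```python
from datetime import datetime, timedelta

# From the recommendation above: never go more than one hour between updates.
MAX_UPDATE_GAP = timedelta(hours=1)


def next_update_due(last_update: datetime,
                    gap: timedelta = MAX_UPDATE_GAP) -> datetime:
    """Return the time by which the next customer update must go out."""
    if gap > MAX_UPDATE_GAP:
        raise ValueError("gap exceeds the one-hour ceiling")
    return last_update + gap


def format_update(message: str, sent_at: datetime,
                  gap: timedelta = MAX_UPDATE_GAP) -> str:
    """Append the promised time of the next update to a status message.

    Assumption: timestamps are already in UTC (hypothetical convention).
    """
    due = next_update_due(sent_at, gap)
    return f"{message} Next update by {due:%H:%M} UTC."


def is_overdue(last_update: datetime, now: datetime) -> bool:
    """True when more than the allowed gap has passed since the last update."""
    return now - last_update > MAX_UPDATE_GAP
```

A check like `is_overdue` could feed a reminder to whoever owns communication, so the promise made in the previous update is not missed while the team is heads-down on the fix.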

xMatters and StatusPage are tools that sit at an interesting intersection: they integrate solutions across your technology stack and then communicate status information out to drive workflow. With some of the biggest cloud companies as customers, we’ve seen how the highest-performing IT teams resolve incidents more efficiently while keeping users happier through a solid incident communication plan.

Creating your own incident communication plan:

Before an incident:

  • Define priority/severity levels (how many users are affected, how long the incident lasts, etc.)
  • Create incident templates for common issues to save time between detection and communication
  • Document defined roles during an incident (how to identify the incident commander, who owns the communication, etc.)
  • Determine how to communicate with affected users (what channels will be used for each priority level, etc.)
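One way to capture the first and last items above is to write the severity levels and their channels down as data before anything breaks. The sketch below is a hypothetical example: the level names, user-count thresholds, and channel names are placeholders to adapt, not recommendations from any tool.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SeverityLevel:
    name: str
    min_users_affected: int      # hypothetical threshold
    channels: Tuple[str, ...]    # where updates are posted for this level


# Ordered from most to least severe; all values are illustrative assumptions.
LEVELS = (
    SeverityLevel("SEV1", 1000, ("status_page", "email", "twitter", "in_app")),
    SeverityLevel("SEV2", 100, ("status_page", "email")),
    SeverityLevel("SEV3", 0, ("status_page",)),
)


def classify(users_affected: int) -> SeverityLevel:
    """Return the most severe level whose threshold the incident meets."""
    for level in LEVELS:
        if users_affected >= level.min_users_affected:
            return level
    raise ValueError("users_affected must be non-negative")
```

A real plan would likely weigh duration and affected services as well as user counts. The point is that the mapping is decided ahead of time, so nobody has to debate channels mid-incident.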

During an incident:

  1. Communication with first responders: Alert those “on-call” and make sure they know where to go for more information about the problem. A tool like xMatters can help drive resolution by relaying data between systems while engaging the right people. This way, you never have to worry about keeping your technology infrastructure aligned with key resolution processes.
  2. Communication with affected users (both internal and external) and other stakeholders (e.g., executives): Use your pre-determined channel(s) to tell users what’s going on. This may be e-mail, a blog, Twitter, or a status page where they can subscribe to notifications about the services they care about most. Whatever tools you choose, we recommend that you identify one as your primary communication vehicle and funnel everyone there from the other channels. For example, we have a dedicated status page, but we also tweet out updates and display a notice in our webapp during downtime. The tweets and in-webapp notices funnel users back to the status page for the full story.

After an incident:

  • Hold a retrospective on the incident and figure out what (if any) post-incident comms are necessary — as well as what you can do to prevent similar incidents from happening again.
  • If necessary, send out your postmortem to affected users. A good postmortem can actually generate a lot of goodwill with your customers. Ideally it will enable you to:
    • Apologize personally
    • Explain exactly what happened and how your team was able to fix it
    • Talk about your plan to avoid a similar situation in the future

Even 99.99% uptime means about 52 minutes of downtime a year. Every IT team should be prepared for those 52+ minutes. Providing legendary service isn’t just about resolving incidents quickly; it’s also about keeping users informed while you do. Learn more about using xMatters for IT alerting and StatusPage for IT incident communication and see how they can work together to increase transparency.
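The downtime figure for 99.99% uptime follows from simple arithmetic, and the same calculation works for any availability target. A minimal sketch, assuming a 365-day year (the function name is illustrative):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(uptime_percent: float) -> float:
    """Minutes of downtime per year implied by an uptime percentage."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return (100 - uptime_percent) / 100 * MINUTES_PER_YEAR


# 99.99% uptime leaves 0.01% of 525,600 minutes, about 52.56 minutes.
print(round(downtime_minutes_per_year(99.99), 2))
```

By the same calculation, 99.9% uptime allows about 525.6 minutes (under nine hours) of downtime a year.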