So here’s what I’ve been thinking about lately. There is a difference between a user-facing incident and a service-facing incident. My favorite metaphor for this is a hospital emergency room. There are many services offered within the ER besides the ER itself. There’s radiology, lab analysis, CT scans and so on. These services are used by many of the patients in the ER, and the loss or degradation of any one of them is detrimental to the operation of the ER. If you have four radiology machines and one breaks down, you will not be able to handle as many patients as you could with all four. That can lead to long wait times… and unhappy patients.
Same thing happens in IT. You have email, collaboration software, HR, time and expense services and so on. These are things that (most) everybody uses in the company. The loss or degradation of one of these services presents an impact to the entire company that can lead to the loss of business, unhappy workers and all-around woe.
In both cases there is a known quantity of users that consume the services at any given time. The ER probably calculates how many radiology machines it needs based on past usage. In IT we know (we should know!) how many people are using any given service at any given time, when the peaks and valleys in usage are and what a degradation or an outage looks like. These are elements that come from design. We should know, for example, that there are 150 employees in the company and that all 150 use email versus the five that use, say, the accounting software.
So service-facing incidents are normally about business-impacting issues: they cover many users and are directly related to supporting many users. But there may be service-facing incidents that users never, ever notice. For example, a power supply goes down on a server that has redundancy, or the same thing happens to a hard drive in a storage array. They’re incidents, but they’re either (a) quickly corrected or (b) happening in a high-availability environment (the cloud, for example) where users will never experience downtime. The opposite is also true: if a server of a sufficient size goes down, or we lose the network link to the internet and to the cloud, we’re dead in the water.
What about the user then?
(There will be some slight ITIL bashing now)
ITIL, in its most current version (and this can change in version 4), doesn’t quite make a distinction between something that affects a user versus something that affects a service. The formal definition for incident in the ITIL framework reads:
“An unplanned interruption to an IT Service or reduction in the quality of an IT service. Failure of a configuration item that has not yet affected service is also an incident — for example, failure of one disk from a mirror set.”
Which is a fine definition, but it doesn’t quite address things from a user’s perspective. What follows in that particular chapter of the framework also aligns with this thinking. This is where I believe ITIL falls short. Other frameworks and approaches also lump user and service incidents together in one way or another. (The only “framework” that presents a different way of thinking here is Ian Clayton’s “The Guide to the Universal Service Management Body of Knowledge”, which has a Service Request life cycle as a way to request something from the service provider organization that may result in the consumption of services (p. 211).)
Here’s why I think that we should start separating these two types of incidents.
For users the impact can’t be low. The impact is always high. From a customer’s perspective, if something happens where he or she can’t work, the party impacted is always one. Just like when you go to the hospital and you have a broken bone. It is you that’s hurting. Your impact is high because you have a broken bone and you’re hurt, but what happens if a guy comes in at the same time with a heart attack? The other guy’s dying, he’ll probably get sorted out first even though the impact for both is high for different reasons.
So if a user is experiencing an issue or an error that keeps him or her from working, it really doesn’t matter to that user if the service is up and running and Susie down the hall can work. He can’t. He can’t work, he can’t be productive. So knowing that the impact is always high, it will be up to the Service Desk to work with the user to determine the urgency. Is the user under 100 percent work stoppage? Can the user limp along using a web interface as opposed to a ‘fat’ client? Can he work from another computer? Can he use the printer down the hall? This helps set expectations, but the impact is always high.
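To make the idea a bit more concrete, here is a minimal Python sketch of a user-facing incident where impact is fixed at high and only urgency is worked out with the user. Everything in it is my own illustration: the names (`Triage`, `assess_urgency`, `UserIncident`) and the mapping from answers to urgency levels are assumptions, not something taken from ITIL or from any particular service desk tool.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass(frozen=True)
class Triage:
    """Answers the Service Desk collects from the user (hypothetical fields)."""
    full_work_stoppage: bool    # is the user 100 percent unable to work?
    workaround_available: bool  # web interface, another computer, another printer


def assess_urgency(triage: Triage) -> Level:
    """Map triage answers to an urgency level. The rules are an assumption."""
    if triage.full_work_stoppage and not triage.workaround_available:
        return Level.HIGH
    if triage.workaround_available:
        return Level.MEDIUM
    return Level.LOW


@dataclass(frozen=True)
class UserIncident:
    user: str
    summary: str
    triage: Triage

    @property
    def impact(self) -> Level:
        # For a user-facing incident the impact is never negotiated: always high.
        return Level.HIGH

    @property
    def urgency(self) -> Level:
        # Urgency is the only thing the Service Desk works out with the user.
        return assess_urgency(self.triage)


incident = UserIncident(
    user="pat",
    summary="Mail client will not open",
    triage=Triage(full_work_stoppage=False, workaround_available=True),
)
print(incident.impact.value, incident.urgency.value)  # prints "high medium"
```

The point of making impact a read-only property is that nobody gets to argue it down. The conversation with the user is only ever about urgency.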
There should be a logical link between user- and service-facing incidents, though. They’re not completely separate. This link can be made the same way in which we link incidents and problems. They influence each other, but one does not directly cause the other. Also, many (probably most) service-facing incidents should be created and handled by automation.
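Here is a rough Python sketch of what that link could look like: two separate record types that reference each other by ID, much like an incident references a problem. The class names, the fields and the `link` helper are hypothetical, and the `created_by="monitoring"` default is only my way of showing that service-facing incidents would usually be opened by automation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ServiceIncident:
    id: str
    service: str
    created_by: str = "monitoring"  # assumption: usually opened by automation
    related_user_incidents: List[str] = field(default_factory=list)


@dataclass
class UserIncident:
    id: str
    user: str
    service: str
    related_service_incident: Optional[str] = None


def link(user_incident: UserIncident, service_incident: ServiceIncident) -> None:
    """Relate the two records in both directions without merging them."""
    user_incident.related_service_incident = service_incident.id
    if user_incident.id not in service_incident.related_user_incidents:
        service_incident.related_user_incidents.append(user_incident.id)


# A service-facing incident raised by monitoring, and a user who calls in.
outage = ServiceIncident(id="SVC-1", service="email")
call = UserIncident(id="USR-7", user="pat", service="email")

link(call, outage)
print(call.related_service_incident, outage.related_user_incidents)
```

Each record keeps its own life cycle. Linking them only records that they are related, so resolving the service incident does not automatically close the user's incident, and the other way around.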
Maybe one of these days an adventurous soul will try this. If I do, I will let you know. I think there’s value in this, and it allows your organization to be more user-centric.
Categorization changes, too, and can be made simpler and more meaningful for both users and technical resources. But that is the topic of my next article. Stay tuned!