by Joel Pomales

The value of identifying the ‘failure stream’

Sep 08, 2016
IT Leadership

Understanding the value stream of a process is important. But it's also important to understand the 'failure stream.' Start by asking, 'At what point can things go wrong?'

This is Part 1 of a two-part series related to how enhancing your value stream mapping activities can benefit your IT service management (ITSM) improvement effort.

Much has been written about value stream mapping and the benefit it brings in analyzing the current and future states of products and services, so I won’t go into that here. You can find plenty online about why you should do it to understand your activities and processes.

Lean thinking in IT is not new either, but its principles have been codified and collected in Lean IT, which is now frequently discussed in combination with other disciplines such as DevOps and agile. But I’ve been thinking about something that we may have overlooked, unintentionally, that may be of benefit to many organizations. In my opinion, we’re not thinking about the “failure stream” enough.

You see, when we develop a value stream map, we lay out our activities (normally after a SIPOC analysis) and add our timing information: lead time and processing time. Then we analyze how an object (a widget, an incident, a change) travels from Point A to Point B, and how we can move it through the value chain faster and more efficiently by identifying the issues that stand in the way, like waste.
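To make the timing part concrete, here is a minimal sketch of how lead time and processing time could be compared across a value stream. It assumes the common convention that flow efficiency is total processing time divided by total lead time. The step names, the numbers, and the helper names `flow_efficiency` and `biggest_wait` are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    processing_minutes: float  # time someone is actively working the item
    lead_minutes: float        # total elapsed time, including waiting


# Hypothetical incident-handling stream; names and numbers are made up.
steps = [
    Step("Intake at service desk", processing_minutes=10, lead_minutes=30),
    Step("Triage", processing_minutes=15, lead_minutes=120),
    Step("Resolution", processing_minutes=60, lead_minutes=240),
]


def flow_efficiency(stream):
    """Share of the total lead time spent actually working on the item."""
    total_processing = sum(s.processing_minutes for s in stream)
    total_lead = sum(s.lead_minutes for s in stream)
    return total_processing / total_lead


def biggest_wait(stream):
    """Step with the largest gap between lead time and processing time."""
    return max(stream, key=lambda s: s.lead_minutes - s.processing_minutes)


print(f"Flow efficiency: {flow_efficiency(steps):.0%}")
print(f"Longest wait is in: {biggest_wait(steps).name}")
```

The step with the longest wait is usually the first place to look for waste, although the numbers alone will not tell you why the item is waiting.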

Things break, and in IT this happens every day. So on top of any waste factors that we may have in IT that affect the efficient execution of an activity, sometimes stuff happens. For all of the agile and DevOps goodness we may apply to many of the things we do in IT, at some point in time something may happen that will irremediably break the value chain. If we don’t account for these things, we may not be prepared to design and respond around them.

I can’t quite claim to be an expert in value stream mapping, but it occurred to me that if we look at a VSM and mark the possible fail points, we can also take action to improve other areas based on where we think those points are.

So this is what I was thinking about: Take any VSM of an IT service and draw it like you normally would. Enter the processes, activities, players, lead time, processing time and what have you. This does not change one bit. At some point in time, you may want to draw a line in the diagram that asks the following question:

How does this fail?

Think about it as a virtual decision box. Look at your value stream map and ask yourself, knowing what you know about the process, “At what point can things go wrong?” Is it at the beginning because of, say, poor intake practices of an incident at the service desk? Is it in the middle because of too many steps, or a choke point in the process (for example, a single person approving things, or a single subject matter expert who has to handle all incidents)? Can you identify a resource constraint that can lead to the process failing?

You might not know everything about the process, so this is the part where you ensure that you have people who can answer these things for you and give you the insight you need. You want to have people from the business, developers, database administrators and operations people (especially service desk resources) to tell you where things can break. Doing this without including the business and every element of your IT organization is a recipe for misinterpretation and failure.

You don’t need to overpopulate your VSM with a list of the things that can fail in a process. Just use an icon that everyone agrees on (a red X, a sad smiley) and list those issues on a separate piece of paper — or tab, if you’re doing it electronically. Now that you’ve done this, you’ve created a couple of “forks” in the road that, in theory, go down on your VSM. Now you have information on several “fail” states.

Now you have something that broke and deviated from the ideal path you set in the VSM. How does it come back to it? How does your organization respond and make it right? This is where your best minds start thinking, at a high level, “How will we fix it?” Is it a new code push? A server restart? Reaching out through managed communication efforts via the service desk? The list can be extensive, but I would suggest focusing on a couple of actions (two or three) that correct that failure. Remember, you want to recover fast, not get to the root cause of the issue.

The last step is to visualize what success looks like. If you have failure X, and you apply action Y to it, what does Z (success) look like? It may not be pretty, or ideal, but you need to visualize it and understand what it looks like. For example, we have a running application or server (even in a degraded state), or we have a (mostly) satisfied customer after fixing his or her problem.
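If you keep the list of failures electronically, the failure X, action Y, success Z idea can be captured in a very small data structure. What follows is only a sketch under my own assumptions, not a standard VSM notation: the class names `FailPoint` and `VsmStep`, the helper `unprepared`, and the sample entries are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FailPoint:
    failure: str                 # X: how this step can fail
    recovery_actions: list = field(default_factory=list)  # Y: quick fixes
    success_looks_like: str = ""  # Z: the state you accept as recovered


@dataclass
class VsmStep:
    name: str
    fail_points: list = field(default_factory=list)


# Hypothetical stream; every entry below is made up for illustration.
stream = [
    VsmStep("Intake at service desk", [
        FailPoint(
            failure="Incident logged with missing details",
            recovery_actions=["Call the user back", "Use an intake checklist"],
            success_looks_like="Ticket has enough detail to be assigned",
        ),
    ]),
    VsmStep("Change approval", [
        FailPoint(failure="Single approver is unavailable"),
    ]),
]


def unprepared(steps):
    """Return (step, failure) pairs that lack a recovery action or a success state."""
    gaps = []
    for step in steps:
        for fp in step.fail_points:
            if not fp.recovery_actions or not fp.success_looks_like:
                gaps.append((step.name, fp.failure))
    return gaps


for step_name, failure in unprepared(stream):
    print(f"No recovery plan yet: {step_name} -> {failure}")
```

Anything the helper reports is a fork in the road that you have identified but not yet planned a way back from.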

Putting all of these things on a VSM requires you to understand a few things.

It requires you to critically look at your activities in the VSM and admit, in all honesty, where things can go bad.

It also requires you to involve people from all over your IT organization. (That bears repeating many times; you can’t do this in a vacuum!)

And it requires you to think about improvements. If, after carrying out this exercise, you know how things fail, you can right then and there identify how to instigate improvement initiatives, possibly before things break in the first place!

Now, I’m not going to say that the number of things that can potentially “break” your services has to go down to zero. Things will happen, and you might not have prepared for them even after performing this analysis. But doing this, constantly, will give you good improvement information for your services and your resources.

I would also like to add that you don’t want to do this from a panic perspective. In other words, doing it after something breaks would belong to a problem management or improvement initiative instead. Rather, you want to do this periodically whether something is broken or not. You can hold a work session every two weeks, for example, or you could do it through an online forum mechanism (think Google Groups or SharePoint).

I do hope that this can help your organization identify potential failure points and see where you can make improvements. In my next post, I want to talk about moments of truth, and how you can use them in combination with a VSM to identify and initiate improvement opportunities in your IT organization.