Opinion: Recovering from a crisis

I’m not the best air traveller in the world. I get anxious, nauseous, and on some occasions a little too queasy for the comfort of my fellow passengers.

Any minor turbulence fills me with a sense of dread, comforted only by my fast-escape aisle seat and the volume cranked up to the max on my iPod to mask those subtle deviations in engine noise – which for some inexplicable reason only I can ever detect.

So against this backdrop of angst and impending doom, why would I be drawn to watching episodes of the compelling TV show Air Crash Investigation?

Perhaps it’s my love of good drama, analytics and forensics and not, heaven forbid, a morbid fascination with disasters.

In business technology today, TV shows like this remind us that however fault tolerant and resilient our hardware, or however unbreakable suppliers claim their solutions to be, disasters will always happen.

And, while technology advances, process improvements and automation have lengthened the time between failures and limited the potential for human error, the spectre of business outages can still be a white-knuckle ride for some CIOs.

Beyond the doom and gloom, watching this show has also taught me the value of incorporating stronger causal analysis into our daily IT operational practices.

This isn’t driven by a fixation with best practice frameworks such as the IT Infrastructure Library (ITIL), but rather by a pragmatic and common sense approach to building the skills needed to determine the exact cause of outages.

We also need to learn from our mistakes, and proactively implement the processes needed to prevent the same disasters happening in the future.

Just watch how the National Transportation Safety Board investigators go about their work on any episode and you’ll understand exactly where I’m coming from.

Here are some key ways to improve the efficiency of your own ‘flight crew’ and calm business users during an operational crisis.

Good communication trumps mass panic. During one mid-air flight crisis a pilot once famously communicated over the intercom to his passengers: “Ladies and gentlemen, this is your captain speaking. We have a small problem. All four engines have stopped. We are doing our damnedest to get them going again. I trust you are not in too much distress.”

This was an example of over-communication, but to me it struck a reasonable balance we can all follow in IT. That is, calmly and authoritatively informing users of what they probably already know, without going into minute detail on the exact nature of the technicalities.

Listen to the crew, not just the captain. Although an airline captain may firmly believe he or she has everything under control (with junior flight members cowering in the background), we can’t rely too much on experts and fall victim to a ‘rock star’ mentality.

Too often I’ve seen cases where service desk and IT operations managers firmly believed they alone could fix the problem without considering sensible workarounds or advice from junior staffers.

Or worse still, the blame games that ensue when a problem is handballed from one crew member to another (from operations to networks to applications, and so on).

These issues persist because of the factional silos upon which we’ve built and maintained IT, but can no longer be tolerated if accelerated delivery cycles and continuous delivery become more entrenched in our thinking.

This, I believe, calls for stronger team cultures where problems are treated as the enemy and where we have a much healthier approach to failure.

To this end, cultivate those IT professionals who can evolve their thinking into how IT services should be delivered and supported, paying closer consideration to agile and more modern DevOps techniques.

Expect the unexpected. The best in aviation technology has been compromised by the strangest of events – from micro-burst storms to insects, from flocks of wild geese to volcanic ash particles.

Similarly in IT, the latest and greatest in converged fabric, software-defined networks and grid computing combined will still succumb to factors that are often beyond our control.

And now with the increased validation and use of cloud service delivery models, social computing and data analytics, we can add a more challenging set of business risk factors to the technical mix.

These include legal and regulatory issues around data privacy, hybrid IT service management, cloud contracts, vendor sourcing and service brokerage.

In God we trust; everything else is automated, right? Not necessarily. If you’ve watched Air Crash Investigation, how many times have you seen pilots become fixated on problems with their advanced instrumentation and automated controls while being unaware that the airplane was gradually spiralling out of control?

Similarly, as we incorporate more automated provisioning and management technologies into daily operations to reduce human error, we too can fall victim to what many refer to as the ‘irony of automation’.

That is, the more reliable our IT systems and applications become, the less attention is paid by our teams, which as a consequence often means failing to react (or even notice) when things are going wrong.

As automation systems incorporate more analytical capabilities, these situations may improve, but in the meantime work with your teams to rigorously sample and test your systems and revisit manual fall-back procedures.

Remember too, the accumulated knowledge your teams have gained over the years of maintaining fragile and legacy infrastructure shouldn’t become lost like some obscure language dialect.

Fixing problems will often be the easy part. It is a well-established fact that, of the time between an IT service failing and being recovered, 90% is spent detecting that you have a failure and then pinpointing the root cause.

But as business expectations increase and the life expectancy of applications shortens to weeks and perhaps even days, the benefits of traditional proactive problem management should be supplemented with design-for-failure development methods.

This is to detect and remediate failure while still maintaining an acceptable level of service – a little like landing a 747 on three engines without interrupting the drinks service.

Accept that even though the best technologies have problems, it’s our time-tested experiences which serve as the basis to manage and learn from them.

Of course automation tools and best practice frameworks can help, but start first by bridging the organisational process divide that can often exist between your development and operations teams.

This comes by enabling trust, through shared problem ownership, and from closer collaboration. Only then can we stop watching air crash re-runs and feel safer in the window seat.

Miriam Waterhouse is the CIO at the National Film and Sound Archive, Canberra

Copyright © 2013 IDG Communications, Inc.
