by Dan Tynan

8 early warning signs of IT disaster

Feb 05, 2018

The same systems keep breaking, shadow IT is on the rise, ideas are no longer flowing — sure misfortune awaits those who ignore early indications of impending IT doom.

[Image: man running from a lightning storm. Credit: IDG/Stephen Sauer]

There may be something rotten in your IT department, and if you don’t deal with it soon you could have a disaster on your hands.

Things may look fine now. But the warning signs are already there; you just haven’t noticed them yet.

The network is suddenly glitchy, simple problems are taking longer to fix, and some things just keep breaking over and over. Every massive code release is followed by a blizzard of bug fixes. Shadow IT is now business as usual. And you’re the last one to hear about changes in business strategy.

By the time your staff walks out, your website goes offline, your users have spun up their own data centers in the cloud, and hackers have put your customer records up for sale on the darknet, it’s too late.

Here are the early warning signs of potential doom — and how to stave them off. Ignore them at your peril.

1. Users stop complaining

You might think getting fewer complaints is a good thing. You’re probably wrong, says Oli Thordarson, CEO of Alvaka Networks, which provides IT services for midsize enterprises that need to run 24/7.

Fewer complaints often means that users have given up hope of getting their issues addressed, he says — and that can lead to all manner of bad consequences.

“A decrease in help requests doesn’t always mean the manager is doing a good job,” he says. “It usually means that the user community has lost confidence in that IT group. What comes next is growing shadow IT, alternate user support mechanisms, and then possibly firings and personnel shuffling.”

Thordarson says that when users complain, it means they expect the IT shop to be responsive to their needs. Every organization has some number of open tickets at any time; the trick is establishing a baseline for complaints, and then paying close attention if that number changes dramatically.

An uptick in complaints could be due to a big upgrade or other major change; a declining number of support tickets could result from a significant process improvement or some long-standing problem being resolved.

“But if you can’t answer the question of why they’re going up or down, that means you have a big problem,” he says.
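One way to picture the baseline idea is a small check that compares the latest ticket count with recent history. This is only a sketch: the function name, the weekly granularity, and the two-standard-deviation threshold are all assumptions for illustration, not anything Thordarson prescribes.

```python
from statistics import mean, stdev


def ticket_volume_signal(history, current, threshold=2.0):
    """Classify a ticket count as 'spike', 'drop', or 'normal'.

    history:   weekly ticket counts from a period considered normal
               (at least two values)
    current:   the count for the week being checked
    threshold: how many standard deviations count as a dramatic
               change (an assumed cutoff, tune to taste)
    """
    baseline = mean(history)
    spread = stdev(history)

    if spread == 0:
        # Perfectly flat history: any difference is a deviation.
        if current == baseline:
            return "normal"
        return "spike" if current > baseline else "drop"

    z = (current - baseline) / spread
    if z >= threshold:
        return "spike"
    if z <= -threshold:
        return "drop"
    return "normal"
```

A "drop" result is not good news by itself. It is a prompt to find out why the number moved, which is the question Thordarson says every IT shop should be able to answer.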

2. The lunch room is suddenly full of strangers

If you are grabbing lunch at work and are surrounded by people you don’t recognize, there’s a good chance your organization acquired another company and didn’t tell you.

That acquisition might be good for the organization, or it might not. Either way, your team will likely need to set aside strategic projects and spend time integrating the newly acquired company’s systems and data. That can cut into your ability to innovate.

Deep Varma, vice president of engineering at real estate site Trulia, saw this firsthand when he worked at Yahoo during the mid-2000s. That was around the time when the search portal acquired adtech firm Overture, along with a host of other smaller companies.

“Yahoo bought many small and big companies, so most of our time there was spent on integration, not on finding ways to improve the relevancy of search keywords and quality,” he says. “When I was there, my staff was always, ‘Oh my god, I’m spending so much time just doing integration’. That slows down innovation a lot.”

Granted, a lot of that is out of your control — you can’t exactly tell the CEO to stop acquiring companies. But you can integrate the parts the business leaders really need, such as analytics, while keeping products, roadmaps, and business units separate.

“Zillow Group [Trulia’s parent company] has done many acquisitions over the years, but our strategy generally has been to create a portfolio of brands that can stand on their own,” he says.

While Varma remains loyal to the Yahoo he once knew, he says the company didn’t think strategically about how each acquisition fit into its overall business, and stopped innovating as a result. That led to its ultimate demise.

3. You keep fixing the same problems

It’s rarely a single dramatic failure that brings an organization’s IT team to its knees; more often it’s the subtle, inexorable accumulation of technical debt.

“Hidden work with late nights, minor but unexplained outages, simple tasks taking increasingly longer to complete — a death by a thousand paper cuts is all too frequently occurring in organizations,” says Adam Serediuk, director of operations at xMatters, a notification and collaboration platform.

A certain amount of inefficiency is inherent in any organization, and most processes trade efficiency for effectiveness, Serediuk admits. But when the same systems continue to break over and over, and no one takes proactive steps to prevent it from happening, it creates a hole that’s incredibly difficult to climb out of. The result is usually employee burnout and high levels of attrition.

“There’s always a moment when somebody decides to leave an organization,” he says. “Like when they’ve spent their entire week dealing with the same problem for the 10th time, and a recruiter sends them a message on LinkedIn. It’s like, ‘You know what? I’ve had enough of this.’ And they move on.”

The best solution is to ditch the old problematic systems and start fresh with new ones, if you can.

“It’s easy to fall into the trap of sunk cost fallacy, when the right approach is staring you in the face: Rebuild and make it better with the learned knowledge from that experience,” he says. “Technology changes too quickly to carry the mistakes of the past forward.”

4. You’re shipping too much code

When you ship huge monolithic chunks of code, you vastly increase the chances something will go wrong — and risk a cascade effect that can bring down the entire system, says Bruno Connelly, vice president of engineering for LinkedIn’s site reliability team.

“While it’s tempting to knock everything out at once, big chunks of code with tons of tiny changes are much more complicated to work with,” he says. “And when something does go wrong, it can trigger other, more systemic failures.”

It’s much better to ship smaller amounts of code with relatively few changes, and ship it more frequently, he says.

“We have optimized our systems to ship code as often as we can,” he says. “We try to ship little bits of code constantly. That really ups our game in terms of validating that everything still has the same performance characteristics and downstream dependencies.”

The social network for professionals also makes sure it’s prepared for unexpected system failures by deliberately simulating them. Last November, the site launched its LinkedOut framework, which allows reliability engineers to artificially trigger failures in an application to see how gracefully the service handles them.

Once a day, LinkedIn also forces one of its primary data centers to fail over, just to make sure it has sufficient capacity and the automation in place to withstand an actual data center disaster.

“If you’re not superconfident about your ability to survive a failover scenario, that’s another warning sign,” he adds. “You need to get comfortable with embracing failure by doing it constantly.”
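A forced failover is the real test, but a rough paper check can show beforehand whether the remaining sites could absorb the load. The sketch below is not LinkedIn's tooling. The function name, the data center names, and the requests-per-second figures are hypothetical, and it assumes traffic can be spread freely across the surviving sites.

```python
def unabsorbable_failures(capacity, load):
    """Return the data centers whose loss the others could not absorb.

    capacity: dict of data center name -> max requests/sec it can serve
    load:     dict of data center name -> requests/sec it serves now
    """
    total_load = sum(load.values())
    at_risk = []
    for dc in capacity:
        # Capacity left if this one data center goes dark.
        remaining = sum(c for name, c in capacity.items() if name != dc)
        if remaining < total_load:
            at_risk.append(dc)
    return at_risk


# Hypothetical numbers: three equal sites running at 60% each.
capacity = {"east": 100, "west": 100, "central": 100}
load = {"east": 60, "west": 60, "central": 60}
print(unabsorbable_failures(capacity, load))
```

An empty list only says the arithmetic works out. It says nothing about whether the automation will actually move the traffic, which is why the drill itself still matters.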

5. Employees stop coming to you with ideas

When you challenge your team to tackle tough problems or come up with new strategies and all you hear is crickets, you know you’ve got a serious morale problem on your hands.

“If managers and users are coming to the CIO with ideas and enthusiastic solution proposals, that manager is doing a great job of leadership and management,” says Thordarson. “When users quit approaching with new ideas, they’ve either lost confidence in their CIO or they’ve created shadow IT.”

This could stem from the manager’s failure to encourage a culture of collaboration and experimentation, lack of maturity, or ego.

“I’ve seen companies where the whole IT team seems to have contempt for everybody else,” he adds. “If you start thinking your employer is just a host for you to ply your trade and geek out on new technologies, you’re not a very good asset to your company and it’s probably time to bring in a new leader.”

Employees could be reluctant to offer up new ideas because they’re simply worn out, says Serediuk. 

“When teams are burnt out you encounter a massive reluctance to change, even if that change improves their own lives,” Serediuk says. “They’re going to assume it will fail, because that’s been their experience so far. Every change so far has made their lives worse, so why would this one be any different? You need to be able to see that and respond appropriately to it.”

6. You’ve fallen off the cc: list

With IT management, no news is definitely not good news. If you aren’t clued in to important management decisions or participating in C-level strategy sessions, you’ve got a problem.

“Not getting invited to top-level executive management meetings is a key sign you’ve been disintermediated and are no longer relevant to the company,” says Thordarson. “It’s clear they don’t trust you and don’t think you have anything to contribute.”

Some blame goes to IT managers who don’t realize that, in order to gain respect from management, they need to frame technology issues in terms of business outcomes, adds Thordarson.

“You can’t just say, ‘We need new routers because the network’s really slow,’ or new software because you have to rebuild the database every night,” he says. “But if you tell them that rebuilding the database every night is costing them $2 million a year, you know they’re going to respond.”

Too often CIOs become infatuated with infrastructure and lose track of the bigger business picture, says Doug Bordonaro, chief data evangelist for ThoughtSpot, an AI-driven analytics company.

“Traditionally, CIOs have focused on security, compliance, data management, and other foundational tasks,” he says. “That’s no longer good enough in today’s digital economy. If you’re not spending an equal amount of time on monetizing data, enabling the line of business, and evangelizing the power of data throughout your organization, you may not be the CIO for long.”

7. Your team is suffering from alert fatigue

IT managers know they need to constantly monitor critical business systems in real time. But having too many alerts is almost as bad as having none at all. 

“You might have 100 servers or 5,000, but your monitoring dashboard always has 30 open alerts,” says Serediuk. “They could just be informational, or known issues, but you still have these 30 red boxes staring you in the face. So when the one critical alert pops up, how will you be able to separate it from the 30 that are just noise?”

There are two potentially serious problems with alert fatigue, says John Bruce, head of solution engineering at SignalFx, a cloud-based monitoring platform. One is that IT managers eventually ignore noisy alerts, including potentially serious ones. The other is burnout and attrition.

He recalls visiting a prospective SignalFx client that was still using legacy tools to monitor a dynamic cloud-based hosting platform. 

“The systems they had in place to do the monitoring were so noisy that their operational folks were completely burned out,” says Bruce. “If you’re constantly getting paged at 3 or 4 in the morning with false alarms, that’s not a good feeling.”

Managers need to go through their backlog of issues and prioritize them, giving the most weight to issues that can impact customers, since customer-facing problems will in turn affect the managers themselves, says Serediuk.
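As a rough illustration of that kind of triage, the sketch below drops informational and known-issue alerts and sorts what is left so customer-facing problems come first. The Alert fields and the three-level severity scale are assumptions made for the example; a real monitoring tool will have its own schema.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    customer_facing: bool
    severity: int          # assumed scale: 1 = informational, 2 = warning, 3 = critical
    known_issue: bool = False


def triage(alerts):
    """Return actionable alerts, most urgent first.

    Informational alerts and known issues are treated as noise and
    filtered out; customer-facing alerts outrank internal ones, and
    higher severity breaks ties.
    """
    actionable = [a for a in alerts if a.severity > 1 and not a.known_issue]
    return sorted(
        actionable,
        key=lambda a: (a.customer_facing, a.severity),
        reverse=True,
    )
```

Whether a known issue should ever be hidden is a judgment call. The point of the sketch is that the one critical alert should not have to compete with 30 red boxes for attention.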

The key is to be proactive, using metrics as early warning signs before problems start to impact users, says Bruce.

“You need early indicators that say, ‘OK, this service looks like it’s on a path to degrade; what can I do to prevent this?’ instead of, ‘OK, the server and client services are down; we need to jump in and fight this fire.’”
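One simple form of early indicator is a trend projection: fit a straight line to recent samples of a metric and estimate how long until it crosses a limit. The sketch below assumes evenly spaced samples and a roughly linear trend; the function name and the latency figures in the usage note are hypothetical, and this is not a description of how SignalFx or any other product works.

```python
def samples_until_breach(values, limit):
    """Estimate how many more samples until the trend reaches `limit`.

    values: evenly spaced metric samples, oldest first (at least two)
    limit:  the level at which the service counts as degraded

    Returns 0 if the latest sample is already at or over the limit,
    None if the fitted trend is flat or improving, otherwise the
    projected number of sample intervals remaining.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    if values[-1] >= limit:
        return 0

    # Ordinary least-squares line through (0, v0), (1, v1), ...
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    den = sum((x - x_mean) ** 2 for x in range(n))
    slope = num / den
    if slope <= 0:
        return None

    intercept = y_mean - slope * x_mean
    crossing = (limit - intercept) / slope   # sample index where the line hits the limit
    return max(0.0, crossing - (n - 1))
```

With latency samples of 100, 110, 120, 130, and 140 ms and a limit of 200 ms, the projection says about six more intervals remain, which is time to act before anyone has to fight a fire.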

8. The FBI is on your doorstep

Data leaks and security breaches are on every CIO and CISO’s mind, but it’s not always obvious what they should be looking for. Big security problems are often preceded by lots of small signals, says Paul Moreno, cyber security expert and advisor to BugCrowd.

For example: Inexplicable system performance issues or a higher than usual outflow of data could indicate an attacker is trying to exfiltrate information from your company. A sudden spike in login attempts from new locations might mean an attempt to breach your customer database is under way. Unusual requests to your APIs or administrative endpoints may be a sign someone is trying to hack your network.

“If you’re not monitoring for any of the above, that would be a good place to start,” he says. “Having higher sensitivity monitoring and even autonomous triggers, such as lock out, for internal administrative endpoints is critical to any security suit of armor.”
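To make the idea of an autonomous trigger concrete, here is a minimal sketch of a monitor that flags logins from unfamiliar locations and locks an account after repeated failures from them. The class, its method names, and the three-failure limit are all assumptions for illustration. A production system would also need persistence, time windows, and a way to unlock accounts.

```python
from collections import defaultdict


class LoginMonitor:
    """Track login attempts per account and react to unfamiliar locations."""

    def __init__(self, max_new_location_failures=3):
        self.max_failures = max_new_location_failures  # assumed lockout limit
        self.known = defaultdict(set)      # account -> trusted locations
        self.failures = defaultdict(int)   # account -> failures from new locations
        self.locked = set()

    def trust(self, account, location):
        """Mark a location as normal for this account."""
        self.known[account].add(location)

    def record(self, account, location, success):
        """Record one attempt; return 'ok', 'alert', or 'locked'."""
        if account in self.locked:
            return "locked"

        if location in self.known[account]:
            if success:
                self.failures[account] = 0
            return "ok"

        # Unfamiliar location: always worth a look, even on success.
        if success:
            return "alert"
        self.failures[account] += 1
        if self.failures[account] >= self.max_failures:
            self.locked.add(account)
            return "locked"
        return "alert"
```

Keeping the limit tighter for administrative accounts than for ordinary users would be one way to apply the higher sensitivity Moreno recommends for internal endpoints.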

It also helps to be proactive. Implementing two-factor authentication can keep thieves from using stolen passwords. Bug bounty programs can help identify vulnerabilities before the bad guys do, especially if your organization publishes responsible scope and disclosure guidelines. In addition, security intelligence providers can scan the darknet and alert you if they find indications of a compromise that’s available to hackers.

But the surest (and worst) sign?

“Having an FBI special agent or security provider reach out to your organization to check on recently acquired data that matches anything in your database warehouses,” says Moreno. “That’s usually confirmation a data leak has already occurred.”