Artificial intelligence and machine learning can slash the number of false alerts that tie down operations staff, speed troubleshooting of problems, and help developers and architects understand and manage fast-changing, cloud-based IT environments.
But CIOs should not expect what some vendors imply are “magic” results, such as automatically predicting and fixing any conceivable IT issue, or accepting any log or event stream and analyzing it without data cleansing or normalization.
AIops is the use of artificial intelligence to manage, optimize, and secure IT systems more quickly, efficiently, and effectively than with manual processes. Market researcher Gartner estimates that the AIops market ranged between $900 million and $1.5 billion in 2020 with a compound annual growth rate of around 15% between 2020 and 2025. Along with standalone AIops platforms, many IT observability, management, and monitoring tools integrate with AIops platforms or have added AI capabilities to their products.
AIops is best, according to customers and analysts, at quickly scanning massive amounts of data from hundreds or thousands of sources to filter out the most important alerts or identify underlying trends, as well as quickly detecting new elements such as application programming interfaces (APIs) that link applications, those “things that human intelligence can no longer handle,” says Sean Mack, CIO and CISO at Wiley, a global leader in research and education. It is ideal, he says, for providing insights into IT issues among “the exponential growth of the complexity of our systems and services,” with virtualized elements that “may be there one second and may not be there another second.”
But AIops efforts can fail if businesses don’t understand the technology’s limits.
Where AIops excels
Identifying patterns. A common and successful use of AIops is to reduce the “noise” from alerts that either duplicate other alerts, reflect normal changes in the IT infrastructure, or don’t affect critical business processes.
Intelligent analysis of operational data can identify common patterns, such as a surge in traffic early in the day when users log on or during quarterly financial closes, to understand which patterns are normal and which might signal problems, says Stephen Elliot, group vice president at market researcher IDC. It can also identify recurring problems such as overloaded servers to help operations staff apply a fix before the issue affects users. Correlating multiple alerts to a single underlying problem can also reduce the load on operations staff and speed root cause analysis of issues, he says.
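The alert correlation Elliot describes can be pictured with a minimal sketch: alerts that share a source and arrive close together are grouped into a single incident, so operations staff see one problem instead of many duplicates. This is an illustrative simplification, not any vendor’s actual algorithm; the alert fields and incident naming are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical alert records: (timestamp, source host, alert text).
Alert = tuple[datetime, str, str]

def correlate(alerts: list[Alert], window: timedelta) -> dict[str, list[Alert]]:
    """Group alerts that share a host and arrive within `window` of the
    previous alert from that host, so one underlying fault surfaces as
    a single incident rather than a stream of duplicates."""
    incidents: dict[str, list[Alert]] = defaultdict(list)
    last_seen: dict[str, datetime] = {}      # most recent alert per host
    key_for_host: dict[str, str] = {}        # current incident per host
    counter = 0
    for ts, host, text in sorted(alerts):
        prev = last_seen.get(host)
        if prev is None or ts - prev > window:
            counter += 1                     # gap too large: new incident
            key_for_host[host] = f"incident-{counter}"
        last_seen[host] = ts
        incidents[key_for_host[host]].append((ts, host, text))
    return dict(incidents)
```

With a five-minute window, three alerts from one database host collapse into one incident while an unrelated web-server alert half an hour later opens another, which is the kind of noise reduction the customers above report.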
While “early in [its] AIops journey” using New Relic’s observability platform, pharmaceutical distributor AmerisourceBergen has seen a two-thirds reduction in alerts that do not need action, allowing its engineers to focus on important issues, better prioritize incidents, speed root cause analysis and increase application availability, says Vice President of IT Operations Paul Stuart. At Wiley, Mack’s staff used Dynatrace’s AIops capabilities to reduce the number of false positives by more than 50 percent. When issues do occur, Wiley has reduced its mean time to resolution by more than 37 percent, which Mack calls “a huge, huge improvement.” All this allows his team, he says, to devote more time to improving the customer experience and delivering innovative new services.
Monitoring and tracking. AIops can also make it easier for operations staff to track changes in their IT environment, monitor its performance, and cost-effectively manage larger environments. “We are currently in the middle of a large acquisition,” says Stuart. “By leveraging AIops, we can take on additional monitoring load without a substantial increase in headcount.”
Airport parking provider Park ’N Fly uses the Dynatrace AIops platform to monitor its own IT infrastructure as well as APIs that provide information from partners, such as those allowing customers to track the location of its shuttle buses and purchase maintenance for their vehicles while they’re traveling, says Senior Director of IT Ken Schirrmacher. Dynatrace also automatically discovers new components like servers Park ’N Fly hosts in the cloud, “analyzes its behavior such as the data it is accessing and the other applications it sends that data to,” creating a web topology that tracks how components of its IT infrastructure integrate, he says.
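The automatic topology mapping Schirrmacher describes amounts to building a dependency graph from observed traffic: each traced call between two components becomes an edge, and repeated observations collapse into a map of who depends on whom. A minimal sketch, with component names that are hypothetical rather than Park ’N Fly’s actual services:

```python
from collections import defaultdict

# Hypothetical observed calls between components, e.g. extracted from
# traced requests: (caller, callee) pairs, with duplicates.
observed_calls = [
    ("web-frontend", "booking-api"),
    ("booking-api", "payments-db"),
    ("booking-api", "shuttle-tracker-api"),  # partner-facing API
    ("web-frontend", "booking-api"),         # repeat observation
]

def build_topology(calls):
    """Collapse repeated call observations into a graph of unique edges:
    caller -> set of components it was seen talking to."""
    graph = defaultdict(set)
    for caller, callee in calls:
        graph[caller].add(callee)
    return dict(graph)

topology = build_topology(observed_calls)
# topology["booking-api"] now lists everything booking-api depends on.
```

A newly discovered cloud server simply shows up as a new node the first time it appears in a call, which is what lets such tools track fast-changing environments without manual inventory work.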
One use for AIops at Wiley is managing event logs to not only observe, but to understand the reasons behind the availability and reliability of its systems, says Mack. “Monitoring has become passé,” he says. What he needs is “observability, meaning the ability to ask questions and get answers. Monitoring may show you the latency (of systems) every second but the question I want to ask is ‘Why is one user in Timbuktu having a problem?’”
Getting to root causes. AIops is also useful for speeding the root cause analysis of problems, helping to determine “At what layer of the service map does (the problem) exist—at the browser, in the database, in the code (or) is it an on-premise network issue?” says Elliot. Wiley correlates data from all layers of the application stack, including database and application performance and how users experience its applications and services, and has used Dynatrace and other tools to drive a 40% reduction in mean time to resolve issues. “This means serious improvements in performance for our customers,” Mack says.
Several customers warned that AIops requires configuration and often won’t produce short-term cost reductions. “You won’t see upfront savings” during the implementation phase, says Schirrmacher. “The benefit is largely down the road when you need fewer employees to manage your growing environment, to run it optimally, no longer need to schedule staff for late-night updates or to resolve outages, or to schedule updates around holidays.”
Where AIops falls short
Dealing with data shortcomings. The more data, and higher quality data, a machine learning algorithm has the better it can understand and analyze the workings of a complex IT infrastructure. The lack of such data, or limits on which data an AIops platform can leverage, can limit the effectiveness of AIops, making proper data management a crucial element of successful AIops.
“Our early AIops efforts struggled because vendors couldn’t live up to their promise to accept our ‘messy’ data and use it to identify anomalies and problems within the IT infrastructure,” says Danske Bank’s head of service reliability and observability, Vilius Ellikas. Danske Bank “sees high potential” in its use of the StackState observability platform to “automatically aggregate, correlate, and tag data so our systems can see which infrastructure components support which applications and services,” he says. This helps the bank “get the basics right before we get to the magic of machine learning.”
Notified, which uses a cloud-based infrastructure to provide communication and hosting for corporate events and communications, is running its first AIops proof of concept using the AIops capabilities in Splunk and New Relic, says CTO Thomas Squeo. While AIops is useful for speeding root cause analysis and event aggregation, he says, Notified is still aggregating the historical performance data necessary for predicting the amount of cloud resources it needs for large-scale events such as investor relations conferences.
Consolidating the required operational data about its infrastructure was important for AmerisourceBergen. “One of our top pain points was having siloed environments looking at their set of tools and areas they supported rather than the overall view,” says Stuart. “Now that we have all the data centrally located, our AIops engine can correlate alerts from different sources, allowing AmerisourceBergen team members to quickly focus on the core issue. By correlating all the data into a single location, we can start identifying patterns that are early warning signs of trouble brewing.”
Automated remediation. Fully automated remediation of security, performance, or other problems is another area where AIops can fall short of vendor promises. “AIops is dramatically under-delivering if customers want a ‘magic box’ that can instantly and continuously find problems and suggest the ideal remedy for them,” says Gartner Inc. Senior Research Director Gregory Murray.
Some risks, such as the exploitation of a previously unknown security vulnerability, are difficult or impossible to predict, he says. “It is also impossible for any AI system to evaluate all of the combinations of changes to the IT infrastructure and reliably predict the effect of those changes.”
“Some IT organizations are starting to chip away at what they’re comfortable auto-remediating,” says Elliot. “In some cases, it is the bursting of new services or new infrastructure” to prevent performance degradation when transaction loads or needs spike, while in others it may be automatically moving services to a different AWS region or a different set of resources.
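The “chip away” approach amounts to automating only narrow, whitelisted actions behind explicit guardrails, and escalating everything else to a human. A minimal sketch of such a rule; the thresholds, caps, and action strings are illustrative assumptions, not any organization’s actual policy:

```python
# Guarded auto-remediation: only one low-risk action is automated,
# scaling out a service under sustained load, and it is capped so a
# misfiring rule cannot grow the footprint without bound. Anything
# the rule cannot safely handle is paged to the on-call engineer.

MAX_REPLICAS = 10      # hypothetical safety ceiling
CPU_THRESHOLD = 0.85   # hypothetical trigger level

def remediate(service: str, cpu_utilization: float, replicas: int) -> str:
    """Return the action to take for one service, as a plain string."""
    if cpu_utilization > CPU_THRESHOLD and replicas < MAX_REPLICAS:
        return f"scale {service} to {replicas + 1} replicas"
    if cpu_utilization > CPU_THRESHOLD:
        # At the ceiling: do NOT auto-remediate further, escalate.
        return f"page on-call: {service} at capacity ceiling"
    return "no action"
```

Keeping the automated surface this small is what lets teams expand it gradually as confidence grows, rather than trusting a system to evaluate every possible change, which, as Murray notes above, no AI system can reliably do.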
Notified is currently performing automated remediation on only 20% to 25% of the application portfolio “…on a risk-adjusted basis,” says Squeo.
Culture shift ahead
For some, AIops is less a standalone discipline than one more tool for agile IT and business processes. IDC calls it “IT operations analytics,” and at Notified, “We don’t use the term AIops,” says Squeo. “We use the term ‘devsecops,’ which assumes the existence of good monitoring, notification, and event practices and taking advantage of AIops as part of the overall cooperation between development and operations and security.”
At Wiley, AIops is part of a broader move to give more responsibility for application and service quality to the teams developing them. “We take a devops approach (to) our reliability and management,” says Mack. “Ultimately, accountability is (with) the teams building the systems” who have the most at stake in how they perform in production.
Stuart predicts AIops will eventually facilitate “a team-wide cultural shift, where automation becomes the focus” rather than manually responding to problems as they occur. “As we mature, the focus will be on viewing the environment from a service perspective that will combine application and infrastructure components with business drivers.”