Artificial intelligence was once a magical concept, the stuff of science fiction. Now, after decades of research and commercialization, it’s just another foundational tool to keep the enterprise stack running.
Nowhere is this more evident than in the world of DevOps, a data-rich, back-office practice that presents a perfect sandbox for exploring the power of artificial intelligence. The teams in charge of operations now have a burgeoning collection of labor-saving and efficiency-boosting tools and platforms on offer under the acronym AIOps, all of which promise to apply the best artificial intelligence algorithms to the work of maintaining IT infrastructure.
AIOps is among the better use cases for artificial intelligence. Servers and networks generate petabytes upon petabytes of data. We know when processes start and stop, surge and ebb, often down to the millisecond. RAM and CPU demands are often well-understood and so are the prices for renting hardware in the cloud. All are often calculated down to six or seven significant digits. Creating an autonomous car may mean struggling with a world filled with pedestrians, livestock, and shadows, but when it comes to IT infrastructure, everything is already digitized and ready for analysis.
Some of the simplest tasks for AIOps involve speeding up the way software is deployed to cloud instances. All the work that DevOps teams do can be enhanced with smarter automation capable of watching loads, predicting demand, and even starting up new instances when the hordes descend.
Good AIOps tools generate forward-looking guesses about machine load and then watch to see if anything deviates from these estimates. Anomalies might be turned into alerts that generate emails, Slack posts, or, if the deviation is large enough, pager messages. A good part of the AIOps stack is devoted to managing alerts and ensuring that only the most significant problems turn into something that interrupts a meeting or a good night’s sleep.
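To make the escalation idea concrete, here is a minimal Python sketch of how a deviation from a forecast might be mapped to a notification channel. It is an illustration only, not any vendor's implementation: the function name `route_alert`, the channel labels, and the threshold values are all assumptions chosen for the example.

```python
from typing import Optional


def route_alert(predicted: float, observed: float,
                email_at: float = 0.2,
                chat_at: float = 0.5,
                page_at: float = 1.0) -> Optional[str]:
    """Pick a notification channel from the relative deviation.

    The thresholds are hypothetical fractions of the predicted load;
    real tools tune these per metric and per team.
    """
    if predicted <= 0:
        raise ValueError("predicted load must be positive")
    deviation = abs(observed - predicted) / predicted
    if deviation >= page_at:
        return "pager"   # large enough to wake someone up
    if deviation >= chat_at:
        return "chat"    # e.g. a Slack post
    if deviation >= email_at:
        return "email"
    return None          # within tolerance, stay quiet


print(route_alert(predicted=100, observed=105))
print(route_alert(predicted=100, observed=250))
```

Only the largest deviations reach the pager, which is the kind of filtering that keeps minor blips from interrupting a meeting.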
These methods for watching for unusual levels of activity are sometimes deployed to bolster security, a more challenging task, which puts some AIOps tools in the purview of both security watchdogs and the DevOps team.
Sophisticated AIOps tools also offer “root cause analysis,” which creates flowcharts to track how problems can ripple through the various machines in a modern enterprise application. A database that’s overloaded will slow down an API gateway that, in turn, freezes a web service. These automated catalogs of the workflow can often help teams spot the real problem faster by documenting and tracking the chains of troublemaking.
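One simple way to picture this kind of analysis is as a walk over a dependency graph. The sketch below, in Python, assumes a hand-written map of which service depends on which, and treats an unhealthy service as a likely root cause when none of its own dependencies are unhealthy. The service names and the `find_root_causes` helper are hypothetical, and commercial tools use far richer models than this.

```python
def find_root_causes(depends_on, unhealthy):
    """Return unhealthy services whose direct dependencies are all healthy.

    depends_on maps a service to the services it calls.
    unhealthy is the set of services currently showing problems.
    """
    roots = []
    for service in sorted(unhealthy):
        upstream = depends_on.get(service, [])
        # If something this service relies on is also failing,
        # the service is probably a victim rather than the cause.
        if not any(dep in unhealthy for dep in upstream):
            roots.append(service)
    return roots


# Hypothetical topology mirroring the example in the text.
depends_on = {
    "web_service": ["api_gateway"],
    "api_gateway": ["database"],
    "database": [],
}
failing = {"web_service", "api_gateway", "database"}
print(find_root_causes(depends_on, failing))
```

With all three services misbehaving, only the database is reported, since the gateway and the web service each depend on something that is already failing.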
Many of the tools in this survey are built on monitoring systems with a long history. They began as tools that tracked events in complex enterprise stacks and have now been extended with artificial intelligence. A few of the tools began in AI labs and grew outwards. In either case, anyone evaluating these platforms will want to look at the range of connectors that gather data. All offer a basic set of pathways to collect raw data, but some connectors are better than others, so check how well each AIOps offering integrates with your particular databases and services.
Here are 10 of the leading AIOps tools simplifying the job of keeping enterprise IT infrastructure humming.
AppDynamics is a division of Cisco that specializes in performance monitoring. It has added machine learning to its flagship platform to watch for metrics that diverge from the historical baseline. The system can build a flowchart and learn how events can cascade until system failure, thereby helping identify root causes. AppDynamics promotes correlating these metrics with hard “business outcomes” such as sales numbers, and it encourages a “self-healing mentality” for its platform by providing links that can automate the resolution of common failures.
BigPanda focuses on both detecting strange behavior and orchestrating the teams assigned to solve it. Its eponymous platform offers root cause analysis and event detection that integrates with the major cloud providers. Its “Level-0 Automation” handles the workload that comes after a problem appears. BigPanda simplifies the workflow by creating tickets, sending out alerts, and even starting up virtual “war rooms” for serious issues.
Datadog recently added the Watchdog module to its performance management tool so DevOps teams can ask for automated warnings when performance begins to fail. The tool builds performance forecasts based on historical records adjusted for season and time of day. Changes in metrics such as latency, RAM consumption, or network bandwidth can trigger alerts if they depart from norms. The tool is integrated with Datadog’s security detection system, and it can work with virtual machines, cloud instances, and also serverless functions.
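A baseline adjusted for time of day can be sketched in a few lines of Python. This is a generic illustration of the idea, not Datadog's algorithm: the helper names, the hour-of-day bucketing, and the tolerance of three standard deviations are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean, pstdev


def build_baseline(history):
    """Group past readings by hour of day and summarize each bucket.

    history is an iterable of (hour, value) pairs.
    Returns {hour: (average, standard deviation)}.
    """
    by_hour = defaultdict(list)
    for hour, value in history:
        by_hour[hour].append(value)
    return {hour: (mean(vals), pstdev(vals)) for hour, vals in by_hour.items()}


def is_anomalous(baseline, hour, value, tolerance=3.0):
    """Flag a reading that strays too far from the norm for that hour."""
    if hour not in baseline:
        return False  # no history for this hour, so no basis to judge
    avg, spread = baseline[hour]
    return abs(value - avg) > tolerance * spread


# Hypothetical latency readings: busy at 9 a.m., quiet at 3 a.m.
history = [(9, 100), (9, 110), (9, 90), (3, 10), (3, 12), (3, 8)]
baseline = build_baseline(history)
print(is_anomalous(baseline, hour=9, value=105))
print(is_anomalous(baseline, hour=3, value=105))
```

The same reading of 105 is unremarkable during the morning rush but stands out in the middle of the night, which is why a single fixed threshold tends to be either too noisy or too lax.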
Dynatrace is a broad, full-featured monitoring tool for tracking cloud-based VMs, containers, and serverless solutions. It sucks up log files, event reports, and other triggers to deliver what it calls “precise, AI-powered answers.” The core is called Davis, a deterministic AI that constructs flowcharts and trees so that it can pinpoint the root cause of any anomaly or failure. If it’s properly configured, it can run autonomously by triggering changes that should fix the cause. The fix could be as simple as rebooting an instance, but it might happen without waiting for a human to get in the loop.
Most AIOps tools are designed to help software that’s already up and running. GitHub Copilot starts earlier in the process, helping when code is first being written. The tool watches what a programmer types and makes suggestions for how to complete it. It was trained on a gazillion lines of open source code so these ideas are grounded in some form of reality. There are still somewhat philosophical questions about who is the ultimate author of the new code, whether the AI can be trusted, and whether the millions of open source coders out there deserve some kind of credit or hat tip for assistance. The answer may be “perhaps.” A bigger question is how well Copilot understands your code and whether it really does much better than autocomplete. The answer probably varies.
IBM Watson Cloud Pak for AIOps
IBM created the “Watson Cloud Pak for AIOps” by integrating its general Watson brand AI with its larger cloud presence. The tool brings automated root cause analysis to the data collected from the cloud monitoring software. When the events reach a configurable level of severity, they can trigger either basic alerts or more automated responses from the toolchain. IBM has integrated the results with its other Cloud Paks, which provide Network, Business, and some Robotic Process Automation.
LogicMonitor calls its AI “LM Intelligence.” It bundles a root cause detector with an alert system based on dynamic thresholds adjusted from historical data. Its early warning system depends on a forecasting module that extends this historical data to compute thresholds on latency, bandwidth, and other metrics. LogicMonitor prioritizes reducing “alert fatigue” to help teams focus their efforts on truly anomalous behavior. The data collectors tap into the major clouds and watch compute resources (Kubernetes, containers, etc.), network traffic, and storage systems (databases, buckets, etc.).
Moogsoft is a specialized AI engine that integrates with major performance monitoring tools such as New Relic, Datadog, AWS CloudWatch, and AppDynamics. If your stack is running something different, such as open source or in-house solutions, Moogsoft professes the desire to integrate with “anything, anywhere and anytime.” The product moves the data through a pipeline that de-duplicates events, enriches them with contextual data from other sources, and then correlates the data before raising an alarm. The clustering algorithms and historical records help reduce the noise and produce more useful reports of problems.
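The three pipeline stages described here (de-duplicate, enrich, correlate) can be sketched in Python as follows. This is a toy illustration, not Moogsoft's implementation: the event fields (`host`, `check`), the `host_to_service` lookup, and the choice to correlate by grouping on service name are all assumptions, standing in for the clustering a real product would use.

```python
from collections import defaultdict


def deduplicate(events):
    """Keep the first event for each (host, check) pair."""
    seen = set()
    unique = []
    for event in events:
        key = (event["host"], event["check"])
        if key not in seen:
            seen.add(key)
            unique.append(event)
    return unique


def enrich(events, host_to_service):
    """Attach the owning service to each event from a lookup table."""
    return [{**e, "service": host_to_service.get(e["host"], "unknown")}
            for e in events]


def correlate(events):
    """Group events by service so related ones raise a single alarm."""
    groups = defaultdict(list)
    for event in events:
        groups[event["service"]].append(event)
    return dict(groups)


def run_pipeline(events, host_to_service):
    return correlate(enrich(deduplicate(events), host_to_service))


# Hypothetical raw events, including one exact repeat.
raw = [
    {"host": "db1", "check": "cpu"},
    {"host": "db1", "check": "cpu"},
    {"host": "db2", "check": "disk"},
    {"host": "web1", "check": "latency"},
]
services = {"db1": "database", "db2": "database", "web1": "frontend"}
for service, grouped in run_pipeline(raw, services).items():
    print(service, len(grouped))
```

Four raw events collapse into two candidate incidents, one per service, which is the sort of noise reduction the pipeline is meant to deliver.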
New Relic One
New Relic added an AI engine to its performance monitoring tool, One, which tracks all events ingested, including those from other tools such as Splunk, Grafana, and AWS’s CloudWatch. The tool can be configured with flexible levels of sensitivity for events of varying severity. You can tell New Relic that, for instance, a low-priority error should raise an alarm only if it occurs several times over fifteen minutes. But a high-priority event like a crashed server will generate a pager alert immediately. The issue log tracks all events and includes a Correlation Decision report that lays out the logical steps taken by the AI en route to raising an alarm.
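A rule of this shape can be modeled with a sliding time window. The Python sketch below is a generic illustration and does not use New Relic's configuration syntax or API: the `AlertRule` class, the priority labels, and the choice of three occurrences for "several" are assumptions. It also assumes timestamps arrive in non-decreasing order, measured in seconds.

```python
from collections import deque


class AlertRule:
    """Alert at once on high priority; on low priority only after repeats."""

    def __init__(self, min_count=3, window_seconds=15 * 60):
        self.min_count = min_count            # assumed meaning of "several"
        self.window_seconds = window_seconds  # fifteen minutes
        self._recent = deque()                # timestamps of low-priority events

    def should_alert(self, priority, timestamp):
        if priority == "high":
            return True  # e.g. a crashed server pages immediately
        self._recent.append(timestamp)
        # Discard occurrences that have aged out of the window.
        while self._recent and timestamp - self._recent[0] > self.window_seconds:
            self._recent.popleft()
        return len(self._recent) >= self.min_count


rule = AlertRule()
for t in (0, 300, 600):
    print(t, rule.should_alert("low", t))
```

The first two low-priority errors stay silent and the third, arriving inside the same fifteen-minute window, raises the alarm.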
Splunk began as a tool for gathering log files and building a comprehensive reporting tool for tracking performance, identifying anomalies, and helping the team diagnose problems. The product integrates informational graphics with a deep indexing tool to catalog the events. Artificial intelligence and machine learning algorithms within Splunk can anticipate problems and understand their source. These algorithms track all of the services integrated with Splunk to find the root causes. The machine learning features are deeply integrated with the platform so that service engineers skilled at tracking performance can leverage the best machine learning without much additional training. They can track the historical performance and watch for divergence through the main dashboard.