AIOps: Creating a Closed-Loop Support System to Streamline IT

In five short years, artificial intelligence for IT operations (AIOps) has evolved from a futuristic concept to a standard practice for enterprises that place a high value on getting ahead of the break-fix model of IT support.

AIOps proposes a solution for several sources of stress that IT operations (ITOps) teams face today. IT environments are becoming too complex to operate manually. The breadth of technology ITOps needs to embrace is expanding rapidly. Computing power is moving outside the data center to the edges of the network, and infrastructure problems must be addressed at ever-increasing speeds. Rather than try to outrun these trends, enterprises are throwing automation at the problem, using big data analytics, machine learning, and other AI technologies to help identify and resolve IT issues. Enter AIOps.

AIOps typically has four basic steps: monitoring, analysis, recommendation, and remediation. While monitoring and remediation are important bookends, the middle two steps – analysis and recommendation – form the key components IT support providers need to master to execute a successful AIOps strategy.
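
The four steps can be pictured as a simple pipeline. The sketch below is illustrative only: the metric names, thresholds, playbook entries, and function names are all hypothetical and do not describe any particular vendor's implementation. It assumes remediation is applied automatically only for systems where the customer has approved automation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    system_id: str
    metric: str
    value: float


@dataclass(frozen=True)
class Finding:
    system_id: str
    metric: str


@dataclass(frozen=True)
class Recommendation:
    system_id: str
    action: str


# Hypothetical limits and fixes, for illustration only.
THRESHOLDS = {"disk_used_pct": 90.0, "cpu_temp_c": 85.0}
PLAYBOOK = {
    "disk_used_pct": "purge old snapshots",
    "cpu_temp_c": "inspect fans and airflow",
}


def monitor(raw_rows):
    """Step 1: turn raw telemetry rows into typed readings."""
    return [Reading(sid, metric, float(value)) for sid, metric, value in raw_rows]


def analyze(readings):
    """Step 2: flag readings that exceed a known limit."""
    return [
        Finding(r.system_id, r.metric)
        for r in readings
        if r.metric in THRESHOLDS and r.value > THRESHOLDS[r.metric]
    ]


def recommend(findings):
    """Step 3: map each finding to a suggested action."""
    return [Recommendation(f.system_id, PLAYBOOK[f.metric]) for f in findings]


def remediate(recommendations, approved_systems):
    """Step 4: apply fixes only where automation is approved; queue the rest."""
    applied, pending = [], []
    for rec in recommendations:
        (applied if rec.system_id in approved_systems else pending).append(rec)
    return applied, pending


def run_cycle(raw_rows, approved_systems):
    """Run one pass of monitor -> analyze -> recommend -> remediate."""
    return remediate(recommend(analyze(monitor(raw_rows))), approved_systems)
```

Calling `run_cycle(rows, {"srv-1"})` returns two lists: recommendations applied automatically and recommendations left pending for a human decision. A real analysis step would rely on learned models rather than fixed thresholds.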

The objective is to identify emerging issues and apply corrective action across the customer base. This reduces the time from fault identification to resolution, not just for the customers who have the problem, but for all other customers who may be at risk or have yet to identify the issue themselves.

However, given the pace of change in a typical ITOps environment, IT support providers need a continual improvement cycle that can adapt in real time, based on real-world experience, to address customer challenges successfully.

To achieve this, service providers must identify best practices through recommendation adoption, deviation identification, and ultimately the definition of “known good.” With recommendations offered across a broad customer base, providers can identify potential issues along with the operational behaviors that improve IT efficiency. These elements then become the basis for preventative recommendations.
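
One way to picture “known good” and deviation identification is as a baseline derived from healthy systems. The following is a minimal sketch under stated assumptions: configurations are flat dictionaries, “healthy” means a system with no open issues, and the baseline is simply the most common value for each setting. The function names and settings are hypothetical.

```python
from collections import Counter


def known_good(healthy_configs):
    """Build a baseline: the most common value of each setting across healthy systems."""
    baseline = {}
    keys = set().union(*(cfg.keys() for cfg in healthy_configs))
    for key in keys:
        counts = Counter(cfg[key] for cfg in healthy_configs if key in cfg)
        baseline[key] = counts.most_common(1)[0][0]
    return baseline


def deviations(config, baseline):
    """Return {setting: (actual, expected)} wherever a system differs from the baseline."""
    return {
        key: (config.get(key), expected)
        for key, expected in baseline.items()
        if config.get(key) != expected
    }


# Hypothetical settings reported by three healthy systems.
healthy = [
    {"firmware": "2.3", "mtu": 9000},
    {"firmware": "2.3", "mtu": 9000},
    {"firmware": "2.1", "mtu": 9000},
]
baseline = known_good(healthy)
```

Here `deviations({"firmware": "2.1", "mtu": 9000}, baseline)` reports only the firmware setting, which could then feed a preventative recommendation. A production baseline would also weigh recommendation adoption and observed outcomes, not just frequency.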

Starting the process

This continual improvement cycle starts at the crossroads of product engineering and support, focused through the lens of elevated case management. To be successful, it needs to identify, prioritize, and eliminate issues that require human intervention.

To do this, service providers need effective telemetry monitoring, dashboarding, and data analytics capabilities to track those trends. Strong product engineering, support engineering, and data science teams are required to analyze telemetry at scale to identify new threats, prioritize them, refine rapid diagnostic capabilities, and isolate causation. AI tools help handle the volume of data and improve accuracy, ultimately allowing the predictive identification of problems before they can disrupt the customer. Customers can then be given remediation steps to solve the problems before they significantly disrupt their environment.
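
As a simple stand-in for the analysis described above, the sketch below flags systems whose telemetry value sits far from the rest of the fleet. It uses a modified z-score based on the median absolute deviation (MAD), which is a basic statistical technique and not the AI models the article refers to. The metric, system names, and cutoff are assumptions chosen for illustration.

```python
from statistics import median


def fleet_outliers(values_by_system, cutoff=3.5):
    """Return IDs of systems whose value is far from the fleet median.

    Uses the modified z-score 0.6745 * |x - median| / MAD, where the
    constant makes the score comparable to a z-score for normal data.
    """
    values = list(values_by_system.values())
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # No spread in the fleet, so nothing can be ranked as an outlier.
        return []
    return sorted(
        system_id
        for system_id, value in values_by_system.items()
        if 0.6745 * abs(value - center) / mad > cutoff
    )


# Hypothetical storage latency (ms) reported by five systems.
latency_ms = {"sys-a": 5.0, "sys-b": 5.2, "sys-c": 4.9, "sys-d": 5.1, "sys-e": 12.0}
suspects = fleet_outliers(latency_ms)
```

In this example only `sys-e` is flagged. Comparing each system against the fleet, rather than against a fixed limit, is what lets a broad installed base surface problems that a single customer would not see as abnormal.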

This starts to outline the components of a continual improvement cycle. Successful service providers need to constantly do three things: monitor the health and performance of their installed base, develop new detection models, and provide recommendations to customers. They need to be able to solve the problems of “patient zero,” the initial customer who had the problem. Because all customers are sending telemetry, providers can use pattern recognition to proactively identify and help customers who have the same risk profile before these problems impact their ITOps.
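
The “patient zero” idea can be sketched as a signature match across the installed base. This is a deliberately simplified illustration: it assumes the root cause has already been narrowed to a few configuration settings, and every name and value below is hypothetical. Real pattern recognition would typically match on behavior as well as configuration.

```python
def at_risk_systems(signature, fleet, exclude=()):
    """Return IDs of systems whose configuration matches every setting in the signature."""
    return sorted(
        system_id
        for system_id, config in fleet.items()
        if system_id not in exclude
        and all(config.get(key) == value for key, value in signature.items())
    )


# Hypothetical risk signature isolated from the patient-zero case.
signature = {"firmware": "2.1", "raid_level": "5"}

# Hypothetical configurations reported through telemetry.
fleet = {
    "cust-a/srv-1": {"firmware": "2.1", "raid_level": "5", "region": "DE"},
    "cust-b/srv-7": {"firmware": "2.1", "raid_level": "5", "region": "AU"},
    "cust-c/srv-2": {"firmware": "2.3", "raid_level": "5", "region": "US"},
    "cust-d/srv-4": {"firmware": "2.1", "raid_level": "6", "region": "DE"},
}

# Patient zero is already being handled, so it is excluded from the outreach list.
exposed = at_risk_systems(signature, fleet, exclude={"cust-a/srv-1"})
```

Only `cust-b/srv-7` shares the full risk profile here, so that customer would receive a proactive recommendation before experiencing the fault.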

Simple, common IT problems may account for 80% of incidents but cause only 20% of the pain, because IT knows how to deal with them. These issues are best served by good analytics and automation alone. The benefit of AI lies in identifying and solving the complex issues that occur more rarely (say, 20% of the time) but cause 80% of the pain when there is no AI to quickly identify and remediate them.
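
Taking the 80/20 figures above at face value (they are illustrative, not measured), a short calculation shows how much more each complex incident costs than each simple one. The variable and function names are hypothetical.

```python
# Shares taken from the 80/20 illustration in the text, in percent.
simple = {"share_of_incidents": 80, "share_of_pain": 20}
complex_ = {"share_of_incidents": 20, "share_of_pain": 80}


def pain_per_incident(category):
    """Relative pain carried by a single incident of this category."""
    return category["share_of_pain"] / category["share_of_incidents"]


# 80/20 = 4.0 for complex issues versus 20/80 = 0.25 for simple ones.
ratio = pain_per_incident(complex_) / pain_per_incident(simple)
print(ratio)  # 16.0
```

Under these assumed shares, a single complex incident carries 16 times the pain of a simple one, which is why faster identification of rare issues pays off disproportionately.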

Turnaround time is a valuable consideration. What used to take months to diagnose and fix on a large scale can now be done in days, or even hours. For example, if a customer in Germany has an issue tied to a specific configuration, how long does it take for the organization to identify the issue? How long to confirm whether it’s a unique problem, and to reactively identify and quantify that issue in other environments? How long to apply fixes proactively, or make recommendations to those environments to mitigate the impact? Finally, and crucially, what is required to avoid the risk in the first place? By using broad-based telemetry, AIOps provides a method to accelerate identification and improve recommendation accuracy.

Thinking global, acting local

Using telemetry in this way is a good example of thinking globally and acting locally. You can take all those experiences from customers, using their hardware and the provider’s services, and create a broader picture of what’s going on. You can look at what customers are doing and what issues are happening, and then use the data to drive decisions. The provider can then prioritize the risk in its customer base and take targeted action.

The objective of the approach is to get out ahead of problems and give customers insight into potential issues within their environments, along with options to avoid these risks. If problems can be preemptively identified, customers can make informed decisions and control risk.

Much of the information unearthed through an AIOps process can help customers address the problems directly. Where issues can be avoided through changes in usage, preemptive recommendations backed by factual reasoning provide IT with the mandate required to drive change. If resolution requires product enhancements, these can be entered into the product development lifecycle to address the issue, or at a minimum to enable better identification and prevention.

What does tomorrow look like?

Most of what we discussed here can be considered on a discrete system-by-system basis. You have a server, it sends its telemetry, and it receives recommendations in return. However, business success is no longer tied to monolithic systems. Interoperability between multiple systems, virtualization, applications, and user experience now defines IT. To increase agility across the board, analytics need to happen not only at a discrete system level, but also at an IT-environment level. Right across the stack, telemetry is required not only to identify new threats, but also to determine best practice. This is where AIOps is increasingly important, as it can operate with a speed and scope that a team of engineers could never match.

Applying AIOps to groups of machines and, by extension, groups of systems, and ultimately an entire customer base, offers multiple points of perspective. Smart organizations can correlate this data and apply it to the whole concept of interoperability. Separately, moving up the stack and into the application provides insights into how the application is actually engaging and interfacing with all of the products. This will enable new ways to optimize applications based not only on how one customer is using them, but on how entire customer bases are using technology globally.

Conclusion

The path to best practice will become better defined. Using fact-based analytics enabled at scale through AI will create an opportunity to build resilience into IT environments. As AIOps continues to mature, the scope of perspective will create reliable “known-good” paths for vendors and customers alike. Today, improvements in tools and data security mean that the benefits AIOps can provide weigh heavily in favor of streaming machine telemetry data, as many IT issues are becoming “optional.”

Service providers will abstract complexity from the customer and make better recommendations to increase predictability and ease of use. Development of successful AI-based solutions often relies on the collection of data. Service providers that have both access to telemetry data from a wide installed base of products and the reach of a strong support services organization will have a significant advantage.

Customers can already benefit from being part of a large-scale connected community through predictive AIOps. AIOps has come a long way in five years. Expect it to continue to develop in the years to come.

For more information please visit www.hpe.com/services/operational

____________________________________

About Duncan Goode

Duncan Goode is a worldwide services product manager for HPE Pointnext Services. His goal is to ensure a quality support experience that drives better business outcomes for customers. Duncan has worked in technology and support services for 30 years, providing leadership and innovation in a variety of roles across global support, mission-critical, and retail environments. Based in Australia, he enjoys spending time playing and coaching cricket.

About Jordan Lewy

Jordan Lewy is a Worldwide Manager for HPE Pointnext Support Services. In this role, his goal is to transform HPE’s customer support experience using HPE InfoSight, which in turn drives customers’ business outcomes and enables their digital transformation journey. Jordan brings to his position a well-established background in information technology and professional services, where he has worked for over 20 years. Prior to taking on his current role, he held other positions in HPE, including leadership for HPE’s Storage Support services, Installation and Technical services, and HPE’s Customer Technical Training business.

Copyright © 2021 IDG Communications, Inc.