AIOps Data, Decisions, and Automation Best Practices

August 15, 2022 By Lee Koepping ScienceLogic By Sean Applegate

This is the second of a three-part AIOps discussion with ScienceLogic’s Public Sector Principal Architect, Lee Koepping, and Swish’s CTO, Sean Applegate. The three parts are:

Insights into the Value of AIOps
AIOps Data, Decisions, and Automation Best Practices
Getting Started with AIOps

Sean:

I’m happy to welcome back Lee Koepping, the Public Sector Principal Solutions Architect for ScienceLogic. Lee is going to share his experiences and lessons learned in respect to understanding AIOps data, decisions we can drive with that data, and most importantly the automations we can implement so we can save our employees time and effort.

IT data is critical for understanding context using AIOps. Help us dive into the various types of data and how they complement each other.

Lee:

Data is king, and in AIOps it’s one of the foundational requirements by any measure. That means taking performance-focused data, your typical speeds and feeds (which are not new), as well as log data, availability data, relationship data (which is often huge), and then different perspectives or variances of that same data. So tactically, it’s taking the structural information from your environment (topology) and combining it with performance and logging data, which could have security context as well, then evaluating that data against thresholds or logic, and also using machine learning to detect anomalies and produce an actionable event. That’s really what it’s all about.
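To make that pipeline concrete, here is a minimal Python sketch of the idea: a new performance sample is checked against a static threshold and against its own recent history, and the resulting event is tagged with the services that topology says depend on the device. The device names, the `TOPOLOGY` map, the `evaluate` function, and the z-score test are all illustrative assumptions, not how the ScienceLogic platform implements it.

```python
from statistics import mean, stdev

# Hypothetical topology: device -> services that depend on it.
TOPOLOGY = {"core-sw-01": ["email", "vpn"], "db-01": ["billing"]}


def evaluate(device, metric, history, latest, threshold, z_limit=3.0):
    """Return an event dict if the latest sample breaches a static
    threshold or deviates strongly from recent history, else None."""
    reasons = []
    if latest > threshold:
        reasons.append("threshold")
    if len(history) >= 2:
        sigma = stdev(history)
        # Simple z-score check standing in for "machine learning".
        if sigma > 0 and abs(latest - mean(history)) / sigma > z_limit:
            reasons.append("anomaly")
    if not reasons:
        return None
    return {
        "device": device,
        "metric": metric,
        "value": latest,
        "reasons": reasons,
        # Topology turns a raw alert into something actionable.
        "affected_services": TOPOLOGY.get(device, []),
    }


cpu_history = [20, 22, 21, 19, 23]
event = evaluate("core-sw-01", "cpu_pct", cpu_history, 95, threshold=90)
quiet = evaluate("db-01", "cpu_pct", cpu_history, 22, threshold=90)
print(event)
print(quiet)  # None: the sample is within normal range
```

The point of the sketch is the combination: either check alone produces an alert, but it is the topology lookup that tells an operator which services are at risk.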

Sean:

That’s great. At Swish we often interact with different teams around observability and are able to connect different data sets together to provide significant value. Gartner recently published their Market Guide for AIOps Platforms, which speaks to the value of different AIOps solutions, one category being AIOps platforms. Often, using ScienceLogic as the monitor of monitors helps us further leverage the investments customers have made in other tools. It allows the data to be collectively pulled together for correlation, causation, and automation across domains of expertise. This is a unique value ScienceLogic provides to clients.

Lee:

That’s a key tenet of AIOps; it isn’t just a tool. It’s how it’s implemented, how it’s used, and where it fits in a larger ecosystem. As you know, at ScienceLogic we enjoy working with Swish because you have expertise in application performance management tools, which are very niche, purpose-built, and yield a certain type of data, in addition to your knowledge of traditional ITOM. We overlap a lot into that space of direct acquisition of data. But there is also the ITSM side of things and workflow management, which is a different data set often driven by the other two. We see ourselves as sort of a hub in the middle, to your point, often a manager of managers, to coin a phrase.

Sean:

Now that we’ve talked a bit about the data, please walk us through the types of contextual insights and detailed analytics that ScienceLogic provides for various user types. For the audience, let’s go from the top down. From business services to potentially devices.

Lee:

Well, what’s really key is to get to business services. This is where things are personal and unique to each enterprise, and it takes more of a consultancy approach, something that Swish, for example, can bring to the table as a partner. As a manufacturer, ScienceLogic creates this great AIOps platform with the ability to configure these things. Much of our data is auto-discovered and auto-derived. A lot of things happen automatically, and there are a lot of patents behind that at the end of the day. To truly be effective, an organization must visualize that data in a way that is intuitive to the organization and familiar to different people within the organization. Business services is how we do that. In Public Sector, we tend to call them mission-aligned services, but basically every organization uses business services differently.

A lot of DoD clients look at things in terms of Area of Responsibility (AOR) or command structure. Civilian agencies look at things more from an application or service perspective; often multiple applications make up a particular business service. We see that a lot at Veterans Affairs (VA), where a system that veterans interact with is really the culmination of three, four, or five applications which are loosely coupled. Each of them is independent, but all of them work together. They might represent a single business service based on dependent sub-services. So that visualization is key, because it makes it easy to navigate and to understand where issues are in the environment as well as rapidly dive into troubleshooting.
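One way to picture a business service built from dependent sub-services is as a small tree whose health rolls up from the bottom. The Python sketch below is only an illustration under that assumption: the `Service` class, the three status levels, the "worst status wins" rule, and the service names are all hypothetical, and real platforms typically use richer, weighted health calculations.

```python
from dataclasses import dataclass, field
from typing import List

# Higher number means worse health; the worst status wins in a rollup.
SEVERITY = {"healthy": 0, "degraded": 1, "down": 2}


@dataclass
class Service:
    name: str
    status: str = "healthy"  # own status, mainly meaningful for leaves
    children: List["Service"] = field(default_factory=list)

    def health(self) -> str:
        """Worst status among this node and all of its sub-services."""
        statuses = [self.status] + [c.health() for c in self.children]
        return max(statuses, key=SEVERITY.__getitem__)

    def failing(self) -> List[str]:
        """Names of leaf services that are not healthy."""
        if not self.children:
            return [] if self.status == "healthy" else [self.name]
        return [name for c in self.children for name in c.failing()]


# Hypothetical business service made of three loosely coupled applications.
portal = Service("benefits-portal", children=[
    Service("login"),
    Service("claims-api", status="degraded"),
    Service("document-store"),
])

print(portal.health())   # degraded
print(portal.failing())  # ['claims-api']
```

The top-level view answers "is the mission service healthy?", and the same structure lets someone drill straight down to the component that is causing the problem.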

The defined data that we’re ingesting adds additional context, which helps the machine learning and algorithms understand those soft relationships. Intuitively, it’s what box speaks to what box and what you can derive from that, but how those come together to support a certain mission or business function is also critical to drive additional correlation factors at the transactional level between low-level components. One of the biggest benefits of adopting an AIOps approach is understanding structure, because it drives so much insight and understanding. The process of coming up with it drives an equal amount of value, challenging an organization to think critically about how the different support teams work together, because that’s the only way you’re going to come up with that structure. It’s helpful to understand all those different aspects.

Sean:

Understanding those aspects is the first phase in a continuous improvement journey where we can leverage AIOps. Understanding the ticketing, the data behind the ticketing, how to better troubleshoot things and glue those different data points together, and then contextualize that visually with the business services or more digital dashboards for end-to-end troubleshooting is great. As continuous improvement junkies, or lazy intelligent engineers, we always want to look for a way to streamline our operations. Often asking ourselves, how do I do it faster? With less work? How do I make it easier for the lower skilled engineers to do the work? How do I remove the work altogether and completely automate it? In a perfect world, we want to automate everything, but that isn’t realistic.

What type of automations are available from ScienceLogic and how might a client integrate these into a continuous improvement effort?

Lee:

There are three main areas of automation that we focus on. The most mature is driving data from an ITOM perspective, which is largely where we live, into ITSM: taking events and making those incidents, and also populating a CMDB. This is where people probably struggle the most in the ITSM realm. They’re good at understanding their workflow and they get advanced workflow management systems that factor in incidents, problems, and knowledge management, but all of that is foundationally based on the configuration management database (CMDB). So, when there is an event or an incident, what is that on? And what does that affect? Having a very accurate CMDB is key. So one realm is populating the CMDB and automating the incidents. That’s where most people start that AIOps journey, being able to streamline that process.
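The two CMDB questions, "what is that on?" and "what does that affect?", can be shown in a few lines of Python. This is a sketch under stated assumptions: the in-memory `CMDB` dictionary, its field names, the `event_to_incident` function, and the `cmdb-review` queue are invented for illustration and do not represent any particular ITSM product’s schema or API.

```python
# Hypothetical CMDB: device name -> configuration item (CI) record.
CMDB = {
    "db-01": {"ci_id": "CI1001", "owner": "dba-team", "supports": ["billing"]},
    "core-sw-01": {"ci_id": "CI1002", "owner": "network-team",
                   "supports": ["email", "vpn"]},
}


def event_to_incident(event, cmdb):
    """Turn a monitoring event into an incident record using the CMDB."""
    summary = f"{event['message']} on {event['device']}"
    ci = cmdb.get(event["device"])
    if ci is None:
        # Device is missing from the CMDB: route for review, do not guess.
        return {"summary": summary, "ci_id": None,
                "assigned_to": "cmdb-review", "impacted": []}
    return {
        "summary": summary,
        "ci_id": ci["ci_id"],               # what is the event on?
        "assigned_to": ci["owner"],
        "impacted": list(ci["supports"]),   # what does it affect?
    }


known = event_to_incident({"device": "db-01", "message": "Disk 95% full"}, CMDB)
unknown = event_to_incident({"device": "lab-07", "message": "Unreachable"}, CMDB)
print(known)
print(unknown)
```

The second call shows why CMDB accuracy matters: without a matching record, the incident cannot be routed or scoped automatically.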

Next is probably triage. A lot of people are fearful of a system taking over the environment. But what they do want is to eliminate the redundant or often time-consuming troubleshooting. We call that event enrichment: an event triggers a series of diagnostic steps. We have a lot of it prebuilt into the system, but more importantly we allow customers to define those steps and take a low-code/no-code approach to easily add further enrichment.

For example, anyone who’s had the pleasure of calling Cisco tech support for a network issue knows you’re going to get hit with a number of questions right off the bat. Have you run this command? Could you run that command? Can you pull this log, check this setting, et cetera, depending on the event. Those are all things that could be automated. The moment the event occurs, go ahead and run five, six, seven, or eight different CLI commands, pull that data back, cleanse the data, and put it into the incident ticket. These are steps that a human could spend upwards of 20 or 30 minutes doing, if they have the access and knowledge to do it. Having all those things done nearly instantaneously is a huge automation tenet within the ScienceLogic AIOps platform.
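Event enrichment of this kind boils down to a lookup from event type to a list of diagnostic commands, plus a loop that runs them and stamps the results into the ticket. The Python sketch below assumes a hypothetical `PLAYBOOKS` mapping and takes the command runner as a parameter; `fake_run` stands in for whatever SSH or API client would actually reach the device. None of this reflects the platform’s real interfaces.

```python
from typing import Callable, Dict, List

# Hypothetical mapping of event type -> diagnostic commands to collect.
PLAYBOOKS: Dict[str, List[str]] = {
    "interface_down": ["show interfaces", "show logging", "show version"],
}


def enrich(ticket: dict, event_type: str, run: Callable[[str], str]) -> dict:
    """Run each diagnostic command and attach cleaned output to the ticket."""
    notes = []
    for cmd in PLAYBOOKS.get(event_type, []):
        try:
            output = run(cmd).strip()  # minimal "cleansing" of the output
        except Exception as exc:
            # A failed command is still useful evidence for the engineer.
            output = f"command failed: {exc}"
        notes.append({"command": cmd, "output": output})
    return {**ticket, "enrichment": notes}


def fake_run(cmd: str) -> str:
    """Stand-in for a real device connection (assumption for this sketch)."""
    return f"(output of {cmd!r})\n"


ticket = enrich({"id": "INC-1"}, "interface_down", fake_run)
for note in ticket["enrichment"]:
    print(note["command"], "->", note["output"])
```

Passing the runner in as a function keeps the playbook logic separate from device access, which is also what makes it easy to test without touching a live network.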

And lastly, and this is a bit of the utopian vision, there is self-healing. For example, when X happens, restart the service. Or do something more interactive: when X happens, tell a provisioning system to spin up an additional instance. Or load balance things: when a service migrates from host Y to host Z, automatically change a DNS setting, or something like that.

It’s IT automation in terms of engaging support and tracking support. It’s event enrichment and being able to do repetitive mundane things but do them consistently and accurately. Lastly it is to do the self-healing aspects of it and all those things can work together.

To summarize, a ticket is opened, and enrichments are run and stamped into the ticket. A change is made based on that event, and the resolved ticket is closed without a human having touched it. Clients get the benefit of restoration of service to many users, but they also get a complete audit trail in their ITSM system, which is often used to drive other decisions like lessons learned, postmortem discussions, and allocation of resources.
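That closed loop, from open to enrich to remediate to close, with every step recorded, can be sketched in Python as follows. The `handle` function, the ticket states, and the enrichment and remediation callables are hypothetical placeholders; a real deployment would call the ITSM and automation systems instead of in-memory functions.

```python
from datetime import datetime, timezone


def handle(event, enrichers, remediations):
    """Open a ticket, enrich it, attempt remediation, and keep an audit trail."""
    ticket = {"event": event, "state": "open", "audit": []}

    def log(step, detail):
        ticket["audit"].append({
            "at": datetime.now(timezone.utc).isoformat(),
            "step": step,
            "detail": detail,
        })

    log("opened", event["type"])
    for name, collect in enrichers.items():
        log("enriched", {name: collect(event)})

    action = remediations.get(event["type"])
    if action is None:
        # No self-healing rule exists, so a human picks the ticket up.
        ticket["state"] = "needs_human"
        log("escalated", "no automated remediation defined")
        return ticket

    succeeded = action(event)
    log("remediated", {"success": succeeded})
    if succeeded:
        ticket["state"] = "resolved"
        log("closed", "resolved without human intervention")
    else:
        ticket["state"] = "needs_human"
    return ticket


# Hypothetical enrichment and remediation steps for the example.
enrichers = {"uptime": lambda e: "42 days"}
remediations = {"service_stopped": lambda e: True}  # e.g. restart the service

healed = handle({"type": "service_stopped", "host": "app-01"},
                enrichers, remediations)
manual = handle({"type": "disk_full", "host": "app-02"},
                enrichers, remediations)
print(healed["state"], [a["step"] for a in healed["audit"]])
print(manual["state"], [a["step"] for a in manual["audit"]])
```

Even in the fully automated path, the audit list is what later feeds lessons learned and postmortem discussions, so it is written on every branch rather than only on success.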

Sean:

Humans are assigned to a ticket. I’ve seen that a few times over the years. Often very late at night and not always a lot of fun, often in high stress situations. If we can eliminate the war room, that’s always appreciated. If I were a federal organization and I wanted to take the first step of learning more, or seeing this in action and not just hearing about it, where can I find things like demo videos for ScienceLogic?

Lee:

We have a YouTube channel that’s emerging with a lot of great content from over the years. More recently we’ve been focusing on use cases, such as Business Service Management, IT Workflow Automation, AIOps Intelligence for IT Operations, and Modernizing and Rationalizing legacy IT monitoring tools. Our website has a lot of those videos as well as numerous white papers. Many aren’t product specific and provide an education on AIOps in general.

Eventually engage a trusted advisor partner, like Swish, to be able to start the dialogue and get an unbiased opinion. Your engineers at Swish are very adept in the public sector space and you represent a lot of partnerships with a lot of best-of-breed tools. When I was a customer, that’s where I would start, and then I would work my way back to the manufacturer. Do a little independent research along the way.

Sean:

Very much appreciate the plug for Swish. We take pride in honesty and integrity in our recommendations for clients. As you know, it doesn’t always make all the various OEM sales reps happy but representing the customer’s best interest is important to Swish. For AIOps it’s not a single tool, it’s an architecture, it’s a journey. Sometimes we must adapt to what a client’s capability sets are, or their teams’ abilities, initiatives, and budget limitations.

Lee:

Those are all key factors. If one tool did it all, we’d all work there. So, I think it’s really being able to look at a tool’s use within a best-of-breed area in a broader ecosystem. That really starts with somebody like Swish, who has a broader view of the market. We do a lot of great things within the SL1 platform, but I wouldn’t say that our platform should be used everywhere. It comes down to maturity, complexity, size, scale – all those different factors.

Sean:

It is tricky. Let’s close out today’s discussion with Lee Koepping from ScienceLogic. Thank you for joining us to talk about AIOps data, decisions, and automation best practices. Our next blog post is focused on Getting Started with AIOps. If you like what you’ve read so far, go see article three and you can jump into getting started.

Author

Lee Koepping ScienceLogic

Author

Sean Applegate

Chief Technology Officer

Sean Applegate serves as Chief Technology Officer for Swish, where he leads innovation strategy, solutions, and services. Sean is passionate about delivering life cycle services to ensure clients realize maximum value for technology investments.