How to Improve Incident Management with a DDM-Powered CMDB

Posted on February 09, 2021

Reducing the number of major incidents, improving the mean time to recover (MTTR), identifying problem root causes, and systematically improving employee experiences are top priorities for IT operational leaders.

But trying to address incidents without schematics connecting business services, applications, and infrastructure is like trying to find your way out of a forest without a map or compass. For IT operations, this schematic should include an accurate and up-to-date CMDB showing critical systems, application mappings, and service definitions.

CMDBs are notoriously inaccurate, but with an auto-discovery and dependency mapping (DDM) capability, the CMDB becomes a critical information source and tool for IT operations. It can help IT Ops reduce the number of incidents, resolve them faster, find root causes, and capture service-level metrics to justify prioritizing investments.

Improving Incident Management Was Never Easy

There’s little debate among CIOs and IT leaders that improving operational KPIs and metrics is table stakes for running responsible and credible IT organizations. It’s vital today because businesses rely on IT systems for mission-critical workflows, analytics, and customer-facing experiences.

There’s also little debate that many factors outside IT’s control impact the reliability and performance of applications and systems. But how quickly, efficiently, and accurately IT resolves incidents and addresses problem root causes is considered a critical responsibility of capable IT leaders.

While these responsibilities are critical for organizations investing in digital transformation, incident managers, heads of IT operations, and CIOs will confide that improving incident management processes and KPIs is not easy.

For one thing, system and application architectures are more complex today than ever before. Modernized applications interface with microservices, integrate with multiple third-party SaaS platforms, and process data from many data services. They run on public clouds, private clouds, and edge computing infrastructures. When an incident occurs, identifying which system is having a problem takes time, and chasing too many false positives can lead to longer recovery efforts.

Legacy systems, monolithic applications, and chatty services have their own challenges, especially since they are often dependencies of the primary business processes.

Resolving incidents quickly and efficiently requires a quick diagnosis and prescriptive actions since one issue may create a cascade of problems that need fixing. For example, if a database has a failing file system, it might corrupt database indexes and slow down applications. IT Ops are often in a situation where restoring business services requires addressing multiple issues.

The challenge is that resolving incidents faster and more accurately requires better documentation and collaboration with subject matter experts, including application developers, system engineers, and architects. When there is a major incident, incident managers often get the support needed to resolve issues and restore service.

But in general, it’s hard for incident managers to obtain ongoing collaboration from other IT teams to solve repetitive issues or review processes to improve incident resolutions. Also, addressing root causes requires investment to modernize applications and architectures, but it’s challenging to make the business case to prioritize operationally driven improvements.

DDM Automates Capturing the Current State of Cloud Infrastructure

A DDM-backed CMDB is a game-changer for incident management teams because it closes the knowledge gap between support teams and subject matter experts while providing up-to-date information about business services.

Here’s how DDM works. An agentless DDM runs on a schedule and scans the network for configuration information on the systems, storage, networks, applications, services, and databases running in private and public clouds. It then updates the CMDB with current and accurate configuration data, including changes driven by a cloud’s elastic computing capabilities or DevOps automations such as CI/CD and IaC. IT Ops can then use tools to define business services and identify the underlying system dependencies.
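To make the update step concrete, here is a minimal sketch of how one scan’s results might be reconciled into a CMDB. It is an illustration under stated assumptions, not how any particular DDM product works: the ConfigItem fields, the reconcile function, and the one-day staleness window are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ConfigItem:
    """One CMDB record. All field names here are illustrative."""
    ci_id: str
    ci_type: str
    attributes: dict
    last_seen: datetime
    stale: bool = False


def reconcile(cmdb, discovered, scan_time, stale_after=timedelta(days=1)):
    """Merge one scan's results into the CMDB (a dict keyed by ci_id).

    `discovered` maps ci_id -> (ci_type, attributes).
    Returns the ids that were added, changed, or marked stale.
    """
    added, changed, marked_stale = [], [], []
    for ci_id, (ci_type, attrs) in discovered.items():
        existing = cmdb.get(ci_id)
        if existing is None:
            cmdb[ci_id] = ConfigItem(ci_id, ci_type, dict(attrs), scan_time)
            added.append(ci_id)
            continue
        if existing.attributes != attrs:
            existing.attributes = dict(attrs)
            changed.append(ci_id)
        existing.last_seen = scan_time
        existing.stale = False
    # Items missing from recent scans (for example, scaled-in cloud
    # instances) are flagged rather than deleted, so history is kept.
    for ci_id, ci in cmdb.items():
        if ci_id in discovered or ci.stale:
            continue
        if scan_time - ci.last_seen >= stale_after:
            ci.stale = True
            marked_stale.append(ci_id)
    return added, changed, marked_stale
```

Flagging items as stale instead of deleting them is a design choice in this sketch: it keeps the record available for incidents that reference a system that has since been scaled in or retired.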

The DDM is not just an automated data collector on application and system configurations. The DDM discovers the relationships between web servers, application services, multiple API services, and database transactions. Topology maps illustrate the relationships between different system components and are diagnostic tools that IT Ops can use to understand the root cause of incidents.
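One way to picture a topology map is as a directed graph of "depends on" relationships. The sketch below assumes the discovered relationships are available as simple pairs; the component names and the impacted_by helper are hypothetical and only illustrate how such a graph can answer "what sits on top of this component?"

```python
from collections import defaultdict, deque

# Hypothetical discovered relationships: (component, what it depends on)
DEPENDS_ON = [
    ("order-portal", "apache-1"),
    ("apache-1", "tomcat-1"),
    ("tomcat-1", "postgres-1"),
    ("reporting", "postgres-1"),
]


def impacted_by(component, edges):
    """Return everything that directly or indirectly depends on `component`."""
    dependents = defaultdict(set)
    for source, target in edges:
        dependents[target].add(source)
    # Breadth-first walk "upward" from the component to its dependents
    seen, queue = set(), deque([component])
    while queue:
        for parent in dependents[queue.popleft()]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen
```

Calling impacted_by("postgres-1", DEPENDS_ON) walks up from the database and returns the application server, the web server, and both services that rely on them.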

So, the next time one or more systems generate alerts, the incident managers have a lot more information at their fingertips.

A DDM-Backed CMDB Helps Incident Managers Find Root Causes

Let’s consider a simple example of multiple alerts from a three-tiered web application running Apache web servers, Tomcat application servers, and a Postgres database on AWS. The incident manager sees warnings coming from Tomcat and the Postgres database, and several employees have opened tickets reporting slow performance and errors from the application.

A knee-jerk response to this problem might be to restart Tomcat and clear database connections, but this may not be the correct course of action. With a DDM-enabled CMDB, the incident manager and IT Ops now have several new tools to review.

  • A DDM topological view showing the systems sending out alerts
  • A CMDB view showing the impacted business services
  • The ITSM changelogs to help determine if a change caused the incident
  • A way to validate the application’s performance and flows as IT attempts to remediate the issue
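As an illustration of the changelog check in the list above, the sketch below filters change records down to those that touched an affected system shortly before the incident began. The record keys ('id', 'ci', 'applied_at') and the 24-hour lookback are assumptions made for the example, not fields from any specific ITSM tool.

```python
from datetime import datetime, timedelta


def recent_changes(changes, affected_cis, incident_start,
                   lookback=timedelta(hours=24)):
    """Return changes that touched an affected CI shortly before the incident.

    `changes` is a list of dicts with hypothetical keys
    'id', 'ci', and 'applied_at' (a datetime).
    """
    window_start = incident_start - lookback
    hits = [
        change for change in changes
        if change["ci"] in affected_cis
        and window_start <= change["applied_at"] <= incident_start
    ]
    # Most recent change first, since it is usually the first to rule out
    return sorted(hits, key=lambda change: change["applied_at"], reverse=True)
```

An empty result does not prove that no change caused the incident, only that none was recorded against the affected systems within the window.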

In this incident, IT Ops uses the DDM’s topology maps to see that a client is running a long-running admin job against the Postgres database. Restarting Tomcat or shutting down services would not have addressed the issue. Instead, the correct action is to pause the database admin job and resume it during off-peak hours.

The key here is that the incident manager directed the correct action and quickly deduced the issue using the DDM’s flow maps. Had IT Ops followed a prescriptive playbook and restarted the servers, they might have interrupted major business services.
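The reasoning in this example can be approximated with a simple heuristic: among the components raising alerts, look first at those that have no alerting dependency beneath them. The sketch below assumes the dependency map is available as a dictionary; the likely_origins function and the component names are hypothetical, and the heuristic is a starting point for investigation rather than a full root-cause analysis.

```python
def likely_origins(alerting, depends_on):
    """Return alerting components with no alerting dependency beneath them.

    `alerting` is a set of component names currently raising alerts.
    `depends_on` maps a component to the components it calls.
    """
    def has_alerting_dependency(node, seen):
        for dep in depends_on.get(node, ()):
            if dep in seen:
                continue  # already visited; also guards against cycles
            seen.add(dep)
            if dep in alerting or has_alerting_dependency(dep, seen):
                return True
        return False

    return {n for n in alerting if not has_alerting_dependency(n, {n})}


# Hypothetical topology for the three-tier application in the example
topology = {
    "apache-1": ["tomcat-1"],
    "tomcat-1": ["postgres-1"],
    "postgres-1": [],
}
origins = likely_origins({"tomcat-1", "postgres-1"}, topology)
```

With alerts from both Tomcat and Postgres, origins contains only "postgres-1", which points the investigation at the database tier before anyone restarts the application servers.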

DDM + CMDB + ITSM -> Data and Analytics to Drive Operational Changes

Solving incidents faster and more accurately is one operational benefit. But even more important, IT now has a system of record that associates incidents with the underlying systems. IT leaders can then present analytics on which business services and applications generate the most incidents or the incidents with the lengthiest outages.

That report is a critical part of the call-to-action IT Ops leaders often seek to influence priorities and investment in modernizing applications and upgrading infrastructure.
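A report like this can start as a simple aggregation over incident records. The sketch below assumes each incident carries the business service it affected plus opened and resolved timestamps; those key names and the incidents_by_service function are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def incidents_by_service(incidents):
    """Summarize incidents per business service.

    Each incident is a dict with hypothetical keys 'service',
    'opened', and 'resolved' (datetime objects).
    Returns (service, count, total_outage_minutes, mean_minutes_to_recover)
    tuples, longest total outage first.
    """
    summary = defaultdict(lambda: {"count": 0, "minutes": 0.0})
    for incident in incidents:
        row = summary[incident["service"]]
        row["count"] += 1
        duration = incident["resolved"] - incident["opened"]
        row["minutes"] += duration.total_seconds() / 60
    report = [
        (service, row["count"], row["minutes"], row["minutes"] / row["count"])
        for service, row in summary.items()
    ]
    return sorted(report, key=lambda entry: entry[2], reverse=True)
```

Sorting by total outage time puts the services with the strongest case for investment at the top; sorting by count instead would highlight the noisiest services.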

The key is IT Ops having up-to-date and accurate information in the CMDB and using a DDM’s automation to capture dependencies. Connecting ITIL processes, especially incident management, to that information enables IT Ops to improve operational KPIs and employee experiences. The added context relating incidents with business services can help drive longer-term improvements and investments.

For organizations looking to improve employee experiences, integrating a DDM-powered CMDB provides IT Ops with contextual data and a versatile tool for resolving incidents faster and more accurately.

Isaac Sacolick, President of StarCIO, guides companies through smarter, faster, innovative, and safer digital transformation programs that deliver business results. He is the author of the Amazon bestseller, Driving Digital: The Leader’s Guide to Business Transformation through Technology, industry speaker, and blogger at Social, Agile, and Transformation.
