When Does an Incident Turn Into a Problem?
Posted on February 20, 2019
Matt Klassen is the vice president of product marketing at Cherwell. He is passionate about enabling enterprises to accelerate their digital journey through better software and better service. Matt has 25 years of experience in developing, architecting, selling, and marketing enterprise software solutions for IT and product teams.
When organizations start looking at compliance with the ITIL® best practices for IT service management, the most common first steps are the establishment of a service desk and processes for Incident Management and Request Fulfillment. A service desk provides a single point of contact between the IT organization and the business, and the associated processes serve to minimize business interruptions and restore service after outages with a demonstrable ROI.
As IT organizations grow, they begin to implement additional processes that are supported by the service desk model. One process in particular, known as Problem Management, plays a key role in allowing the IT organization to diagnose and address the underlying causes of reported incidents and to proactively reduce the number of incident tickets it receives.
Problems are often discovered by IT operators involved in daily Incident Management, making that process an important contributor to successful Problem Management efforts. But when does an incident turn into a problem? Which incidents should be reported to Problem Management, and why?
To answer these questions, let's take a closer look at the Problem Management process itself and the connection between Incident Management and Problem Management in ITIL.
What Is an Incident?
In ITIL, an incident is described as an IT service outage that causes a business interruption. A reduction in the quality of an IT service could also be considered an incident under ITIL. The goal of the ITIL Incident Management process is to restore service levels as quickly as possible following an incident, thus minimizing any negative impacts to the business.
IT organizations use a service desk as a central point of contact for incident management. If an incident occurs that causes a service interruption for a user or customer, the customer can contact the IT organization and report the incident using the service desk. Incidents that have been reported are logged and categorized according to their urgency, severity, and type, and the IT operator attempts to restore the service as quickly as possible.
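As a rough illustration of logging and categorizing an incident by urgency, severity, and type, the sketch below models a ticket and orders a queue so the most pressing tickets come first. The `Incident` class, the three-level scale, and the priority formula are all assumptions made for this example; they are not part of ITIL or of any particular service desk product.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical three-step scale used for both urgency and severity."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class Incident:
    ticket_id: str
    summary: str
    incident_type: str  # e.g. "application", "network" (illustrative categories)
    urgency: Level
    severity: Level

    @property
    def priority(self) -> int:
        # Assumed scoring rule: 1 is the most pressing ticket, 5 the least.
        return 7 - (self.urgency + self.severity)


def triage(queue: list[Incident]) -> list[Incident]:
    """Return incidents ordered so the highest-priority tickets come first."""
    return sorted(queue, key=lambda incident: incident.priority)
```

A real service desk would typically derive priority from a configurable matrix rather than a fixed formula; the point here is only that categorization at logging time determines the order in which service is restored.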
It should be noted that the service desk also manages request fulfillment. Requests from customers do not reflect a service outage, and are therefore not typically associated with Problem Management.
What Is a Problem?
A problem is the unknown cause of one or more incidents. While incidents must be resolved as quickly as possible once reported, problems require investigation and root cause analysis and may not constitute an emergency. A problem may not be causing a business interruption at any given moment, but if left unresolved, it will continue to precipitate incidents and business interruptions.
To understand the significance of a problem in ITIL, it is useful to explore the notion of cause and effect. When it comes to diagnosing and understanding service outages, problems are the cause and incidents are the effect.
To say it differently, incidents are the "what" and problems are the "why."
If an employee reports the inability to access an application, that's a business interruption that should be resolved through the Incident Management process. If the application server is crashing all the time and generating multiple incident reports, you may want to investigate the underlying cause—what is happening to make the server crash? Once you're investigating the causes of incidents rather than simply trying to restore service via a workaround or the most convenient method, you're into Problem Management territory.
How Are Incidents Escalated to Problems?
Imagine that you work as an auto mechanic servicing a fleet of rental vehicles. One day, a customer comes into your shop and reports that the air conditioner in their vehicle is malfunctioning. It’s the middle of summer, so it needs to be repaired before the vehicle can go back into service. The cooling failure is an incident: a service interruption that prevents the operator from using the vehicle until it can be repaired. Throughout the next month, more vehicles come in from the rental lot with air conditioner failures. Eventually, you realize there’s a problem: an underlying cause of air conditioning malfunctions within the fleet. You report the incidents and later find that a recall was issued on malfunctioning fan belts for vehicles in the rental fleet. Your company addresses the problem and reduces the number of cooling failures in time for next summer.
Similarly, an IT operator should have the capacity to report incidents to Problem Management when they occur multiple times across the organization under similar conditions. If an incident is reported many times, the organization can realize gains in efficiency by resolving the underlying problem that leads to these incidents. The IT organization must conduct a root cause analysis to determine the source of the problem before assessing whether it can be fixed and whether it is cost-effective to fix it.
Some incidents seem to repeat themselves despite successful troubleshooting. An IT operator may resolve an issue for a customer after a short troubleshooting session, only for the same issue to present itself the following day, sometimes for the same customer and sometimes for someone else. Intermittent errors—those that seem to appear sporadically or unpredictably—can pose a significant challenge even for experienced IT operators. There is often an underlying cause of these errors that can be diagnosed and addressed via the Problem Management process.
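The idea of escalating incidents that recur under similar conditions can be sketched as a simple grouping rule. In the example below, "similar conditions" is approximated by matching category and affected configuration item, and the thresholds (three incidents in 30 days) are arbitrary assumptions; `IncidentRecord` and `problem_candidates` are hypothetical names, not part of any ITSM tool.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class IncidentRecord:
    ticket_id: str
    opened: date
    category: str     # e.g. "application crash"
    affected_ci: str  # the service or asset involved


def problem_candidates(records, min_count=3, window_days=30, today=None):
    """Group recent incidents by (category, affected_ci) and return the
    groups that reach min_count within the look-back window."""
    today = today or date.today()
    cutoff = today - timedelta(days=window_days)
    groups = defaultdict(list)
    for record in records:
        if record.opened >= cutoff:
            groups[(record.category, record.affected_ci)].append(record.ticket_id)
    # Only groups that recur often enough are worth raising with Problem Management.
    return {key: ids for key, ids in groups.items() if len(ids) >= min_count}
```

Each returned group is only a candidate: an analyst would still need to judge whether the incidents truly share an underlying cause before opening a problem.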
In addition to incident and problem reporting by IT operators, there are new methods available for IT organizations that wish to leverage technology as part of their Problem Management strategy. Some IT organizations are using artificial intelligence (AI) programs to analyze incident records, searching for patterns and trends that could indicate or even diagnose a problem. Rather than relying on human intuition to precipitate reports to Problem Management, these organizations have built a service strategy that minimizes service interruptions by proactively detecting and diagnosing problems.
AI can also be used to detect and automatically report incidents, potentially allowing IT operators to resolve them before a service outage affects business revenue.
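Commercial tools use far more sophisticated models, but the underlying idea of scanning incident records for unusual trends can be illustrated with a plain statistical baseline. The sketch below is not AI; it simply flags a week whose incident count sits well above the historical average. The function name `is_spike` and the default threshold of two standard deviations are assumptions for this example.

```python
from statistics import mean, pstdev


def is_spike(weekly_counts, threshold=2.0):
    """Return True when the latest week's incident count is more than
    `threshold` standard deviations above the average of earlier weeks.

    Expects a non-empty list of counts, oldest week first.
    """
    *history, latest = weekly_counts
    if len(history) < 2:
        return False  # not enough history to form a baseline
    baseline = mean(history)
    spread = pstdev(history)
    if spread == 0:
        # Perfectly flat history: any increase at all stands out.
        return latest > baseline
    return (latest - baseline) / spread > threshold
```

Run per incident category, a check like this could prompt a proactive report to Problem Management before operators notice the pattern themselves.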
While many problems are discovered by IT operators during Incident Management activities, there are additional ITIL processes that can feed into Problem Management. For example, IT organizations use the Demand Management and Capacity Management processes to ensure that services are available during times of peak demand. Many incident requests are connected to poor service availability and could be addressed by proactively reporting capacity issues to Problem Management.
What Happens in the Problem Management Process?
As we've seen, there are many ways that problems can be reported to the Problem Management process. Now let's look at how IT organizations diagnose problems using the Problem Management process of ITIL. The goal of this process is to manage the lifecycle of all problems, to prevent incidents from happening through proactive management of problems, and to minimize the impact of incidents that cannot be prevented from occurring.
The Problem Management process comprises the seven sub-processes listed below:
Proactive Problem Identification - Organizations engage in proactive problem identification to identify and solve problems or find suitable workarounds to known problems before an incident is created which leads to a business interruption. Some problems may be easy and inexpensive to fix, while the organization can establish workarounds for problems that can't readily be addressed.
Problem Categorization and Prioritization - Problems that are discovered must be recorded and prioritized by an application analyst or technical analyst. The report includes information like the suspected root cause of the problem, what incidents have been caused by the problem, and the urgency needed in finding a resolution. Effective categorization ensures that problems are resolved while effectively maintaining service levels across the IT organization.
Problem Diagnosis and Resolution - Diagnosis and resolution is the core process of Problem Management. This is where technical analysts use their investigative skills to isolate the root cause of a problem and resolve it. Analysts must conduct a root cause analysis before determining and initiating the most economically appropriate solution, whether that means fixing the problem itself or establishing a workaround to minimize the impact of future incidents.
Problem and Error Control - Problem and error control is a monitoring process whereby the IT organization keeps track of outstanding problems and their progression through the Problem Management sub-processes. This monitoring ensures that corrective actions can be taken if the delayed resolution of a problem will have a negative impact on service levels.
Problem Closure and Evaluation - Once a problem has been diagnosed and resolved, a full Problem Record must be created that contains a complete historical description from the time it was first reported through to resolution. Problem Records can be fed into the Knowledge Management process, as they contain valuable data that your organization can use to drive continual service improvement. In addition, Problem Records are used to update the IT organization's Known Error Database (KEDB), which lists all known errors that have impacted services.
Major Problem Review - When a major problem is resolved, IT analysts and managers may conduct a Major Problem Review to ensure that the problem does not recur and that important lessons are learned for the future. In addition, a review ensures that problems marked as resolved have actually been eliminated and will not precipitate further incidents in the future.
Problem Management Reporting - This process helps to ensure that other service management staff and processes, as well as IT managers, are informed and aware of current outstanding problems, their processing status, and of any existing workarounds. This process ensures that IT operators who run the service desk have access to the most up-to-date information that can be used to resolve incidents for customers.
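To make the relationship between Problem Records, linked incidents, and workarounds more concrete, here is a minimal in-memory sketch of a Known Error Database. The class names, fields, and status values are hypothetical and chosen only for illustration; a real ITSM platform would store these records in its own schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ProblemRecord:
    problem_id: str
    description: str
    status: str = "open"  # assumed values: "open", "known error", "closed"
    root_cause: Optional[str] = None
    workaround: Optional[str] = None
    linked_incidents: list[str] = field(default_factory=list)


class KnownErrorDatabase:
    """Toy store that lets the service desk look up problems and workarounds."""

    def __init__(self) -> None:
        self._records: dict[str, ProblemRecord] = {}

    def add(self, record: ProblemRecord) -> None:
        self._records[record.problem_id] = record

    def link_incident(self, problem_id: str, ticket_id: str) -> None:
        self._records[problem_id].linked_incidents.append(ticket_id)

    def workaround_for(self, ticket_id: str) -> Optional[str]:
        """Return the workaround of the first problem linked to this incident, if any."""
        for record in self._records.values():
            if ticket_id in record.linked_incidents:
                return record.workaround
        return None

    def outstanding(self) -> list[ProblemRecord]:
        """Problems still being tracked, as used in error control and reporting."""
        return [r for r in self._records.values() if r.status != "closed"]
```

With a structure like this, an operator handling a new ticket could call `workaround_for` to restore service quickly while the linked problem is still under diagnosis.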
IT Service Management Software Facilitates the Problem Management Process
IT Service Management software provides a distinct competitive advantage for IT organizations that wish to bring their service management processes in line with the best practices recommended in ITIL. Cherwell's IT Service Management Software is the ideal service desk solution, offering out-of-the-box compliance with 11 of ITIL's core processes, including Incident Management and Request Fulfillment (core processes of the service desk), Problem Management, Knowledge Management, and more.
Cherwell's user-friendly dashboarding makes it easy for IT operators to visually track and manage key sub-processes in Incident Management, such as recording, classification, and investigation. Operators can easily log information about incidents and track them throughout their lifecycle thanks to Cherwell's automated ticketing system.
Cherwell ITSM also offers a comprehensive Problem Management toolbox where IT analysts can track the status of open problems, filter open problems by category, organize and prioritize problems for resolution, and link existing problems to reported incidents and their known workarounds. Analysts can even automate the Problem Management reporting process, sending problem notifications and updates to users with the touch of a button.
Cherwell's IT Asset Management software can also play a significant role in streamlining the Problem Management process for IT organizations. The application enables IT operators to track configuration items (CIs) for enterprise IT assets, both software and hardware. Increased visibility of IT assets ensures that IT analysts can readily discover information about assets that are linked to known problems or errors.
When Does an Incident Turn Into a Problem?
To recap, there are several cases where an IT operator might choose to report an incident to the IT organization's Problem Management process owner:
- When an incident continues to recur despite multiple successful attempts at troubleshooting
- When the same or similar incidents occur many times across similar conditions and are continuously being reported by customers
- When an incident cannot be resolved due to a suspected underlying issue
Problems can also be reported as a result of proactive investigation. This investigation could be conducted by an AI that analyzes incident management records, by IT operators involved in Demand Management or Capacity Management processes, or by a Knowledge Manager.
Problem Management is an important aspect of effective IT service management, enabling IT organizations to proactively limit the number of reported incidents and get to the root cause of even the most troublesome IT issues.
Interested in what Cherwell can do for your organization? Request a demo today!