What Are the Primary ITIL Major Incident Management Roles and Responsibilities?

Posted on November 26, 2018

Major Incident Management

The ITIL® framework is the leading global standard for IT Service Management (ITSM). In its most recent edition, ITIL contains 26 distinct processes and four functions, organized into the five stages of the IT service lifecycle. There are ITIL processes to help organizations strategize about what services they will offer, design services effectively, build and deploy services, operate services, and, finally, continually improve the services the organization has chosen to deploy.

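While we're anxiously awaiting ITIL 4, ITIL v3 and the subsequent 2011 edition contained five volumes that each correspond to a single stage of the service lifecycle: Service Strategy, Service Design, Service Transition, Service Operation, and Continual Service Improvement.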

Within the Service Operation volume, ITIL organizations can find information about the four functions of ITIL, including the all-important Service Desk that exists to facilitate the Incident Management process. ITIL defines an incident as an unplanned interruption to, or reduction in the quality of, an IT service, and incidents are typically reported to and managed by the IT organization through a service desk.

In this guide, we're focusing on one of the most important sub-processes of Incident Management: the management of major incidents, or Major Incident Management. We'll explain how major incidents are defined in ITIL and how IT organizations work to resolve them, and we'll review the most important ITIL Major Incident Management roles and responsibilities.

What Is Major Incident Management?

The goal of the overall Incident Management process is to effectively manage the lifecycle of all incidents and to restore IT services for users or customers as quickly as possible when an interruption takes place. Incident Management comprises nine sub-processes that work together to ensure that the IT organization handles incidents efficiently. Our present focus is on Major Incident Management, but it helps to keep in mind that it operates alongside the other sub-processes of Incident Management.

Major Incidents challenge Incident Managers to effectively notify and coordinate resources and then deploy them to resolve a problem within an extremely short time frame. While the majority of reported incidents are resolved by 1st- or 2nd-level tech support, major incidents often require additional resources to ensure a timely resolution.

How Does ITIL Qualify a Major Incident?

Based on our examination of the sub-processes that make up Incident Management, we can make some simple inferences about Major Incident Management and how IT organizations handle their highest-priority tickets. We know that incidents are logged and categorized based on their urgency, so IT organizations regularly rely on 1st-level technicians to correctly identify high-priority incidents. We also know that incident monitoring and escalation are ongoing processes, so a 1st-level technician can escalate issues that can't be resolved on the first call or that may require additional resources.

For the IT organization to initiate its Major Incident Management process, there must be some criteria for designating an incident as "major." In fact, the ITIL framework includes an incident priority matrix that Incident Managers can use to organize and prioritize how the IT organization responds to incidents. The incident priority matrix assigns a rating of high, medium, or low to each incident across two separate dimensions: urgency and impact.

High-urgency incidents are those for which the damage caused can increase rapidly, or which prevent staff from completing time-sensitive work. Situations where immediate action can prevent a minor incident from becoming a major incident are also considered urgent, as are outages that affect one or more VIP users. Here, the idea of urgency means that the organization can derive significant benefits from addressing the issue sooner rather than later.

Incidents are also assessed for their impact on the organization. A high-impact service outage is one that affects a large number of staff and may prevent some of them from doing their jobs. High-impact incidents can cost the company thousands or even tens of thousands of dollars (or more), and the outage could damage the reputation of the business itself.

The impact and urgency ratings are combined to assign each incident a priority level, commonly between one and five. Priority 1 incidents are considered critical: the IT organization aims to respond to them immediately and rectify them within one hour. In contrast, priority 5 incidents are a very low priority: the IT organization will act on them within 24 hours and aim for a resolution within one week. Three-level priority schemes are also common.
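The priority matrix described above can be sketched in a few lines of Python. This is only an illustration, not a formula taken from ITIL: it assumes the widely used convention that priority equals impact plus urgency minus one (with high = 1, medium = 2, low = 3), and the names `Level`, `priority`, and `TARGETS` are hypothetical. Only the response targets for priorities 1 and 5 come from the text; the others are deliberately left out.

```python
from enum import IntEnum


class Level(IntEnum):
    """Rating used on both axes of the matrix (lower value = more severe)."""
    HIGH = 1
    MEDIUM = 2
    LOW = 3


def priority(impact: Level, urgency: Level) -> int:
    """Map impact and urgency ratings to a priority from 1 (critical) to 5.

    Assumption: priority = impact + urgency - 1. Individual organizations
    may fill in the cells of their matrix differently.
    """
    return int(impact) + int(urgency) - 1


# Targets stated in the article for the two extremes. Priorities 2-4 are
# omitted because the article does not give targets for them.
TARGETS = {
    1: {"respond": "immediately", "resolve_within": "1 hour"},
    5: {"respond": "within 24 hours", "resolve_within": "1 week"},
}
```

Under this assumed convention, a high-impact, high-urgency incident gets priority 1, while a low-impact, low-urgency incident gets priority 5.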

Many IT organizations define additional criteria for identifying major incidents and responding appropriately. It is useful to designate certain groups of services, applications, or infrastructure components as business-critical and to trigger the Major Incident Handling process when one of these components becomes unavailable and the estimated time to recover the service is exceedingly long or even unknown.
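As a rough sketch of how such additional criteria might be encoded, the Python below flags an outage as a major incident when a business-critical component is unavailable and its recovery time is either unknown or longer than a threshold. The component names, the four-hour threshold, and the `Outage` and `triggers_major_incident` names are all hypothetical; each organization would define its own.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional

# Hypothetical set of business-critical components.
BUSINESS_CRITICAL = {"email", "payments", "core-network"}

# Hypothetical cut-off for an "exceedingly long" recovery time.
MAX_ACCEPTABLE_RECOVERY = timedelta(hours=4)


@dataclass
class Outage:
    component: str
    available: bool
    estimated_recovery: Optional[timedelta]  # None means "unknown"


def triggers_major_incident(outage: Outage) -> bool:
    """Return True if the outage meets the criteria described above."""
    if outage.available or outage.component not in BUSINESS_CRITICAL:
        return False
    if outage.estimated_recovery is None:
        return True  # recovery time is unknown
    return outage.estimated_recovery > MAX_ACCEPTABLE_RECOVERY
```

For example, `triggers_major_incident(Outage("email", False, None))` returns `True`, because a business-critical service is down with no recovery estimate.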

Major incidents often share the same characteristics as the critical priority 1 incidents described above. They typically affect many customers at once, often affect several VIP customers, are costly to customers or to the business, and may affect the company's reputation. In addition, major incidents are characterized by the large amount of time and effort that is likely to be required to manage and resolve them.

What Is the ITIL Major Incident Process Flow?

ITIL suggests a relatively simple process flow for diagnosing and managing major incidents within the IT organization.

  1. The incident is first reported.
  2. Incident Logging and Categorization takes place—if the incident is a major incident, it will likely be assigned a high rating for both urgency and impact on the organization.
  3. The incident is escalated to 2nd-level support.
  4. The Incident Manager is notified that technical support staff believe a major incident has taken place.
  5. The Incident Manager forms a Major Incident Team (MIT) made up of IT managers and technical experts, many from within the company but some potentially from outside. The team works together to resolve the incident as quickly as possible.
  6. Once a workaround is discovered, the incident may be reported to problem management for future investigation and to develop a permanent solution.
  7. Data is captured from the Major Incident Management process and used to drive continuous improvement throughout the organization's Incident Management practices.

This simple process flow helps to ensure that major incidents are diagnosed early, escalated quickly to the top of the IT organizational chart, and acted on to ensure a prompt resolution. For this to happen, it is important that 1st-level technical staff diagnose and escalate major incidents quickly and don't waste valuable time trying to resolve large and complex incidents themselves.
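The seven steps of the process flow can be modeled as an ordered sequence of states. The sketch below is a simplified illustration under stated assumptions: the `Step` and `MajorIncident` names are hypothetical, and the hand-off to problem management is treated as a mandatory step even though the flow describes it as something that may happen.

```python
from enum import Enum, auto
from typing import List


class Step(Enum):
    """The seven steps of the flow, in order."""
    REPORTED = auto()
    LOGGED_AND_CATEGORIZED = auto()
    ESCALATED_TO_2ND_LEVEL = auto()
    INCIDENT_MANAGER_NOTIFIED = auto()
    MAJOR_INCIDENT_TEAM_FORMED = auto()
    HANDED_TO_PROBLEM_MANAGEMENT = auto()
    REVIEW_DATA_CAPTURED = auto()


# Iterating over an Enum yields members in definition order.
FLOW: List[Step] = list(Step)


class MajorIncident:
    """Tracks how far a single incident has moved through the flow."""

    def __init__(self, summary: str) -> None:
        self.summary = summary
        self.history: List[Step] = [Step.REPORTED]

    @property
    def current(self) -> Step:
        return self.history[-1]

    def advance(self) -> Step:
        """Move to the next step; raise if the flow is already complete."""
        index = FLOW.index(self.current)
        if index + 1 >= len(FLOW):
            raise ValueError("process flow already complete")
        self.history.append(FLOW[index + 1])
        return self.current
```

Keeping the full `history` rather than only the current step leaves a record that could feed the data-capture step at the end of the flow.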

In a major incident, service level breaches are highly probable. IT organizations must demonstrate their ability to efficiently resolve major incidents and maintain service level agreements.

What Are the ITIL Major Incident Management Roles and Responsibilities?

Under ITIL, four separate roles are allocated accountability and responsibility during the major incident handling process. Below, we detail the ITIL Major Incident Management roles and responsibilities associated with each of these job titles.

Role of 1st-Level Technical Support

First-level support technicians are the primary point of contact for incident reports within the IT organization. Typically, they staff the IT Service Desk, taking incident reports from users and customers, registering and categorizing them, and making an immediate effort to restore the affected service as quickly as possible.

When 1st-level support cannot rectify a service outage within an acceptable time frame, the incident is escalated to expert technical support groups (2nd-level support). First-level support technicians may be responsible for doing the actual work of restoring an IT service when a major incident occurs, but they aren't the ones responsible for coordinating the major incident team.

Role of an Incident Manager

The Incident Manager takes full ownership and accountability for the Incident Management process within the IT organization, including all major incidents that are reported and must be resolved. Once a major incident is escalated by 1st- or 2nd-level technical staff, the Incident Manager should determine what resources and expertise are required to resolve the incident and set about forming a Major Incident Team that can resolve the issue as quickly as possible.

Role of a Major Incident Team

The role of the MIT in addressing major IT outages is to restore service as quickly as possible using all available resources. The size and composition of the team will depend on the magnitude and nature of the service outage and the specific expertise and action steps required to restore service.

The team can include IT managers from other departments outside the Service Desk, including staff normally responsible for other processes like Change Management. In addition, 1st- and 2nd-level technical support staff, IT operators within the organization, and even third-party technical specialists from outside the company are typically involved. Together, the team develops and implements a strategy to restore services as quickly as possible.
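One way to picture how team composition follows from the expertise an outage requires is the short Python sketch below. It is purely illustrative: the `StaffMember` type, the skill labels, and both functions are hypothetical, and a real Incident Manager would weigh availability, seniority, and other factors that are not modeled here.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class StaffMember:
    name: str
    role: str  # e.g. "2nd-level support" or "IT operator"
    skills: Set[str] = field(default_factory=set)


def form_major_incident_team(
    staff: List[StaffMember], required: Set[str]
) -> List[StaffMember]:
    """Select every staff member who has at least one required skill."""
    return [person for person in staff if person.skills & required]


def uncovered_skills(team: List[StaffMember], required: Set[str]) -> Set[str]:
    """Return required skills that nobody on the team has.

    A non-empty result suggests that outside specialists may be needed.
    """
    covered = set().union(*(person.skills for person in team))
    return required - covered
```

If `uncovered_skills` returns a non-empty set, that gap corresponds to the case where third-party specialists are brought in from outside the company.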

Role of an IT Operator

IT operators perform daily operational activities within the IT organization, such as installing equipment in the data center, backing up data and maintaining servers, and ensuring that scheduled tasks are performed. IT operators are valued for their familiarity with the company's IT infrastructure and operations, and they may be used as a source of extra labor when the Incident Manager forms a Major Incident Team to address a major service outage.

ITSM Software an Asset for Major Incident Management

IT organizations can increase their efficiency of service delivery by adopting a software-based ITSM solution that supports ITIL best practices. Cherwell's IT Service Management tool suite offers a robust IT service desk that supports compliance with ITIL throughout the Major Incident Handling sub-process of Incident Management. With full support for Incident and Request Management through an intuitive service portal, your IT organization will be able to receive, categorize, and quickly resolve Major Incidents. Cherwell ITSM software is frequently complemented by specialized alerting and coordination products for managing organizational aspects of major incidents.

Additionally, security incidents are often categorized as major incidents, especially if they may pose a financial threat to the organization or threaten its reputation. Cherwell's Information Security Management System (ISMS) tool suite offers an added layer of protection that promises to minimize the impact of security events and improve incident response. Features like automated assessments help your organization anticipate and mitigate risk while managing security and Service Desk through an integrated dashboard.

Get started today by requesting a product demo and see what Cherwell can do for your organization.
