Proactive problem management is defined as “activities to detect future problems and incidents”. This statement can have multiple interpretations. In this article, we will interpret proactive problem management through the lens of statistics.
Symptoms and causes
Incidents are the symptoms of an underlying problem. If we treat only the symptoms, the incidents can keep recurring. For example, if a person is running a high fever, giving them Panadol® will treat the symptom and provide immediate relief. However, if the cause is a bacterial infection, the patient needs an antibiotic to cure the underlying infection. In the IT Service Management world, administering Panadol® is analogous to resolving the incident. Investigating the symptom and providing medication to cure the infection is analogous to problem management.
In order to understand proactive problem management, we need to understand “process variations”. Let us consider a process that has a single output value.
Figure 1 Constant Output
In the above figure the output is constant. In a natural system we rarely get such a “constant” output. For example, the incident volume will not be constant between two consecutive days. The resolution time of an incident varies from one incident to another. Any natural activity will have its inherent variations.
Figure 2 Output variations
In this chart you can see variations. It is a natural behavior of a process to exhibit variations. One day you may get 500 incidents and the next day it can be 550. Can we say that a “stable process will exhibit variations”?
Maybe! But what if the incident volume jumps to 1000? Do we still call the process “stable”?
Introducing “common cause” and “special cause”
The answer to that question lies in understanding the types of causes. The “common cause” of variation is responsible for the natural pattern of fluctuations. It is possible to reduce these fluctuations, but one cannot eliminate them completely. For example, routine requests such as password resets and email issues occur continually. To some extent, the volume and the impact of these events are predictable.
The “special cause” of variation produces an unnatural pattern. It is created by a non-random event that leads to an unexpected change in the process output. The effects are intermittent and unpredictable. In our example, if the incident volume jumps to 1000, it could be due to a business-critical server going down. It is not a predictable event and will not occur frequently.
How do we know a particular data point is caused by a “special cause”?
Introducing Control Chart
That question leads us to a tool called the “control chart”. Let us consider the average resolution time of incidents as the output variable. The Service Level Agreement (SLA) states that incidents should be resolved within 5 hours. A typical control chart is shown below:
A control chart shows two categories of “limits” – Control Limits and Specification Limits. The Specification Limits can be derived from the SLA.
In the above example, the service desk performance is within the Service Levels. The Upper Specification Limit (USL) is set at 5 hours. The Upper Control Limit (UCL) is around 4 hours. We see one point above the UCL that should be investigated. It could have been caused by a special cause, for example, unplanned leave taken by a few service desk officers on that particular day.
Control charts visually show the data points that are out of control due to special causes. (Please note that there are different ways of identifying “out of control” data points; see the “Further reading” section.)
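As a rough illustration of how control limits can be derived from the data itself, the sketch below uses one common convention for charts of individual values: the center line is the mean, and the limits sit at the mean plus or minus 2.66 times the average moving range. This is only one of the possible rules for flagging “out of control” points, and the daily resolution times, the function names, and the variable names are all hypothetical, chosen to resemble the example above.

```python
from statistics import mean

def control_limits(values):
    """Return (center, lcl, ucl) for an individuals (XmR) control chart."""
    center = mean(values)
    # Moving range: absolute difference between consecutive observations.
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    # 2.66 is the standard constant that scales the average moving range
    # to three-sigma limits for individual values.
    spread = 2.66 * mean(moving_ranges)
    return center, center - spread, center + spread

def out_of_control(values):
    """Return the indexes of points outside the control limits."""
    _, lcl, ucl = control_limits(values)
    return [i for i, v in enumerate(values) if v > ucl or v < lcl]

# Hypothetical daily average resolution times, in hours.
resolution_hours = [2.4, 2.6, 2.5, 2.3, 2.7, 2.5, 2.6, 2.4, 4.5, 2.5, 2.6, 2.4]
USL = 5.0  # specification limit taken from the SLA in the example

center, lcl, ucl = control_limits(resolution_hours)
print(f"center={center:.2f} LCL={lcl:.2f} UCL={ucl:.2f}")
print("out of control at indexes:", out_of_control(resolution_hours))
print("all within SLA:", all(v <= USL for v in resolution_hours))
```

With these sample values the UCL comes out at about 4.07 hours, so the 4.5-hour day is flagged for investigation even though it still meets the 5-hour SLA.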
The problem management activities can be classified based on the type of causes identified through the control chart.
| |Causes|Control Chart data|
|---|---|---|
|Reactive Problem Management|Special Causes|Out of control data|
|Proactive Problem Management|Common Causes|In control data|
Table 1 Problem Classification
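The classification in Table 1 can be expressed as a small rule. The sketch below is illustrative only: the function name `classify` is hypothetical, the UCL of 4 hours follows the example chart, and the LCL of 1 hour is an assumption made for the illustration.

```python
def classify(value, lcl, ucl):
    """Map one data point to the problem management type in Table 1."""
    if value > ucl or value < lcl:
        # Out of control data points to a special cause.
        return "Reactive Problem Management"
    # In control data reflects common causes only.
    return "Proactive Problem Management"

# Hypothetical limits in hours: UCL from the example chart, LCL assumed.
LCL, UCL = 1.0, 4.0
for hours in (2.5, 4.5):
    print(hours, "hours:", classify(hours, LCL, UCL))
```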
If an organization wants to implement a Problem Management process, one logical starting point is to address the special causes of variation. For example, major incidents are symptoms of special causes. If there are too many major incidents, we know that the operational environment is not stable.
Once the major incidents are under control, the organization can start focusing on proactive problem management. Proactive problem management focuses on the common causes, which are more difficult to identify and improve.
Continual Service Improvement
The first step in implementing Continual Service Improvement is to understand the process performance. A control chart is a tool that gives an indication of the process performance.
Let us analyze the chart in Figure 3, and see how it can translate to improvement.
We need to eliminate the “out of control” point. We would initiate a Root Cause Analysis and systematically analyze the different causes that contribute to the “out of control” data. Assuming we can remove the “out of control” data by eliminating the underlying causes, the chart might look like Figure 4 in a future reporting period.
Figure 4 Eliminating Special Causes
Please note that all the data points are within the Upper Control Limit of 4 hours. The service provider can consider revising the Service Levels from 5 hours to 4 hours, which is a tangible improvement for the customer.
The next step is to focus on the common causes of variation. Please note that the average has not changed as a result of eliminating the special cause in Figure 4. When the organization starts to analyze and reduce the common causes, the average resolution time improves. Figure 5 shows the improvement in average time due to the reduction of common causes. The average resolution time is reduced from 2.5 hours to 2 hours. This means that the organization has improved its resolution time consistently.
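The size of this improvement can be quantified by comparing the averages of two reporting periods. The sketch below is a minimal illustration: both data sets are hypothetical and were chosen so that their averages match the 2.5 hours and 2 hours quoted above, and `percent_reduction` is a helper name introduced here.

```python
from statistics import mean

def percent_reduction(before, after):
    """Percentage drop in the average from one period to the next."""
    return (mean(before) - mean(after)) / mean(before) * 100

# Hypothetical average resolution times (hours) for two reporting periods.
before_hours = [2.4, 2.6, 2.5, 2.3, 2.7, 2.5, 2.6, 2.4, 2.5, 2.5]
after_hours = [1.9, 2.1, 2.0, 1.8, 2.2, 2.0, 2.1, 1.9, 2.0, 2.0]

print(f"average before: {mean(before_hours):.2f} hours")
print(f"average after:  {mean(after_hours):.2f} hours")
print(f"reduction: {percent_reduction(before_hours, after_hours):.0f}%")
```

Moving the average from 2.5 hours to 2 hours is a reduction of 20 percent, which affects every incident rather than only the occasional outlier.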
Figure 5 Reducing Common Causes
Many organizations believe that the ultimate goal of a service provider is to meet the Service Levels. Unfortunately, this binary, black-box approach does not drive any improvement. To drive improvement, the organization needs to understand the process performance and improve it through proactive problem management. In this article, we have introduced a new perspective on Proactive Problem Management and a tool for taking the first step.