
Proactive Problem Management – a new perspective

Proactive problem management is defined as “activities to detect future problems and incidents”. This statement can have multiple interpretations. In this article, we will interpret proactive problem management through the lens of statistics.

Symptoms and causes

Incidents are the symptoms of an underlying problem. If we treat only the symptoms, they can keep recurring. For example, if a person is running a high fever, giving them Panadol® will treat the symptom and provide immediate relief. However, if the cause is a bacterial infection, the patient needs to take an antibiotic to cure the underlying infection. In the IT Service Management world, administering Panadol® is analogous to resolving the incident. Investigating the symptom and providing a medication to cure the infection is problem management.

Understanding variations

In order to understand proactive problem management, we need to understand “process variations”. Let us consider a process that has a single output value.

Figure 1 Constant Output

In the above figure the output is constant. In a natural system we rarely get such a “constant” output. For example, the incident volume will not be constant between two consecutive days. The resolution time of an incident varies from one incident to another. Any natural activity will have its inherent variations.

Figure 2 Output variations

In this chart you can see variations. It is a natural behavior of a process to exhibit variations. One day you may get 500 incidents and the next day it can be 550. Can we say that a “stable process will exhibit variations”?

Maybe! What about if the incident volume jumps to 1000? Do we still call the process, “stable”?

Introducing “common cause” and “special cause”

The answer to that question lies in understanding the types of causes. The “common cause” of variations is responsible for the natural pattern of fluctuations. It is possible to reduce these fluctuations, but one cannot eliminate them completely. For example, routine requests such as password resets and email issues occur continually. To some extent, the volume and the impact of these events are predictable.

The special cause of variations is an unnatural pattern. A special cause of variation is created by a non-random event that leads to an unexpected change in the process output. The effects are intermittent and unpredictable. In our example, if the incident volume jumps to 1000, it could be due to a business-critical server going down. It is not a predictable event and will not occur frequently.

How do we know a particular data point is caused by a “special cause”?

Introducing Control Chart

That question leads us to a tool: the “control chart”. Let us consider the average resolution time of incidents as the output variable. The Service Level Agreement (SLA) states that incidents should be resolved within 5 hours. A typical control chart is shown in Figure 3 below:

A control chart shows two categories of “limits” – Control Limits and Specification Limits. The Specification Limits can be derived from the SLA.

In the above example, the service desk performance is within the Service Levels. The Upper Specification Limit (USL) is set at 5 hours. The Upper Control Limit (UCL) is around 4 hours. We see that there is one point above the UCL that can be investigated. It could have been caused by a special cause, for example unplanned leave taken by a few service desk officers on that particular day.

Control charts visually show the data points that are out of control due to special causes. (Please note that there are different ways of identifying “out of control” data points. Refer to [1] and [2] in the “Further Reading” section.)
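To make the limits concrete, here is a minimal Python sketch of one common way to compute them: the individuals (XmR) chart, which estimates the spread from the average moving range between consecutive points. The daily figures, the names control_limits and resolution_hours, and the choice of this chart type are assumptions for illustration only. The chart in this article may have been built differently, and [1] and [2] describe other detection rules.

```python
from statistics import mean

def control_limits(values):
    """Return (center, lcl, ucl) for an individuals (XmR) control chart.

    Assumes at least two data points. Sigma is estimated from the average
    moving range divided by 1.128 (the d2 constant for ranges of two points).
    """
    center = mean(values)
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    sigma = mean(moving_ranges) / 1.128
    return center, center - 3 * sigma, center + 3 * sigma

# Hypothetical daily average resolution times, in hours (not real data).
resolution_hours = [2.4, 2.6, 2.5, 2.3, 2.7, 2.5, 2.4, 2.6, 4.5, 2.5, 2.6, 2.4]
USL_HOURS = 5.0  # Upper Specification Limit, taken from the 5 hour SLA

center, lcl, ucl = control_limits(resolution_hours)

# Points outside the control limits are candidates for a special cause.
out_of_control = [i for i, v in enumerate(resolution_hours) if v < lcl or v > ucl]

# Points above the specification limit would be SLA breaches.
breaches = [i for i, v in enumerate(resolution_hours) if v > USL_HOURS]

print(f"center={center:.2f} h, LCL={lcl:.2f} h, UCL={ucl:.2f} h")
print("out-of-control day indexes:", out_of_control)
print("SLA breach day indexes:", breaches)
```

With these sample figures one day sits above the UCL while still being inside the SLA, which is the situation described above: the service levels are met, yet there is a point worth investigating.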

Problem Classification

The problem management activities can be classified based on the type of causes identified through the control chart.

Problem Management Type        Causes            Control Chart Data
Reactive Problem Management    Special Causes    Out of control data
Proactive Problem Management   Common Causes     In control data

Table 1 Problem Classification
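The mapping in Table 1 can be sketched in a few lines of Python. This is only an illustration: the function name problem_management_type, the control limits, and the sample values are all hypothetical, and in practice the limits would come from your own control chart.

```python
def problem_management_type(value, lcl, ucl):
    """Map a data point to a problem management type as in Table 1.

    In control data points to common causes (proactive);
    out of control data points to special causes (reactive).
    """
    in_control = lcl <= value <= ucl
    if in_control:
        return "Proactive Problem Management"
    return "Reactive Problem Management"

# Assumed control limits in hours, for illustration only.
LCL_HOURS, UCL_HOURS = 1.3, 4.0

# Hypothetical average resolution times per day, in hours.
samples = {"Mon": 2.4, "Tue": 4.5, "Wed": 2.6}

classification = {
    day: problem_management_type(hours, LCL_HOURS, UCL_HOURS)
    for day, hours in samples.items()
}

for day, kind in classification.items():
    print(day, "->", kind)
```

A single in control point is not a problem in itself. The sketch only shows which stream of analysis a data point feeds into.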

If an organization wants to implement a Problem Management process, one logical starting point is to address the special causes of variation. For example, major incidents are symptoms of special causes. If there are too many major incidents, we know that the operational environment is not stable.

Once the major incidents are under control, the organization can start focusing on proactive problem management. Proactive problem management focuses on the common causes, which are more difficult to identify and improve.

Continual Service Improvement

The first step in implementing Continual Service Improvement is to understand the process performance. A control chart is a tool that gives an indication of the process performance.

Let us analyze the chart in Figure 3, and see how it can translate to improvement.

We need to eliminate the “out of control” point. We would initiate a Root Cause Analysis and systematically analyze the different causes that contribute to the “out of control” data. Assuming we can remove the “out of control” data by eliminating the underlying causes, the chart might look like Figure 4 in a future reporting period.

Figure 4 Eliminating Special Causes

Please note that all the data points are within the Upper Control Limit of 4 hours. The service provider can consider revising the Service Levels from 5 hours to 4 hours, which is a tangible improvement for the customer.

The next step is to focus on the common causes of variation. Please note that the average has not changed as a result of eliminating the special cause in Figure 4. When the organization starts to analyze and reduce the common causes, the average resolution time improves. Figure 5 shows the improvement in average time due to the reduction of common causes. The average resolution time is reduced from 2.5 hours to 2 hours. This means that the organization has improved its resolution time consistently.

Figure 5 Reducing Common Causes
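The shift of the average between two reporting periods can be checked with a short calculation. The sketch below uses hypothetical figures chosen to match the 2.5 hour and 2 hour averages mentioned above. The function name center_line_change and both data sets are assumptions for illustration.

```python
from statistics import mean

def center_line_change(before, after):
    """Return (before_mean, after_mean, reduction_pct) for two periods."""
    before_mean = mean(before)
    after_mean = mean(after)
    reduction_pct = (before_mean - after_mean) / before_mean * 100
    return before_mean, after_mean, reduction_pct

# Hypothetical average resolution times in hours (not real data).
period_before = [2.4, 2.6, 2.5, 2.3, 2.7, 2.5]  # special causes already removed
period_after = [1.9, 2.1, 2.0, 1.8, 2.2, 2.0]   # after reducing common causes

before_mean, after_mean, reduction_pct = center_line_change(period_before, period_after)

print(f"Average moved from {before_mean:.1f} h to {after_mean:.1f} h "
      f"({reduction_pct:.0f}% lower)")
```

Comparing the center line, rather than only counting SLA breaches, is what reveals this kind of improvement, since both periods are fully within the 5 hour service level.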


Many organizations believe that the ultimate goal of a service provider is to meet the Service Levels. Unfortunately, this binary, black-box approach does not drive any improvement. To drive improvement, the organization needs to understand the process performance and improve it through proactive problem management. In this article, we have introduced a new perspective on Proactive Problem Management and a tool for taking the first step.

Further Reading:

  1. http://en.wikipedia.org/wiki/Western_Electric_rules
  2. http://asq.org/learn-about-quality/data-collection-analysis-tools/overview/control-chart.html
  3. http://en.wikipedia.org/wiki/Control_chart
  4. http://en.wikipedia.org/wiki/Common-cause_and_special-cause

About Murali Ramakrishnan

Murali is the Managing Director of the boutique consulting firm "Process-Symphony". Process-Symphony specializes in IT enabled business process orchestration. http://www.process-symphony.com.au http://www.kloudax.net.au

