A robust root cause analysis (RCA) process is an essential tool for any effective reliability or maintenance improvement program. As Phil Burge, Country Communication Manager at SKF, explains, it can provide useful insights for machine designers too.
When a machine fails in service, operators and maintenance teams have two choices: they can fix the immediate symptoms, start the equipment running again and cross their fingers; or they can take a closer look at the problem, try to understand the underlying issues that led to its occurrence, and take steps to ensure they are not repeated. With increasing pressure to improve equipment reliability, eliminate unplanned downtime and reduce maintenance costs, the latter approach is clearly the preferred option, but true root cause analysis (RCA) is notoriously complex, time-consuming and data-hungry. To make RCA part of their standard continuous improvement practices, companies need to master the right skills, process and tools.
Root cause analysis is based on the theory that every failure stems from three types of cause: physical or technical causes; human causes, such as errors of omission or commission; and latent or organisational causes that stem from the organisation's systems, operating procedures and decision-making processes. To identify those causes, the RCA procedure includes seven basic elements: problem identification; problem understanding; data collection; root cause identification; root cause elimination; monitoring; and evaluation. Let us look at each in turn.
1. Identifying the problem
If a problem is perceived as normal, it never gets fixed. So the first step in finding the root cause of an issue is to give it a name. Problems do not need to be as obvious as a sudden, unplanned stoppage. A company might be equally interested in understanding the cause of a quality issue, of excessive energy consumption, or simply of the failure of its equipment or process to perform in line with wider industry benchmarks.
2. Understanding the problem
Context is critical in RCA. While the problem under investigation might manifest as a failure in a single machine component or process variable, the underlying cause might lie elsewhere entirely. Therefore the investigating team needs to ensure it has a full understanding of the equipment or process under investigation. Graphical tools such as process flow diagrams or spider charts can be used to help visualise the system, aiding the identification and discussion of probable causes.
3. Data collection
Robust RCA is built on fact-based decision-making, so teams need the relevant data at their fingertips. That data can come from a variety of sources. If a problem is repeated, or intermittent, they will want to look at the operating conditions leading up to the failure: had the equipment recently been maintained, for example, or was it running in a particular configuration or with particular operators? Graphical techniques, from histograms to scatter plots, can help companies explore the data they collect to identify factors associated with the issue in question. Inspection of faulty components can reveal a lot about the underlying causes of failure. Bearings, for example, can exhibit surface markings that can be a tell-tale indicator that wear was caused by failure in lubrication, stray electrical currents, or problems with installation.
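As a rough illustration of the kind of data exploration described above, the sketch below compares failure rates across two candidate factors. The run log, the factor names and all values are invented for illustration; a real investigation would draw them from maintenance and production records.

```python
from collections import defaultdict

# Hypothetical run log: (configuration, shift, failed?).
# Every value here is invented purely for illustration.
RUNS = [
    ("A", "day", False), ("A", "day", False),
    ("A", "night", True), ("A", "night", False),
    ("B", "day", True), ("B", "day", True),
    ("B", "night", True), ("B", "night", False),
]


def failure_rate_by(runs, field_index):
    """Return {factor value: fraction of runs with that value that failed}."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for run in runs:
        key = run[field_index]
        totals[key] += 1
        if run[2]:  # third field records whether the run ended in failure
            failures[key] += 1
    return {key: failures[key] / totals[key] for key in totals}


by_config = failure_rate_by(RUNS, 0)  # group by machine configuration
by_shift = failure_rate_by(RUNS, 1)   # group by operator shift
print("Failure rate by configuration:", by_config)
print("Failure rate by shift:", by_shift)
```

In this made-up data set the failure rate differs sharply between configurations but not between shifts, which would point the team towards the configuration as a factor worth investigating. An association of this kind is a lead to follow up, not proof of cause.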
4. Root cause identification
Multiple causes can lead to the same effect, and identifying the most likely causes, or combination of causes, is a key goal of RCA. Here again, companies can make use of a range of methods, from the simple to the highly sophisticated. The 'five whys' technique is a surprisingly powerful approach, allowing a team to move back quickly through a problem from end symptom to underlying issue. For more complex problems, the fishbone or Ishikawa diagram is an effective way of graphically connecting multiple possible causes to a single end effect. For really complex systems, companies are increasingly adopting advanced tools, such as statistical regression, Bayesian networks or artificial neural networks, to find the most likely causes for the issues they observe. SKF, for example, uses a Bayesian network to support bearing-failure or damage investigations.
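To give a flavour of the probabilistic reasoning behind such tools, the sketch below applies Bayes' rule once to three mutually exclusive candidate causes. This is the simplest special case, not a full Bayesian network, and it is not a description of SKF's system. The cause names echo the bearing example above, but every probability is an invented placeholder.

```python
def posterior(priors, likelihoods):
    """Apply Bayes' rule over mutually exclusive candidate causes.

    priors:      {cause: P(cause)} before looking at the evidence
    likelihoods: {cause: P(evidence | cause)}
    Returns {cause: P(cause | evidence)}.
    """
    joint = {cause: priors[cause] * likelihoods[cause] for cause in priors}
    total = sum(joint.values())
    if total == 0:
        raise ValueError("evidence is impossible under every candidate cause")
    return {cause: value / total for cause, value in joint.items()}


# Hypothetical prior beliefs about why a bearing failed (assumed values).
PRIORS = {"lubrication": 0.5, "electrical": 0.2, "installation": 0.3}

# Hypothetical chance of seeing one particular surface marking under each cause.
MARKING_LIKELIHOOD = {"lubrication": 0.1, "electrical": 0.8, "installation": 0.2}

result = posterior(PRIORS, MARKING_LIKELIHOOD)
most_likely = max(result, key=result.get)
print(most_likely, result)
```

With these placeholder numbers, the observed marking shifts the most likely cause from lubrication, the highest prior, to stray electrical current. A Bayesian network extends the same idea to many interdependent causes and observations.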
5. Root cause elimination
Once they understand the most likely root cause of an issue, companies can put steps in place to eliminate it. Typically, this will involve multiple actions across technical, human and organisational aspects of their processes. If insufficient lubrication was the root cause of a bearing failure, for example, a sustainable fix may include changes in operator training, maintenance and inspection procedures to prevent subsequent failures, together with appropriate oversight to ensure the agreed procedures are followed.
6. Monitoring
With its mitigation measures in place, the RCA team needs to monitor the situation to ensure its solution is the right one. Ideally, they do not want to wait for another failure to prove them wrong. Usually, however, the improved understanding of the issue provided by the RCA creates opportunities for early identification of the conditions that ultimately led to the failure. Installing condition monitoring equipment, such as vibration or temperature sensors, can, for example, allow the first signs of wear or misalignment in rotating equipment to be spotted.
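One simple way to turn sensor readings into an early warning is to compare them with a baseline recorded while the machine was known to be healthy. The sketch below is a minimal illustration of that idea, assuming a threshold of the baseline mean plus three standard deviations. The readings, the units and the factor of three are all assumptions for illustration, and commercial condition monitoring systems use considerably more sophisticated analysis.

```python
from statistics import mean, stdev


def alarm_threshold(baseline, k=3.0):
    """Threshold set k sample standard deviations above the healthy mean."""
    return mean(baseline) + k * stdev(baseline)


def first_alarm_index(readings, threshold):
    """Index of the first reading above the threshold, or None if there is none."""
    for index, value in enumerate(readings):
        if value > threshold:
            return index
    return None


# Hypothetical vibration velocity readings in mm/s (invented values).
HEALTHY_BASELINE = [2.0, 2.1, 1.9, 2.0, 2.2, 1.8, 2.0, 2.1]
NEW_READINGS = [2.0, 2.1, 2.3, 2.6, 3.1, 3.8]

threshold = alarm_threshold(HEALTHY_BASELINE)
alarm_at = first_alarm_index(NEW_READINGS, threshold)
print(f"threshold = {threshold:.2f} mm/s, first alarm at reading {alarm_at}")
```

Raising `k` reduces false alarms at the cost of later detection, so the right value depends on how costly a missed warning is compared with an unnecessary inspection.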
7. Evaluation
The final, and sometimes ignored, step in the RCA process is to ensure that the organisation is applying the lessons learned appropriately. This is an issue for management as much as for the team involved directly in the RCA, since it involves looking at business issues as well as technical ones. Should the same mitigation approach be applied elsewhere? Does the reduced cost of failures outweigh the additional cost of the chosen mitigation measures?
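The cost question can be framed as simple arithmetic. The sketch below is one hedged way of setting it out; the function name, its parameters and the figures are hypothetical, and a real evaluation would also consider factors such as lost production, safety and the capital cost of the fix.

```python
def net_annual_benefit(failures_before, failures_after,
                       cost_per_failure, annual_mitigation_cost):
    """Annual saving from avoided failures minus the yearly cost of mitigation."""
    avoided_failures = failures_before - failures_after
    return avoided_failures * cost_per_failure - annual_mitigation_cost


# Hypothetical figures: four failures a year reduced to one, each failure
# costing 10,000, with mitigation measures costing 12,000 a year.
benefit = net_annual_benefit(
    failures_before=4,
    failures_after=1,
    cost_per_failure=10_000,
    annual_mitigation_cost=12_000,
)
print(f"Net annual benefit: {benefit}")
```

A positive result suggests the mitigation pays for itself under the stated assumptions; a negative one suggests looking for a cheaper fix or accepting the residual risk.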
Lessons for machine designers
This is also the point where machine designers have the chance to learn. Could modifications to existing or future machines prevent a recurrence of the same issue? This can often be the case even where a failure was caused by errors in operation or maintenance, as the potential for such issues can often be reduced through design changes – for example, by facilitating easier access for inspection or maintenance, simplifying operating procedures or equipping machines with improved self-monitoring capabilities.
Go to www.skf.co.uk to find out more.