How Root Cause Analysis can help address reliability issues (and beyond) at their source
By Tony Freeman, Principal Analyst, Risk Analysis & Mitigation
Whether you’re dealing with a missed cyber security patch, a breaker tripping offline, or even an issue within your own home, new problems that need to be corrected will always come up. But jumping in to fix a problem doesn’t always address the “root” of the issue, whether you are working in the electric industry or in any other business or home setting. Determining the source of an issue through a root cause analysis is a simple methodology that can provide some great benefits, including improved electric grid reliability:
• Improved reliability: this analysis can help improve the reliability of equipment by addressing the root causes of failures.
• Reduced downtime: it can help reduce downtime by preventing machinery from failing.
• Increased efficiency: it can help improve efficiency by streamlining processes and eliminating inefficiencies.
• Cost savings: it can help reduce costs by preventing future failures and reducing the need for repairs and maintenance.
• Better safety: it can help improve workplace safety by identifying and resolving the root causes of safety incidents.
• Better decision making: it can help improve decision making by providing data-driven insights.
• Continuous improvement: it can help ensure that lessons are learned from failures and acted on.
What is a root cause analysis?
A root cause analysis is simply a systematic process for determining the cause of a problem or incident. It helps us determine the main cause of an issue so that we can correct or mitigate it properly rather than continuously treating only the symptoms. The following is one way to perform a root cause analysis.
- Define the problem:
a. What is going on/what is the problem?
b. Are there symptoms? If so, what are they?
- Collect available information/data.
a. How long has the problem persisted?
b. How is the problem affecting the atmosphere/systems around it?
c. Do we have evidence outlining the problem?
- Identify causal factors.
a. What circumstance(s) have led to the issue at hand? Are there multiple causal factors and multiple root causes per causal factor?
- What is the real reason the problem occurred (identifying why)? Work down to the right level of granularity using the “five whys”:
a. Ask “why” up to five times in a row (do not go beyond five, as this could lead the analysis into unrealistic scenarios and circumstances).
b. Example problem: Patches were not installed in proper time in March.
i. Why weren’t the patches from March installed? Because we did not know there were patches available.
ii. Why didn’t you know there were patches available? Because we were not reviewing emails telling us the vendor patch/update location had changed.
iii. Why weren’t you reviewing emails and notifications from the vendor? Because we were not receiving any vendor emails.
iv. Why weren’t you receiving vendor emails? Because vendor documentation and notifications were being received via email and those emails were sent to our spam folder.
v. Why were vendor emails and notifications being sent to the spam folder? Because new, more restrictive security policies were put in place directing the emails to the spam folder.
c. Potential root cause mitigation: Have a process/procedure for communicating changes to applicable personnel. When new security policies are implemented, ensure that departments are properly notified that they will need to check their spam folders. Direct staff to contact the IT department if important emails/contacts are being routed to the spam folder so that those messages can be delivered to their inbox.
- Identify, recommend, and implement mitigations and solutions:
a. What measures are to be taken to prevent recurrence?
b. How do you implement the solution?
c. Who is responsible for solution implementation?
d. Are there pros and cons of solution implementation?
e. Have you checked to see if this root cause could be affecting any other processes, procedures, or systems?
- Gather feedback and test solutions.
a. Discuss and share lessons learned/feedback while investigating an incident.
b. Test implemented solution(s) to make sure they address the root cause (not just the symptoms).
- Monitor results.
a. Monitor results to ensure symptoms have stopped and ensure root cause has been fully addressed.
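For teams that keep their analyses in a script or notebook, the “five whys” example above could be recorded along the following lines. This is only a minimal sketch in Python: the names FiveWhys, WhyStep, ask, root_cause, and report are hypothetical and chosen for illustration, not part of any standard tool, and the final answer in the chain is treated as a candidate root cause that still needs to be confirmed.

```python
from dataclasses import dataclass, field

MAX_WHYS = 5  # the process above advises not going beyond five


@dataclass
class WhyStep:
    """One question/answer pair in the chain (hypothetical helper)."""
    question: str
    answer: str


@dataclass
class FiveWhys:
    """Records a problem statement and up to five 'why' steps."""
    problem: str
    steps: list[WhyStep] = field(default_factory=list)

    def ask(self, question: str, answer: str) -> None:
        # Refuse a sixth "why" so the chain stays realistic.
        if len(self.steps) >= MAX_WHYS:
            raise ValueError("stop at five whys")
        self.steps.append(WhyStep(question, answer))

    def root_cause(self) -> str:
        # The last recorded answer is the candidate root cause.
        if not self.steps:
            raise ValueError("no whys recorded yet")
        return self.steps[-1].answer

    def report(self) -> str:
        lines = [f"Problem: {self.problem}"]
        for i, step in enumerate(self.steps, start=1):
            lines.append(f"{i}. {step.question} {step.answer}")
        lines.append(f"Candidate root cause: {self.root_cause()}")
        return "\n".join(lines)


analysis = FiveWhys("Patches were not installed in proper time in March.")
analysis.ask("Why weren't the patches from March installed?",
             "Because we did not know there were patches available.")
analysis.ask("Why didn't you know there were patches available?",
             "Because we were not reviewing emails telling us the vendor "
             "patch/update location had changed.")
analysis.ask("Why weren't you reviewing emails and notifications from the vendor?",
             "Because we were not receiving any vendor emails.")
analysis.ask("Why weren't you receiving vendor emails?",
             "Because those emails were sent to our spam folder.")
analysis.ask("Why were vendor emails being sent to the spam folder?",
             "Because new, more restrictive security policies were put in "
             "place directing the emails to the spam folder.")

print(analysis.report())
```

Capping the chain at five in code mirrors the guidance above; the cap is a design choice of this sketch and can be relaxed if your own process allows it.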
Pitfalls
The following are potential pitfalls you may run into while performing a root cause analysis:
- Blaming the human/workforce. Though it can be quite easy to blame a large number of issues on a human performance failure, we need to understand the conditions and circumstances that led the human element to fail.
a. Have you asked yourself why the human failed?
b. Have you engaged in open dialogue/two-way communication with affected employees?
c. Was it due to a lack of training on procedures/processes?
d. Were the procedures/processes unclear or ambiguous?
e. Did the conditions present at the time of the incident prevent the person from focusing 100% on the required task, or were there conditions that caused some form of deviation from stated/written processes/procedures?
- Not including near misses as part of the analysis.
a. Near misses can assist with the root cause analysis in a couple of ways. They can help identify underlying systemic issues. They can also point to internal controls that are working, since something caught the near miss before it became an issue.
Next time you encounter an issue in the field (or outside of work), I encourage you to try using a root cause analysis to help mitigate the root of the problem. If you have any questions, comments, or concerns, or would like to discuss this topic further, you can contact me at Tony.Freeman@rfirst.org.