Root Cause Analysis (RCA) Techniques

Written by Moh Heng Goh | May 14, 2026 2:52:55 PM

Chapter 6

Introduction

One of the most common weaknesses in lessons learned processes is the tendency to address symptoms rather than root causes. Superficial fixes may resolve immediate issues but often lead to:

  • Recurring incidents
  • Persistent control weaknesses
  • Ineffective resilience improvements

Root Cause Analysis (RCA) is a critical discipline that ensures organisations move beyond “what happened” to understand why it happened and what must change.

In the context of operational resilience, RCA is essential to:

  • Protect Critical Business Services (CBS)
  • Prevent breaches of impact tolerance
  • Strengthen end-to-end service delivery

Purpose of the Chapter

This chapter provides a practical and structured approach to Root Cause Analysis (RCA), enabling organisations to identify the true underlying causes of disruptions and to ensure that lessons learned lead to effective, sustainable improvements in operational resilience.

 

Definition and Objectives of RCA

Definition

Root Cause Analysis is a systematic process used to identify the fundamental causes of an incident, disruption, or failure.

Objectives
  • Identify the true underlying causes of events
  • Distinguish between symptoms and root causes
  • Prevent recurrence of incidents
  • Improve resilience capabilities

 

Types of Causes

Understanding different types of causes is essential for effective RCA.

Immediate Cause
  • The direct trigger of the incident
  • Example: System crash
Contributing Factors
  • Conditions that enabled the incident
  • Example: Lack of monitoring
Root Cause
  • The fundamental issue that allowed the incident to occur
  • Example: Inadequate system design or governance
Example

| Level | Description |
| --- | --- |
| Immediate Cause | Payment system outage |
| Contributing Factor | Server overload |
| Root Cause | Lack of capacity planning and stress testing |

 

Principles of Effective RCA

Focus on Systems, Not Individuals
  • Avoid blame culture
  • Identify systemic weaknesses
Evidence-Based Analysis
  • Use data, logs, and factual information
  • Avoid assumptions
Structured Approach
  • Follow defined methodologies
  • Ensure consistency
Service-Centric Perspective
  • Focus on impact to CBS
  • Consider end-to-end service delivery
Cross-Functional Collaboration
  • Involve multiple stakeholders:
    • Business
    • IT
    • Risk
    • Vendors

 

Key RCA Techniques

The 5 Whys Technique

Overview

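A simple but powerful method that involves asking “Why?” repeatedly, with each answer becoming the subject of the next question, to drill down to the root cause. Five is a rule of thumb, not a fixed limit: stop when the answer points to something the organisation can change.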

Example

| Question | Answer |
| --- | --- |
| Why did the system fail? | Because the server crashed |
| Why did the server crash? | Because it was overloaded |
| Why was it overloaded? | Because capacity limits were exceeded |
| Why were limits exceeded? | Because demand forecasting was inaccurate |
| Why was forecasting inaccurate? | Because monitoring and analytics were insufficient |
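
The chain of questions above can also be recorded as structured data so that each step stays traceable in the lessons learned record. The following is a minimal sketch, not a prescribed tool: the names `WhyStep` and `root_cause` are hypothetical, and it assumes the final answer in the chain is treated as the candidate root cause.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class WhyStep:
    question: str
    answer: str


def root_cause(chain: list[WhyStep]) -> str:
    """Return the answer to the last 'Why?' as the candidate root cause."""
    if not chain:
        raise ValueError("at least one Why step is required")
    return chain[-1].answer


# The five steps from the example above.
chain = [
    WhyStep("Why did the system fail?", "Because the server crashed"),
    WhyStep("Why did the server crash?", "Because it was overloaded"),
    WhyStep("Why was it overloaded?", "Because capacity limits were exceeded"),
    WhyStep("Why were limits exceeded?", "Because demand forecasting was inaccurate"),
    WhyStep("Why was forecasting inaccurate?",
            "Because monitoring and analytics were insufficient"),
]

for depth, step in enumerate(chain, start=1):
    print(f"{depth}. {step.question} -> {step.answer}")
print("Candidate root cause:", root_cause(chain))
```

The result is only a candidate: it still needs to be validated against evidence before it is recorded as the root cause.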

Strengths

  • Easy to apply
  • Effective for straightforward issues

Limitations

  • May oversimplify complex problems
  • Depends on facilitator skill

Fishbone (Ishikawa) Diagram

Overview

A visual tool used to categorise potential causes into key domains.

Categories

  • People
  • Process
  • Technology
  • Environment
  • Third-party

Application

  • Identify multiple contributing factors
  • Explore relationships between causes
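
Before drawing the diagram, the causes gathered in a workshop can be grouped under the five categories listed above. The sketch below is illustrative only: `build_fishbone` is a hypothetical helper, and the assignment of each cause to a category is an assumption made for the example, using causes from the payment outage scenario in this chapter.

```python
# The five categories named in this section.
CATEGORIES = ("People", "Process", "Technology", "Environment", "Third-party")


def build_fishbone(causes):
    """Group (category, cause) pairs under the fishbone categories."""
    diagram = {category: [] for category in CATEGORIES}
    for category, cause in causes:
        if category not in diagram:
            raise ValueError(f"unknown category: {category}")
        diagram[category].append(cause)
    return diagram


# Category assignments below are assumptions for illustration.
causes = [
    ("Technology", "Server overload"),
    ("Technology", "Ineffective monitoring"),
    ("Process", "Lack of capacity planning"),
    ("Process", "Inadequate stress testing"),
    ("People", "Delayed response"),
]

fishbone = build_fishbone(causes)
for category, items in fishbone.items():
    print(f"{category}: {', '.join(items) if items else '(none identified)'}")
```

Empty categories are kept in the output on purpose: a branch with no causes is a prompt to ask whether that domain was actually examined.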

Strengths

  • Comprehensive analysis
  • Encourages structured thinking

Fault Tree Analysis (FTA)

Overview

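A top-down approach that starts from the undesired event (the "top event") and maps the logical relationships between the lower-level failures that can cause it, typically using AND and OR gates.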

Application

  • Used for complex systems
  • Identifies combinations of failures
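
To show how a fault tree identifies combinations of failures, the sketch below models a small tree with AND and OR gates and derives its minimal cut sets, that is, the smallest sets of basic failures that together cause the top event. The tree itself is a hypothetical reading of the payment outage scenario, and the names `Event`, `Gate`, `cut_sets` and `minimal` are assumptions made for this example.

```python
from __future__ import annotations

from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Event:
    """A basic failure that is not broken down further."""
    name: str


@dataclass(frozen=True)
class Gate:
    """Combines child failures: 'AND' needs all of them, 'OR' needs any one."""
    kind: str
    inputs: tuple


def cut_sets(node: Event | Gate) -> set[frozenset[str]]:
    """Return the sets of basic events that are sufficient to cause `node`."""
    if isinstance(node, Event):
        return {frozenset([node.name])}
    child_sets = [cut_sets(child) for child in node.inputs]
    if node.kind == "OR":
        # Any single child's cut set is enough on its own.
        return set().union(*child_sets)
    if node.kind == "AND":
        # One cut set from every child must occur together.
        return {frozenset().union(*combo) for combo in product(*child_sets)}
    raise ValueError(f"unknown gate kind: {node.kind}")


def minimal(sets: set[frozenset[str]]) -> set[frozenset[str]]:
    """Drop any cut set that strictly contains another cut set."""
    return {s for s in sets if not any(other < s for other in sets)}


# Hypothetical tree: the outage needs capacity to be exceeded AND
# either monitoring to be ineffective OR the response to be delayed.
top_event = Gate("AND", (
    Event("Capacity exceeded"),
    Gate("OR", (Event("Monitoring ineffective"), Event("Response delayed"))),
))

minimal_sets = minimal(cut_sets(top_event))
for cut_set in sorted(minimal_sets, key=sorted):
    print(" + ".join(sorted(cut_set)))
```

Each printed line is one combination of failures that is sufficient to cause the outage, which makes it easier to see where a single additional control would break several failure paths.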

Strengths

  • Detailed and systematic
  • Suitable for high-impact incidents

Event Timeline Analysis

Overview

Reconstructs the sequence of events leading to an incident.

Application

  • Identify breakdown points
  • Understand decision-making failures
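
A timeline is often assembled from log entries and notes that arrive out of order. The sketch below sorts such entries and flags long gaps between consecutive events, which are candidate breakdown points. All timestamps and event descriptions are invented for illustration, and `build_timeline` and `long_gaps` are hypothetical helper names.

```python
from datetime import datetime, timedelta

# Hypothetical, unordered incident notes (timestamp, description).
raw_events = [
    ("2026-03-02 09:20", "Incident escalated to operations team"),
    ("2026-03-02 09:00", "Payment queue depth starts rising"),
    ("2026-03-02 09:05", "Server reaches capacity limit"),
    ("2026-03-02 09:50", "Service restored"),
    ("2026-03-02 09:07", "Payment processing stops"),
]


def build_timeline(raw):
    """Parse the timestamps and return the events in chronological order."""
    parsed = [(datetime.strptime(ts, "%Y-%m-%d %H:%M"), what) for ts, what in raw]
    return sorted(parsed)


def long_gaps(timeline, threshold):
    """Return consecutive event pairs separated by more than `threshold`."""
    gaps = []
    for (t1, first), (t2, second) in zip(timeline, timeline[1:]):
        if t2 - t1 > threshold:
            gaps.append((first, second, t2 - t1))
    return gaps


timeline = build_timeline(raw_events)
gaps = long_gaps(timeline, timedelta(minutes=10))
for first, second, delay in gaps:
    print(f"{delay} between '{first}' and '{second}'")
```

A flagged gap is not a finding in itself. It marks where the review should ask what was known, who was deciding, and why the next step took as long as it did.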

Strengths

  • Provides context
  • Highlights timing issues

Barrier Analysis

Overview

Examines why controls or safeguards failed.

Application

  • Identify gaps in controls
  • Evaluate effectiveness of safeguards
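
One simple way to record a barrier analysis is to note, for each control, whether it existed and whether it worked during the incident. The sketch below assumes a three-way classification (missing, failed, effective) that is a convention chosen for this example, not a standard; the controls listed and their statuses are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Barrier:
    name: str
    existed: bool  # was the control in place before the incident?
    worked: bool   # did it operate as intended during the incident?


def classify(barrier: Barrier) -> str:
    """Classify a control as 'missing', 'failed' or 'effective'."""
    if not barrier.existed:
        return "missing"
    if not barrier.worked:
        return "failed"
    return "effective"


# Hypothetical controls for the payment outage scenario.
barriers = [
    Barrier("Capacity planning", existed=False, worked=False),
    Barrier("Stress testing", existed=False, worked=False),
    Barrier("System monitoring", existed=True, worked=False),
    Barrier("Customer communication plan", existed=True, worked=True),
]

results = {barrier.name: classify(barrier) for barrier in barriers}
for name, status in results.items():
    print(f"{name}: {status}")
```

Separating "missing" from "failed" matters for the corrective action: a missing control has to be designed and introduced, while a failed one has to be repaired, tested, or replaced.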

 

Linking RCA to Critical Business Services (CBS)

RCA must be aligned with the service-centric approach of operational resilience.

Mapping RCA to CBS
  • Identify which CBS was impacted
  • Determine how the disruption affected service delivery
Understanding End-to-End Impact
  • Analyse dependencies:
    • Upstream processes
    • Downstream services
  • Identify cascading failures
Strengthening Service Resilience
  • Focus on improving:
    • Service continuity
    • Customer outcomes
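
Cascading failures can be traced by walking a dependency map outward from the failed component. The sketch below uses a breadth-first search over a hypothetical map in which each key lists the services that depend on it; the service names and the `cascade` helper are assumptions made for illustration.

```python
from collections import deque

# Hypothetical dependency map: component -> services that depend on it.
downstream = {
    "Payment gateway": ["Payment processing"],
    "Payment processing": ["Customer payments (CBS)", "Settlement reporting"],
    "Settlement reporting": ["Regulatory reporting"],
    "Customer payments (CBS)": [],
    "Regulatory reporting": [],
}


def cascade(failed, graph):
    """Return everything downstream of `failed`, in order of discovery."""
    seen = {failed}
    affected = []
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for dependant in graph.get(current, []):
            if dependant not in seen:
                seen.add(dependant)
                affected.append(dependant)
                queue.append(dependant)
    return affected


affected = cascade("Payment processing", downstream)
print("Affected by a Payment processing failure:", affected)
```

The same walk over an upstream map (component to the things it depends on) shows which inputs a CBS relies on, which helps when checking whether the RCA scope covered the full end-to-end service.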

 

RCA and Impact Tolerance

Assessing Tolerance Breaches
  • Determine whether impact tolerance was breached
  • Identify conditions leading to breach
Refining Tolerance Levels
  • Use RCA insights to:
    • Adjust thresholds
    • Improve monitoring
Enhancing Controls
  • Strengthen controls to prevent future breaches
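
Where impact tolerance is expressed as a maximum tolerable duration of disruption, the breach assessment reduces to a comparison. The sketch below assumes exactly that: a time-based tolerance, with invented figures and a hypothetical `assess_tolerance` helper. Tolerances expressed in other terms, such as the number of customers affected, would need a different comparison.

```python
from datetime import timedelta


def assess_tolerance(outage: timedelta, tolerance: timedelta) -> dict:
    """Compare an outage duration with a time-based impact tolerance."""
    breached = outage > tolerance
    # Overrun is zero when the outage stayed within tolerance.
    overrun = outage - tolerance if breached else timedelta(0)
    return {"breached": breached, "overrun": overrun}


# Hypothetical figures: a 2-hour tolerance and a 2.5-hour outage.
result = assess_tolerance(
    outage=timedelta(hours=2, minutes=30),
    tolerance=timedelta(hours=2),
)
print(result)
```

An outage that stays within tolerance but comes close to it is still worth recording, since RCA insights from near-breaches can feed the threshold and monitoring adjustments described above.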

 

Integrating RCA into Lessons Learned

RCA is a critical component of the lessons learned process.

From Observation to Lesson
  • Observation: What happened
  • RCA: Why it happened
  • Lesson Learned: What must change
Ensuring Actionable Outcomes
  • Link RCA findings to:
    • Specific improvement actions
    • Measurable outcomes
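
The chain from observation to lesson to action can be captured in a simple record that is checked for completeness before the review is closed. The sketch below is one possible shape for such a record; the field names, the `actionability_gaps` check, and the sample measure are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    description: str
    measure: str = ""  # how success will be measured; empty if not yet defined


@dataclass
class Lesson:
    observation: str  # what happened
    root_cause: str   # why it happened
    lesson: str       # what must change
    actions: list = field(default_factory=list)


def actionability_gaps(lesson: Lesson) -> list:
    """List the reasons a lesson is not yet actionable."""
    problems = []
    if not lesson.actions:
        problems.append("no improvement actions")
    for action in lesson.actions:
        if not action.measure:
            problems.append(f"no measurable outcome: {action.description}")
    return problems


# Content taken from this chapter's payment outage scenario;
# the measure text is a hypothetical example.
lesson = Lesson(
    observation="Payment processing outage affecting customers",
    root_cause="Lack of capacity planning and inadequate stress testing",
    lesson="Capacity planning must be improved",
    actions=[
        Action("Conduct regular stress testing",
               measure="Stress test completed and reviewed every quarter"),
        Action("Implement real-time monitoring"),
    ],
)

print(actionability_gaps(lesson))
```

An empty list from the check means every action has a defined measure; it does not say whether the measure is a good one, which remains a judgement for the review.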

 

Common Pitfalls in RCA

Organisations often face the following challenges:

Superficial Analysis
  • Stopping at immediate causes
  • Failing to identify root causes
Blame Culture
  • Focusing on individuals instead of systems
Lack of Data
  • Insufficient evidence
  • Poor documentation
Limited Scope
  • Ignoring interdependencies
  • Focusing on isolated components
Poor Follow-Through
  • Failure to implement corrective actions

 

Best Practices for Effective RCA

Establish Standard Methodologies
  • Use consistent RCA techniques
Train Personnel
  • Develop RCA skills across the organisation
Use Technology and Tools
  • RCA software
  • Data analytics
Integrate Across Functions
  • Collaborate across:
    • IT
    • Operations
    • Risk
Validate Findings
  • Ensure accuracy and completeness

 

Case Example: Payment System Disruption

Incident

A bank experiences a payment processing outage affecting customers.

RCA Findings

  • Immediate Cause: System overload
  • Contributing Factors:
    • Ineffective monitoring
    • Delayed response
  • Root Cause:
    • Lack of capacity planning
    • Inadequate stress testing

Lessons Learned

  • Capacity planning must be improved
  • Monitoring systems must be enhanced

Improvement Actions

  • Upgrade infrastructure
  • Implement real-time monitoring
  • Conduct regular stress testing

 

Embedding RCA into Organisational Culture

Promote a Learning Culture
  • Encourage open discussion
  • Avoid blame
Leadership Support
  • Ensure management commitment
Continuous Improvement
  • Use RCA as a tool for ongoing enhancement

Conclusion

Root Cause Analysis is a cornerstone of effective lessons learned and a critical enabler of operational resilience. By identifying and addressing the true causes of disruptions, organisations can:

  • Prevent recurrence
  • Strengthen Critical Business Services
  • Improve impact tolerance adherence
  • Enhance overall resilience maturity

Without robust RCA, lessons learned remain incomplete and ineffective.

 

Transition to Next Chapter

With this foundation in Root Cause Analysis in place, the next chapter focuses on linking lessons learned to Critical Business Services (CBS), ensuring that improvements are aligned with service delivery and customer impact.

 
