[OR] [P2] [S3] [ITo] [C13] Monitoring, Metrics, and Continuous Improvement

Written by Moh Heng Goh | May 8, 2026 9:58:33 AM

[P2] [S3] Chapter 13

Monitoring, Metrics, and Continuous Improvement

Introduction

Impact tolerance is not a one-time definition; it is a dynamic capability that must be continuously monitored, measured, and refined. As organisations evolve, so do their services, technologies, dependencies, customer expectations, and risk environments.

Without ongoing monitoring, even well-defined tolerances can quickly become outdated or misaligned with reality.

To remain effective, organisations must establish a structured approach to metrics, monitoring, and continuous improvement, ensuring that impact tolerances remain relevant, achievable, and aligned with both operational capability and regulatory expectations.

Purpose of the Chapter

The purpose of this chapter is to:

  • Define key metrics used to monitor impact tolerance
  • Establish Key Risk Indicators (KRIs) linked to tolerance thresholds
  • Implement continuous monitoring mechanisms
  • Enable feedback loops for continuous improvement
  • Ensure impact tolerances remain relevant over time

Key Metrics for Monitoring Impact Tolerance

Effective monitoring begins with clearly defined and measurable metrics.

Service Availability

Service availability is a fundamental indicator of resilience.

Key Measures:

  • Percentage uptime of each Critical Business Service (CBS)
  • Duration of service outages
  • Frequency of disruptions

Example:

CBS | Target Availability | Actual Performance
--- | --- | ---
Deposit Services | 99.9% | 99.7%
Payment Services | 99.95% | 99.8%
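To make the availability figures concrete, the sketch below converts an availability percentage into minutes of downtime. The 30-day measurement period and the function names are assumptions made for illustration; the source does not state how availability is measured.

```python
MINUTES_PER_DAY = 24 * 60


def availability_pct(outage_minutes: float, period_days: int = 30) -> float:
    """Return availability as a percentage of the measurement period."""
    total = period_days * MINUTES_PER_DAY
    return 100.0 * (total - outage_minutes) / total


def allowed_downtime_minutes(target_pct: float, period_days: int = 30) -> float:
    """Return the downtime implied by an availability percentage."""
    total = period_days * MINUTES_PER_DAY
    return total * (100.0 - target_pct) / 100.0


# Target and actual availability taken from the example above.
services = {
    "Deposit Services": (99.9, 99.7),
    "Payment Services": (99.95, 99.8),
}

for name, (target, actual) in services.items():
    budget = allowed_downtime_minutes(target)
    used = allowed_downtime_minutes(actual)  # downtime implied by the actual figure
    status = "met" if actual >= target else "missed"
    print(f"{name}: target {status}; budget {budget:.1f} min, used {used:.1f} min")
```

Under the 30-day assumption, a 99.9% target leaves about 43 minutes of downtime, whereas 99.7% actual performance corresponds to roughly 130 minutes, so both services in the example miss their targets.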

Recovery Performance vs Tolerance

This measures how well the organisation performs relative to defined impact tolerances.

Key Measures:

  • Actual recovery time vs MTD (Maximum Tolerable Downtime)
  • Actual data loss vs MTDL (Maximum Tolerable Data Loss)
  • Time taken to restore the minimum service capacity

Example:

CBS | Defined MTD | Actual Recovery Time | Result
--- | --- | --- | ---
Deposit Services | 4 hours | 3.5 hours | Within tolerance
Payment Services | 2 hours | 2.5 hours | Breach
Digital Banking | 3 hours | 2 hours | Within tolerance
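The comparison in the example above can be automated. The following is a minimal sketch, assuming that a recovery completed exactly at the MTD still counts as within tolerance; the `RecoveryResult` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RecoveryResult:
    service: str
    mtd_hours: float      # defined Maximum Tolerable Downtime
    actual_hours: float   # observed recovery time

    @property
    def within_tolerance(self) -> bool:
        # Assumption: recovering exactly at the MTD is not a breach.
        return self.actual_hours <= self.mtd_hours

    @property
    def margin_hours(self) -> float:
        """Positive = headroom left; negative = overrun past the MTD."""
        return self.mtd_hours - self.actual_hours


# Values taken from the example above.
results = [
    RecoveryResult("Deposit Services", 4.0, 3.5),
    RecoveryResult("Payment Services", 2.0, 2.5),
    RecoveryResult("Digital Banking", 3.0, 2.0),
]

for r in results:
    label = "Within tolerance" if r.within_tolerance else "Breach"
    print(f"{r.service}: {label} (margin {r.margin_hours:+.1f} h)")
```

Reporting the margin alongside the pass/fail result shows how close a service came to its limit, which a simple "within tolerance" label hides.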

Capacity and Throughput Metrics

  • Percentage of normal transaction capacity maintained
  • Volume of processed vs failed transactions
  • Backlog accumulation and clearance time

Customer Impact Metrics

  • Number of customers affected
  • Customer complaints and escalation rates
  • Service response times
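The capacity and throughput measures listed above reduce to simple ratios. The sketch below shows one possible way to compute them; the function names and sample figures are hypothetical, and "processed" is assumed to mean successfully completed transactions.

```python
from typing import Optional


def capacity_maintained_pct(current_tps: float, normal_tps: float) -> float:
    """Share of normal transaction capacity still available, in percent."""
    return 100.0 * current_tps / normal_tps


def failure_rate_pct(processed: int, failed: int) -> float:
    """Failed transactions as a percentage of all attempted transactions."""
    attempted = processed + failed
    return 100.0 * failed / attempted if attempted else 0.0


def backlog_clearance_hours(backlog: int, processing_per_hour: float,
                            arrivals_per_hour: float) -> Optional[float]:
    """Hours needed to drain the backlog, or None if it never clears."""
    net_rate = processing_per_hour - arrivals_per_hour
    if net_rate <= 0:
        return None  # backlog is stable or still growing
    return backlog / net_rate


# Hypothetical figures for a degraded-service scenario.
print(capacity_maintained_pct(current_tps=600, normal_tps=1000))
print(failure_rate_pct(processed=9500, failed=500))
print(backlog_clearance_hours(12000, 5000, 3000))
```

Returning `None` when arrivals match or exceed processing makes the "backlog never clears" case explicit rather than producing a negative or infinite duration.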

Key Risk Indicators (KRIs)

KRIs provide early warning signals that tolerance thresholds may be at risk.

Characteristics of Effective KRIs

KRIs should be:

  • Forward-looking (predict potential issues)
  • Measurable and quantifiable
  • Linked to impact tolerance thresholds
  • Actionable with defined triggers

Example KRIs

KRI | Threshold | Action Trigger
--- | --- | ---
System uptime degradation | < 98% | Investigate and escalate
Transaction backlog growth | > 20% increase | Activate mitigation measures
Third-party service latency | > 30% above baseline | Engage the vendor and monitor
Incident frequency | > 3 major incidents/month | Review root causes
Staff availability | < 80% critical roles filled | Activate contingency staffing
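The example KRIs above pair a threshold with an action, which maps naturally onto a small rule table. The sketch below is one possible encoding, not a prescribed implementation; the `KRI` structure, the reading names, and the sample values are hypothetical, while the thresholds and actions come from the example.

```python
from typing import Callable, NamedTuple


class KRI(NamedTuple):
    name: str
    breached: Callable[[float], bool]  # True when the threshold is crossed
    action: str


# Thresholds and actions mirror the example KRIs above.
KRIS = [
    KRI("System uptime (%)", lambda v: v < 98.0, "Investigate and escalate"),
    KRI("Transaction backlog growth (%)", lambda v: v > 20.0,
        "Activate mitigation measures"),
    KRI("Third-party latency above baseline (%)", lambda v: v > 30.0,
        "Engage the vendor and monitor"),
    KRI("Major incidents per month", lambda v: v > 3, "Review root causes"),
    KRI("Critical roles filled (%)", lambda v: v < 80.0,
        "Activate contingency staffing"),
]


def triggered_actions(readings: dict[str, float]) -> list[tuple[str, str]]:
    """Return (KRI name, action) pairs for every reading past its threshold."""
    return [
        (k.name, k.action)
        for k in KRIS
        if k.name in readings and k.breached(readings[k.name])
    ]


# Hypothetical readings for one reporting period.
sample = {
    "System uptime (%)": 97.4,
    "Transaction backlog growth (%)": 12.0,
    "Major incidents per month": 4,
}
for name, action in triggered_actions(sample):
    print(f"{name}: {action}")
```

Keeping each action next to its threshold means every KRI is actionable by construction, in line with the characteristics listed earlier.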

Linking KRIs to Impact Tolerance

KRIs should signal:

  • Approaching tolerance limits
  • Increased likelihood of disruption
  • Potential cascading failures

Key Principle

KRIs enable organisations to act before tolerance is breached, not after.

Continuous Monitoring Mechanisms

Monitoring must be supported by systems and processes that provide real-time or near real-time visibility.

Monitoring Tools and Systems

  • System performance dashboards
  • Application monitoring tools
  • Network and infrastructure monitoring
  • Third-party service monitoring platforms
  • Incident management systems

Operational Monitoring

Operational teams should monitor:

  • Service performance against thresholds
  • System health and alerts
  • Transaction flows and backlog
  • Customer service indicators

Management Reporting

Regular reporting should include:

Reporting Level | Focus
--- | ---
Operational | Daily/real-time performance metrics
Management | Weekly/monthly performance trends
Senior Management / Board | Strategic overview, breaches, and risks

Early Warning Systems

Organisations should implement alerts for:

  • Approaching tolerance thresholds
  • System degradation
  • Third-party failures
  • Increased incident frequency
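An alert for approaching a tolerance threshold can be expressed as the share of the MTD already consumed by an ongoing outage. The sketch below is illustrative only: the 50% warning and 80% critical cut-offs, the level names, and the function name are assumptions, not values taken from the source.

```python
def alert_level(elapsed_hours: float, mtd_hours: float,
                warn_at: float = 0.5, critical_at: float = 0.8) -> str:
    """Classify an ongoing outage by the share of the MTD already consumed."""
    consumed = elapsed_hours / mtd_hours
    if consumed >= 1.0:
        return "BREACH"
    if consumed >= critical_at:
        return "CRITICAL"
    if consumed >= warn_at:
        return "WARNING"
    return "OK"


# Hypothetical outage on a service with a 2-hour MTD.
for elapsed in (0.5, 1.0, 1.7, 2.5):
    print(f"{elapsed:.1f} h elapsed -> {alert_level(elapsed, mtd_hours=2.0)}")
```

Graduated levels give response teams time to escalate while recovery within tolerance is still possible, rather than alerting only once the limit has been passed.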

Feedback Loops and Lessons Learned

Continuous improvement relies on structured feedback mechanisms.

Sources of Feedback

Source | Insight Provided
--- | ---
Incident Reports | Actual disruption impact and response effectiveness
Scenario Testing | Performance under simulated stress conditions
Customer Feedback | Perceived service quality and pain points
Audit Findings | Governance and control weaknesses
Regulatory Feedback | Compliance gaps and expectations
Operational Metrics | Trends and performance deviations

Lessons Learned Process

A structured approach should include:

  • Capture
      • Document incidents, test results, and observations
  • Analyse
      • Identify root causes and contributing factors
  • Evaluate
      • Assess impact relative to defined tolerances
  • Improve
      • Implement corrective actions and enhancements
  • Update
      • Revise impact tolerances, processes, or controls if required

Example:

Event | Lesson Learned | Improvement Action
--- | --- | ---
Payment outage | Recovery time exceeded tolerance | Upgrade failover systems
Cyber incident | Detection delay | Enhance monitoring tools
Third-party failure | Lack of backup vendor | Establish an alternate provider

Continuous Improvement Framework

Impact tolerance should evolve through a structured improvement cycle.

Improvement Cycle

  1. Define impact tolerance
  2. Monitor performance
  3. Detect deviations
  4. Analyse root causes
  5. Implement improvements
  6. Reassess tolerance

Key Drivers of Change

  • Technology upgrades or failures
  • Changes in customer behaviour
  • New regulatory requirements
  • Emerging risks (e.g., cyber threats, supply chain disruptions)
  • Organisational changes (e.g., mergers, outsourcing)

Integration with Operational Resilience Lifecycle

Monitoring and continuous improvement support all stages of the lifecycle:

Lifecycle Stage | Role of Monitoring
--- | ---
Plan | Define metrics and KRIs
Implement | Monitor performance against tolerance
Test | Validate through scenario testing
Improve | Refine tolerances and capabilities

Common Challenges

Challenge | Description
--- | ---
Inadequate metrics | Lack of meaningful or measurable indicators
Data fragmentation | Inconsistent data across systems
Delayed reporting | Lack of real-time visibility
Reactive approach | Acting only after incidents occur
Weak feedback loops | Lessons not translated into improvements

Best Practices

  • Define clear, measurable metrics aligned with impact tolerance
  • Implement real-time monitoring and alerting systems
  • Use KRIs to provide early warning signals
  • Establish structured feedback and lessons learned processes
  • Integrate monitoring into governance and reporting frameworks
  • Regularly review and update metrics and tolerances
  • Foster a culture of continuous improvement

Monitoring, metrics, and continuous improvement are essential to ensuring that impact tolerances remain relevant and effective in a changing environment. By establishing clear performance indicators, implementing robust monitoring systems, and embedding structured feedback loops, organisations can maintain visibility over their resilience capabilities and respond proactively to emerging risks.

Continuous improvement transforms impact tolerance from a static threshold into a living capability, enabling organisations to adapt, strengthen, and sustain resilience over time. Ultimately, this ensures that critical business services can be delivered consistently within acceptable limits, even in the face of evolving disruptions.

