[OR] [P2] [S3] [ITo] [C13] Monitoring, Metrics, and Continuous Improvement

Written by Moh Heng Goh | May 8, 2026 9:58:33 AM

[P2] [S3] Chapter 13

Introduction

Impact tolerance is not a one-time definition; it is a dynamic capability that must be continuously monitored, measured, and refined. As organisations evolve, so do their services, technologies, dependencies, customer expectations, and risk environments.

Without ongoing monitoring, even well-defined tolerances can quickly become outdated or misaligned with reality.

To remain effective, organisations must establish a structured approach to metrics, monitoring, and continuous improvement, ensuring that impact tolerances remain relevant, achievable, and aligned with both operational capability and regulatory expectations.

Purpose of the Chapter

The purpose of this chapter is to:

Define key metrics used to monitor impact tolerance
Establish Key Risk Indicators (KRIs) linked to tolerance thresholds
Implement continuous monitoring mechanisms
Enable feedback loops for continuous improvement
Ensure impact tolerances remain relevant over time

Key Metrics for Monitoring Impact Tolerance

Effective monitoring begins with clearly defined and measurable metrics.

Service Availability

Service availability is a fundamental indicator of resilience.

Key Measures:

Percentage uptime of each Critical Business Service (CBS)
Duration of service outages
Frequency of disruptions

Example:

CBS	Target Availability	Actual Performance
Deposit Services	99.9%	99.7%
Payment Services	99.95%	99.8%
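
To make the gap between target and actual availability more tangible, the percentages in the table above can be converted into minutes of downtime. The following is a minimal sketch that assumes a 30-day measurement period (the chapter does not state one); the function name and the dictionary layout are illustrative only.

```python
# Illustrative sketch: convert availability percentages into downtime minutes.
# Assumption: a 30-day measurement period; the chapter does not specify one.
PERIOD_MINUTES = 30 * 24 * 60  # 43,200 minutes in 30 days


def downtime_minutes(availability_pct: float, period_minutes: int = PERIOD_MINUTES) -> float:
    """Return the downtime (in minutes) implied by an availability percentage."""
    return round((100.0 - availability_pct) / 100.0 * period_minutes, 1)


# Figures taken from the example table above.
services = {
    "Deposit Services": {"target": 99.9, "actual": 99.7},
    "Payment Services": {"target": 99.95, "actual": 99.8},
}

for name, s in services.items():
    allowed = downtime_minutes(s["target"])
    observed = downtime_minutes(s["actual"])
    status = "meets target" if s["actual"] >= s["target"] else "below target"
    print(f"{name}: allowed {allowed} min, observed {observed} min ({status})")
```

Under the 30-day assumption, a 99.9% target allows roughly 43.2 minutes of downtime, while 99.7% actual availability corresponds to roughly 129.6 minutes. Expressing the shortfall in minutes can make it easier to compare against a time-based tolerance such as MTD.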

Recovery Performance vs Tolerance

This measures how well the organisation performs relative to defined impact tolerances.

Key Measures:

Actual recovery time vs MTD (Maximum Tolerable Downtime)
Actual data loss vs MTDL (Maximum Tolerable Data Loss)
Time taken to restore the minimum service capacity

Example

CBS	Defined MTD	Actual Recovery Time	Result
Deposit Services	4 hours	3.5 hours	Within tolerance
Payment Services	2 hours	2.5 hours	Breach
Digital Banking	3 hours	2 hours	Within tolerance
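
The comparison in the table above can be automated so that each recovery is classified consistently. This is a sketch only: the class and field names are hypothetical, and it assumes that recovering exactly at the MTD still counts as within tolerance, which the chapter does not specify.

```python
from dataclasses import dataclass


@dataclass
class RecoveryResult:
    service: str
    mtd_hours: float     # defined Maximum Tolerable Downtime
    actual_hours: float  # observed recovery time

    @property
    def within_tolerance(self) -> bool:
        # Assumption: recovery exactly at the MTD counts as within tolerance.
        return self.actual_hours <= self.mtd_hours

    @property
    def margin_hours(self) -> float:
        # Positive = headroom remaining; negative = size of the breach.
        return self.mtd_hours - self.actual_hours


# Figures taken from the example table above.
results = [
    RecoveryResult("Deposit Services", 4.0, 3.5),
    RecoveryResult("Payment Services", 2.0, 2.5),
    RecoveryResult("Digital Banking", 3.0, 2.0),
]

for r in results:
    label = "Within tolerance" if r.within_tolerance else "Breach"
    print(f"{r.service}: {label} (margin {r.margin_hours:+.1f} h)")
```

Tracking the margin as well as the pass/fail result shows how close a service came to its limit, which is useful when deciding where to invest in recovery capability.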

Capacity and Throughput Metrics

Percentage of normal transaction capacity maintained
Volume of processed vs failed transactions
Backlog accumulation and clearance time
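
The three measures above could be calculated along the following lines. This is an illustrative sketch: the function names, the input units (transactions per second and per hour), and the sample figures are assumptions, not values from the chapter.

```python
from typing import Optional


def capacity_maintained_pct(current_tps: float, normal_tps: float) -> float:
    """Percentage of normal transaction capacity currently available."""
    if normal_tps <= 0:
        raise ValueError("normal_tps must be positive")
    return current_tps * 100.0 / normal_tps


def failure_rate_pct(processed: int, failed: int) -> float:
    """Failed transactions as a percentage of all attempted transactions."""
    attempted = processed + failed
    return failed * 100.0 / attempted if attempted else 0.0


def backlog_clearance_hours(
    backlog: int, processing_per_hour: float, arrivals_per_hour: float
) -> Optional[float]:
    """Hours needed to clear the backlog, or None if it is not shrinking."""
    net_rate = processing_per_hour - arrivals_per_hour
    if net_rate <= 0:
        return None  # backlog is stable or growing
    return backlog / net_rate


# Hypothetical figures for illustration only.
print(capacity_maintained_pct(600, 1000))            # share of normal capacity
print(failure_rate_pct(processed=9500, failed=500))  # share of failed transactions
print(backlog_clearance_hours(12000, 5000, 3000))    # hours until backlog is cleared
```

Returning None when the backlog is not shrinking makes that condition explicit, since no clearance time exists until processing outpaces new arrivals.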

Customer Impact Metrics

Number of customers affected
Customer complaints and escalation rates
Service response times

Key Risk Indicators (KRIs)

KRIs provide early warning signals that tolerance thresholds may be at risk.

Characteristics of Effective KRIs

KRIs should be:

Forward-looking (predict potential issues)
Measurable and quantifiable
Linked to impact tolerance thresholds
Actionable with defined triggers

Example KRIs

KRI	Threshold	Action Trigger
System uptime degradation	< 98%	Investigate and escalate
Transaction backlog growth	> 20% increase	Activate mitigation measures
Third-party service latency	> 30% above baseline	Engage the vendor and monitor
Incident frequency	> 3 major incidents/month	Review root causes
Staff availability	< 80% critical roles filled	Activate contingency staffing
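
One way to operationalise the table above is to encode each KRI as a threshold rule paired with its action trigger. The sketch below is hypothetical: the KRI labels are reworded so that each reading is a single number, and the sample readings are invented for illustration.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class KRI:
    name: str
    breached_when: Callable[[float, float], bool]  # comparison: reading vs threshold
    threshold: float
    action: str


# Thresholds and actions taken from the example KRI table above.
KRIS = [
    KRI("System uptime (%)", operator.lt, 98.0, "Investigate and escalate"),
    KRI("Transaction backlog growth (%)", operator.gt, 20.0, "Activate mitigation measures"),
    KRI("Third-party latency above baseline (%)", operator.gt, 30.0, "Engage the vendor and monitor"),
    KRI("Major incidents per month", operator.gt, 3.0, "Review root causes"),
    KRI("Critical roles filled (%)", operator.lt, 80.0, "Activate contingency staffing"),
]


def triggered_actions(readings: Dict[str, float]) -> List[Tuple[str, str]]:
    """Return (KRI name, action) for every reading that crosses its threshold."""
    return [
        (k.name, k.action)
        for k in KRIS
        if k.name in readings and k.breached_when(readings[k.name], k.threshold)
    ]


# Hypothetical readings for illustration only.
readings = {
    "System uptime (%)": 97.5,
    "Transaction backlog growth (%)": 12.0,
    "Third-party latency above baseline (%)": 45.0,
    "Major incidents per month": 3.0,
    "Critical roles filled (%)": 85.0,
}

for name, action in triggered_actions(readings):
    print(f"{name}: {action}")
```

With these sample readings, only the uptime and third-party latency rules fire. Exactly three major incidents does not trigger a review, because the table's threshold is strictly greater than three.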

Linking KRIs to Impact Tolerance

KRIs should signal:

Approaching tolerance limits
Increased likelihood of disruption
Potential cascading failures

Key Principle

KRIs enable organisations to act before tolerance is breached, not after.

Continuous Monitoring Mechanisms

Monitoring must be supported by systems and processes that provide real-time or near-real-time visibility.

Monitoring Tools and Systems

System performance dashboards
Application monitoring tools
Network and infrastructure monitoring
Third-party service monitoring platforms
Incident management systems

Operational Monitoring

Operational teams should monitor:

Service performance against thresholds
System health and alerts
Transaction flows and backlog
Customer service indicators

Management Reporting

Regular reporting should include:

Reporting Level	Focus
Operational	Daily/real-time performance metrics
Management	Weekly/monthly performance trends
Senior Management / Board	Strategic overview, breaches, and risks

Early Warning Systems

Organisations should implement alerts for:

Approaching tolerance thresholds
System degradation
Third-party failures
Increased incident frequency
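
An alert for an approaching tolerance threshold could, for example, compare the elapsed duration of a disruption with the service's MTD. The sketch below is illustrative: the 75% warning ratio and the level names are assumptions rather than values prescribed by this chapter.

```python
def alert_level(elapsed_hours: float, mtd_hours: float, warning_ratio: float = 0.75) -> str:
    """Classify an ongoing disruption relative to its MTD.

    warning_ratio is an assumed value; each organisation would set its own.
    """
    if mtd_hours <= 0:
        raise ValueError("mtd_hours must be positive")
    consumed = elapsed_hours / mtd_hours  # fraction of the tolerance used so far
    if consumed > 1.0:
        return "BREACH"
    if consumed >= warning_ratio:
        return "WARNING"
    return "OK"


# Example: a service with a 2-hour MTD at different points in an outage.
for elapsed in (1.0, 1.5, 2.5):
    print(f"{elapsed} h elapsed: {alert_level(elapsed, mtd_hours=2.0)}")
```

For a 2-hour MTD, this raises a warning at 1.5 hours, which leaves 30 minutes to escalate before the tolerance is exceeded.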

Feedback Loops and Lessons Learned

Continuous improvement relies on structured feedback mechanisms.

Sources of Feedback

Source	Insight Provided
Incident Reports	Actual disruption impact and response effectiveness
Scenario Testing	Performance under simulated stress conditions
Customer Feedback	Perceived service quality and pain points
Audit Findings	Governance and control weaknesses
Regulatory Feedback	Compliance gaps and expectations
Operational Metrics	Trends and performance deviations

Lessons Learned Process

A structured approach should include:

Capture

Document incidents, test results, and observations

Analyse

Identify root causes and contributing factors

Evaluate

Assess impact relative to defined tolerances

Improve

Implement corrective actions and enhancements

Update

Revise impact tolerances, processes, or controls if required

Example

Event	Lesson Learned	Improvement Action
Payment outage	Recovery time exceeded tolerance	Upgrade failover systems
Cyber incident	Detection delay	Enhance monitoring tools
Third-party failure	Lack of backup vendor	Establish an alternate provider
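
The lessons learned process can also be tracked as a simple record that moves through the five steps in order, so that no lesson is closed before its improvement has been implemented and any updates made. This is a minimal sketch with hypothetical class and field names; it is not a prescribed tool.

```python
from dataclasses import dataclass, field
from typing import List

# The five steps of the lessons learned process described above.
STEPS = ["Capture", "Analyse", "Evaluate", "Improve", "Update"]


@dataclass
class LessonRecord:
    event: str
    lesson: str
    action: str
    completed: List[str] = field(default_factory=list)

    def advance(self) -> str:
        """Mark the next step as complete and return its name."""
        if self.is_closed:
            raise ValueError("All steps are already complete")
        step = STEPS[len(self.completed)]
        self.completed.append(step)
        return step

    @property
    def is_closed(self) -> bool:
        return len(self.completed) == len(STEPS)


# First row of the example table above.
record = LessonRecord(
    event="Payment outage",
    lesson="Recovery time exceeded tolerance",
    action="Upgrade failover systems",
)

while not record.is_closed:
    print(f"{record.event}: completed step '{record.advance()}'")
```

Enforcing the step order in code is one design choice among several; a ticketing or GRC system with workflow states would serve the same purpose.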

Continuous Improvement Framework

Impact tolerance should evolve through a structured improvement cycle.

Improvement Cycle

Define impact tolerance
Monitor performance
Detect deviations
Analyse root causes
Implement improvements
Reassess tolerance

Key Drivers of Change

Technology upgrades or failures
Changes in customer behaviour
New regulatory requirements
Emerging risks (e.g., cyber threats, supply chain disruptions)
Organisational changes (e.g., mergers, outsourcing)

Integration with Operational Resilience Lifecycle

Monitoring and continuous improvement support all stages of the lifecycle:

Lifecycle Stage	Role of Monitoring
Plan	Define metrics and KRIs
Implement	Monitor performance against tolerance
Test	Validate through scenario testing
Improve	Refine tolerances and capabilities

Common Challenges

Challenge	Description
Inadequate metrics	Lack of meaningful or measurable indicators
Data fragmentation	Inconsistent data across systems
Delayed reporting	Lack of real-time visibility
Reactive approach	Acting only after incidents occur
Weak feedback loops	Lessons not translated into improvements

Best Practices

Define clear, measurable metrics aligned with impact tolerance
Implement real-time monitoring and alerting systems
Use KRIs to provide early warning signals
Establish structured feedback and lessons learned processes
Integrate monitoring into governance and reporting frameworks
Regularly review and update metrics and tolerances
Foster a culture of continuous improvement

Monitoring, metrics, and continuous improvement are essential to ensuring that impact tolerances remain relevant and effective in a changing environment. By establishing clear performance indicators, implementing robust monitoring systems, and embedding structured feedback loops, organisations can maintain visibility over their resilience capabilities and respond proactively to emerging risks.

Continuous improvement transforms impact tolerance from a static threshold into a living capability, enabling organisations to adapt, strengthen, and sustain resilience over time. Ultimately, this ensures that critical business services can be delivered consistently within acceptable limits, even in the face of evolving disruptions.
