Chapter 13
Monitoring, Metrics, and Continuous Improvement
Introduction
Impact tolerance is not a one-time definition—it is a dynamic capability that must be continuously monitored, measured, and refined. As organisations evolve, so do their services, technologies, dependencies, customer expectations, and risk environments.
Without ongoing monitoring, even well-defined tolerances can quickly become outdated or misaligned with reality.
To remain effective, organisations must establish a structured approach to metrics, monitoring, and continuous improvement, ensuring that impact tolerances remain relevant, achievable, and aligned with both operational capability and regulatory expectations.
Purpose of the Chapter
The purpose of this chapter is to:
- Define key metrics used to monitor impact tolerance
- Establish Key Risk Indicators (KRIs) linked to tolerance thresholds
- Implement continuous monitoring mechanisms
- Enable feedback loops for continuous improvement
- Ensure impact tolerances remain relevant over time
Key Metrics for Monitoring Impact Tolerance
Effective monitoring begins with clearly defined and measurable metrics.
Service Availability
Service availability is a fundamental indicator of resilience.
Key Measures:
- Percentage uptime of each critical business service (CBS)
- Duration of service outages
- Frequency of disruptions
Example:
| CBS | Target Availability | Actual Performance |
|---|---|---|
| Deposit Services | 99.9% | 99.7% |
| Payment Services | 99.95% | 99.8% |
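To make the availability figures concrete, the sketch below converts a percentage into the downtime it implies over a reporting period. The function names and the 30-day period are assumptions chosen for illustration; they are not part of any prescribed method.

```python
def availability_pct(period_minutes: float, outage_minutes: float) -> float:
    """Return the percentage of the reporting period the service was available."""
    if period_minutes <= 0:
        raise ValueError("period_minutes must be positive")
    if not 0 <= outage_minutes <= period_minutes:
        raise ValueError("outage_minutes must lie between 0 and period_minutes")
    return 100.0 * (period_minutes - outage_minutes) / period_minutes


def allowed_downtime_minutes(target_pct: float, period_minutes: float) -> float:
    """Return the downtime (in minutes) implied by an availability percentage."""
    return period_minutes * (100.0 - target_pct) / 100.0


# Assumption: a 30-day reporting period, expressed in minutes.
PERIOD_MINUTES = 30 * 24 * 60

# Target and actual figures taken from the example table above.
for service, target_pct, actual_pct in [
    ("Deposit Services", 99.9, 99.7),
    ("Payment Services", 99.95, 99.8),
]:
    budget = allowed_downtime_minutes(target_pct, PERIOD_MINUTES)
    used = allowed_downtime_minutes(actual_pct, PERIOD_MINUTES)
    print(f"{service}: budget {budget:.1f} min, implied downtime {used:.1f} min")
```

Under that 30-day assumption, a 99.9% target leaves roughly 43 minutes of downtime per period, while 99.7% actual availability corresponds to roughly 130 minutes. Expressing the gap in minutes often makes it easier to discuss than a difference of 0.2 percentage points.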
Recovery Performance vs Tolerance
Recovery performance measures how well the organisation performs during actual disruptions relative to its defined impact tolerances.
Key Measures:
- Actual recovery time vs MTD (Maximum Tolerable Downtime)
- Actual data loss vs MTDL (Maximum Tolerable Data Loss)
- Time taken to restore the minimum service capacity
Example:
| CBS | Defined MTD | Actual Recovery Time | Result |
|---|---|---|---|
| Deposit Services | 4 hours | 3.5 hours | Within tolerance |
| Payment Services | 2 hours | 2.5 hours | Breach |
| Digital Banking | 3 hours | 2 hours | Within tolerance |
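The comparison in the table above can be sketched in a few lines. The `RecoveryResult` class and its field names are hypothetical. The sketch also assumes that a recovery time exactly equal to the MTD counts as within tolerance, which an organisation may choose to define differently.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryResult:
    """One recovery event compared against the service's defined MTD."""
    service: str
    mtd: timedelta      # Maximum Tolerable Downtime
    actual: timedelta   # actual recovery time

    @property
    def breached(self) -> bool:
        # Assumption: recovering in exactly the MTD is still within tolerance.
        return self.actual > self.mtd

    @property
    def headroom(self) -> timedelta:
        # Positive means time to spare; negative means the size of the overrun.
        return self.mtd - self.actual


# Figures taken from the example table above.
results = [
    RecoveryResult("Deposit Services", timedelta(hours=4), timedelta(hours=3.5)),
    RecoveryResult("Payment Services", timedelta(hours=2), timedelta(hours=2.5)),
    RecoveryResult("Digital Banking", timedelta(hours=3), timedelta(hours=2)),
]

for r in results:
    status = "Breach" if r.breached else "Within tolerance"
    minutes = r.headroom.total_seconds() / 60
    print(f"{r.service}: {status} (headroom {minutes:+.0f} min)")
```

Recording the headroom as well as the pass/fail result shows how close a service came to its limit, which is useful when judging whether a tolerance is only narrowly achievable.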
Capacity and Throughput Metrics
- Percentage of normal transaction capacity maintained
- Volume of processed vs failed transactions
- Backlog accumulation and clearance time
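As an illustration, the three capacity measures above might be calculated as follows. The function names, the sample figures, and the simple clearance model are assumptions: "processed" is taken to mean successfully processed, and the backlog is assumed to shrink at a constant rate of processing minus new arrivals.

```python
from typing import Optional


def capacity_maintained_pct(current_rate: float, normal_rate: float) -> float:
    """Current throughput as a percentage of normal transaction capacity."""
    if normal_rate <= 0:
        raise ValueError("normal_rate must be positive")
    return 100.0 * current_rate / normal_rate


def failure_rate_pct(processed: int, failed: int) -> float:
    """Failed transactions as a percentage of all attempted transactions."""
    attempted = processed + failed
    if attempted == 0:
        return 0.0
    return 100.0 * failed / attempted


def backlog_clearance_hours(
    backlog: float, processing_per_hour: float, arrivals_per_hour: float
) -> Optional[float]:
    """Hours needed to clear the backlog, or None if it will never clear."""
    net_rate = processing_per_hour - arrivals_per_hour
    if net_rate <= 0:
        return None  # backlog is stable or still growing
    return backlog / net_rate


# Hypothetical figures for a degraded payment service.
print(capacity_maintained_pct(current_rate=600, normal_rate=1000))
print(failure_rate_pct(processed=9500, failed=500))
print(backlog_clearance_hours(12000, processing_per_hour=5000, arrivals_per_hour=3000))
```

Returning `None` when processing does not outpace arrivals makes the "backlog never clears" case explicit instead of producing a misleading negative or infinite figure.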
Customer Impact Metrics
- Number of customers affected
- Customer complaints and escalation rates
- Service response times
Key Risk Indicators (KRIs)
KRIs provide early warning signals that tolerance thresholds may be at risk.
Characteristics of Effective KRIs
KRIs should be:
- Forward-looking (predict potential issues)
- Measurable and quantifiable
- Linked to impact tolerance thresholds
- Actionable with defined triggers
Example KRIs
| KRI | Threshold | Action Trigger |
|---|---|---|
| System uptime degradation | < 98% | Investigate and escalate |
| Transaction backlog growth | > 20% increase | Activate mitigation measures |
| Third-party service latency | > 30% above baseline | Engage the vendor and monitor |
| Incident frequency | > 3 major incidents/month | Review root causes |
| Staff availability | < 80% critical roles filled | Activate contingency staffing |
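One way to make such a table actionable is to encode each KRI as data and evaluate current readings against it. The sketch below is a minimal illustration: the `KRI` class, the `evaluate` function, and the sample readings are hypothetical, while the thresholds and actions mirror the example table above.

```python
import operator
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class KRI:
    """A key risk indicator with a threshold and the action it triggers."""
    name: str
    compare: Callable[[float, float], bool]  # e.g. operator.lt or operator.gt
    threshold: float
    action: str

    def triggered(self, value: float) -> bool:
        return self.compare(value, self.threshold)


# Thresholds and actions from the example KRI table.
KRIS = [
    KRI("System uptime (%)", operator.lt, 98.0, "Investigate and escalate"),
    KRI("Transaction backlog growth (%)", operator.gt, 20.0, "Activate mitigation measures"),
    KRI("Third-party latency above baseline (%)", operator.gt, 30.0, "Engage the vendor and monitor"),
    KRI("Major incidents per month", operator.gt, 3, "Review root causes"),
    KRI("Critical roles filled (%)", operator.lt, 80.0, "Activate contingency staffing"),
]


def evaluate(readings: dict[str, float]) -> list[tuple[str, str]]:
    """Return (KRI name, action) for every KRI whose threshold is crossed."""
    return [
        (kri.name, kri.action)
        for kri in KRIS
        if kri.name in readings and kri.triggered(readings[kri.name])
    ]


# Hypothetical readings for the current period.
readings = {
    "System uptime (%)": 97.5,
    "Transaction backlog growth (%)": 12.0,
    "Major incidents per month": 4,
}
for name, action in evaluate(readings):
    print(f"{name}: {action}")
```

Keeping thresholds in data rather than scattered through monitoring scripts makes them easier to review and update when tolerances are reassessed.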
Linking KRIs to Impact Tolerance
KRIs should signal:
- Approaching tolerance limits
- Increased likelihood of disruption
- Potential cascading failures
Key Principle
KRIs enable organisations to act before tolerance is breached, not after.
Continuous Monitoring Mechanisms
Monitoring must be supported by systems and processes that provide real-time or near real-time visibility.
Monitoring Tools and Systems
- System performance dashboards
- Application monitoring tools
- Network and infrastructure monitoring
- Third-party service monitoring platforms
- Incident management systems
Operational Monitoring
Operational teams should monitor:
- Service performance against thresholds
- System health and alerts
- Transaction flows and backlog
- Customer service indicators
Management Reporting
Regular reporting should include:
| Reporting Level | Focus |
|---|---|
| Operational | Daily/real-time performance metrics |
| Management | Weekly/monthly performance trends |
| Senior Management / Board | Strategic overview, breaches, and risks |
Early Warning Systems
Organisations should implement alerts for:
- Approaching tolerance thresholds
- System degradation
- Third-party failures
- Increased incident frequency
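An alert for an approaching tolerance threshold can be sketched by tracking how much of the MTD an ongoing outage has already consumed. The function name, the status labels, and the 75% warning level below are assumptions for illustration. As a further assumption, an outage lasting exactly the MTD is not yet counted as a breach.

```python
from datetime import timedelta


def tolerance_alert(
    elapsed: timedelta, mtd: timedelta, warn_fraction: float = 0.75
) -> str:
    """Classify an ongoing outage by the share of the MTD already consumed."""
    if mtd <= timedelta(0):
        raise ValueError("mtd must be positive")
    consumed = elapsed / mtd  # dividing two timedeltas yields a float
    if consumed > 1.0:
        return "BREACH"
    if consumed >= warn_fraction:
        return "WARNING"
    return "OK"


# Hypothetical outage on a service with a 2-hour MTD.
mtd = timedelta(hours=2)
for minutes in (60, 90, 150):
    print(minutes, "min elapsed:", tolerance_alert(timedelta(minutes=minutes), mtd))
```

The warning level is a design choice: a lower fraction gives responders more time to act but generates more alerts, so it is best tuned per service based on how long escalation and failover typically take.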
Feedback Loops and Lessons Learned
Continuous improvement relies on structured feedback mechanisms.
Sources of Feedback
| Source | Insight Provided |
|---|---|
| Incident Reports | Actual disruption impact and response effectiveness |
| Scenario Testing | Performance under simulated stress conditions |
| Customer Feedback | Perceived service quality and pain points |
| Audit Findings | Governance and control weaknesses |
| Regulatory Feedback | Compliance gaps and expectations |
| Operational Metrics | Trends and performance deviations |
Lessons Learned Process
A structured approach should include:
- Capture: Document incidents, test results, and observations
- Analyse: Identify root causes and contributing factors
- Evaluate: Assess impact relative to defined tolerances
- Improve: Implement corrective actions and enhancements
- Update: Revise impact tolerances, processes, or controls if required
Example:
| Event | Lesson Learned | Improvement Action |
|---|---|---|
| Payment outage | Recovery time exceeded tolerance | Upgrade failover systems |
| Cyber incident | Detection delay | Enhance monitoring tools |
| Third-party failure | Lack of backup vendor | Establish an alternate provider |
Continuous Improvement Framework
Impact tolerance should evolve through a structured improvement cycle.
Improvement Cycle
- Define impact tolerance
- Monitor performance
- Detect deviations
- Analyse root causes
- Implement improvements
- Reassess tolerance
Key Drivers of Change
- Technology upgrades or failures
- Changes in customer behaviour
- New regulatory requirements
- Emerging risks (e.g., cyber threats, supply chain disruptions)
- Organisational changes (e.g., mergers, outsourcing)
Integration with Operational Resilience Lifecycle
Monitoring and continuous improvement support all stages of the lifecycle:
| Lifecycle Stage | Role of Monitoring |
|---|---|
| Plan | Define metrics and KRIs |
| Implement | Monitor performance against tolerance |
| Test | Validate through scenario testing |
| Improve | Refine tolerances and capabilities |
Common Challenges
| Challenge | Description |
|---|---|
| Inadequate metrics | Lack of meaningful or measurable indicators |
| Data fragmentation | Inconsistent data across systems |
| Delayed reporting | Lack of real-time visibility |
| Reactive approach | Acting only after incidents occur |
| Weak feedback loops | Lessons not translated into improvements |
Best Practices
- Define clear, measurable metrics aligned with impact tolerance
- Implement real-time monitoring and alerting systems
- Use KRIs to provide early warning signals
- Establish structured feedback and lessons learned processes
- Integrate monitoring into governance and reporting frameworks
- Regularly review and update metrics and tolerances
- Foster a culture of continuous improvement
Monitoring, metrics, and continuous improvement are essential to ensuring that impact tolerances remain relevant and effective in a changing environment. By establishing clear performance indicators, implementing robust monitoring systems, and embedding structured feedback loops, organisations can maintain visibility over their resilience capabilities and respond proactively to emerging risks.
Continuous improvement transforms impact tolerance from a static threshold into a living capability, enabling organisations to adapt, strengthen, and sustain resilience over time. Ultimately, this ensures that critical business services can be delivered consistently within acceptable limits, even in the face of evolving disruptions.




