The Azure Well-Architected Framework (WAF) regards Failure Mode Analysis (FMA) as a fundamental necessity within the Reliability pillar (RE:03) rather than an advanced concept. The reasoning is straightforward: failures can occur, no matter how resilient your architecture may seem. More intricate environments are susceptible to a greater variety of failure types. FMA provides a systematic approach to identify these failure points in your critical workflows, to understand the potential impact of each failure, and to make informed decisions regarding mitigation before an outage forces those decisions upon you.
This article offers a comprehensive guide on how to execute FMA, detailing the five-step methodology, the eight categories of failure modes to examine for each component, the dependency considerations crucial to shaping your architecture, and the final documentation you will create.
Understanding FMA: Its Purpose and Limitations
FMA is a methodical approach aimed at pinpointing possible failure points in crucial workflows, coupled with tailored mitigation strategies for each. It is not merely a list of improbable risks or a checklist to be completed once prior to launch. Moreover, it should not be conflated with chaos engineering, despite their complementary nature.
According to the Azure WAF, a failure is defined as an unexpected event that disrupts the normal functioning of a component. For instance, a hardware fault that leads to a network partition is classified as a failure, as is a misconfigured routing rule that causes 30% of requests to be dropped. These instances differ from errors, which are anticipated issues managed through business logic, such as input validation failures, transient HTTP 429 responses, or null checks. This distinction is critical; failures necessitate architectural alterations, while errors are remedied through coding adjustments.
Core Principle of FMA
Failures occur regardless of the layers of resilience in place. More sophisticated environments face a broader spectrum of potential failures. FMA does not assume that all failures can be averted; rather, it recognises the necessity of understanding, ahead of time, which failures will disrupt specific flows, the scope of their impact, and strategies for addressing them.
The culmination of FMA is the documentation of decisions made: which failure modes have mitigations in place, which have been accepted as low-probability risks, and which flows remain vulnerable due to cost or complexity constraints. It is vital to make these choices consciously.
The Five-Step FMA Methodology
Step 1: Map and Prioritise Critical Flows
FMA is centred around workflows, not individual components. Before delving into any components, you must create a detailed map of user and system flows relevant to your workload, sorted by their importance. For example, the flow for user sign-in in a SaaS application generally takes precedence over a monthly invoice generation flow. The criticality of each flow dictates the level of investment that mitigation should receive. It’s assumed that this flow mapping is completed before commencing FMA; if not, that should be your initial step.
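For teams that like to keep this inventory in code, the output of this step can be as simple as a sorted list; the flow names, owners, and criticality scale below are hypothetical, a minimal sketch rather than a prescribed format.

```python
# Hypothetical flow inventory; criticality 1 = most critical.
flows = [
    {"name": "user-sign-in", "criticality": 1, "owner": "identity-team"},
    {"name": "checkout", "criticality": 1, "owner": "commerce-team"},
    {"name": "product-search", "criticality": 2, "owner": "catalog-team"},
    {"name": "monthly-invoice-generation", "criticality": 3, "owner": "billing-team"},
]

# Work the FMA from the most critical flow downwards.
for flow in sorted(flows, key=lambda f: f["criticality"]):
    print(f"{flow['criticality']}: {flow['name']} (owner: {flow['owner']})")
```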
Step 2: Identify the Components in Each Flow
For each flow, determine the distinct components involved. These usually include: ingress control, networking, computing resources, data and storage solutions, supportive services (like authentication, messaging, or key management), and egress control. At the design phase, you might not know the exact services yet, and that’s acceptable. The aim is to create a component map through which each flow can be traced step by step.
Step 3: Catalogue Dependencies and Their Reliability Characteristics
Once your component map is ready, identify every dependency each component has—both internal (like internal APIs or Azure Key Vault) and external (such as Microsoft Entra ID or Azure ExpressRoute). For each dependency, gather its reliability metrics: availability SLA, scaling capacities, and any documented failover behaviour. This step uncovers hidden single points of failure.
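A lightweight way to keep this catalogue queryable is a per-flow dependency map recording the type, SLA, and documented failover behaviour you gather. The service names and SLA figures below are placeholders, not authoritative values; always confirm against the current Azure SLA documents.

```python
# Illustrative dependency catalogue for a hypothetical sign-in flow.
dependencies = {
    "sign-in": [
        {"name": "Microsoft Entra ID", "type": "external", "sla": 0.9999, "failover": "managed by Microsoft"},
        {"name": "Azure Key Vault",    "type": "internal", "sla": 0.9995, "failover": "paired-region replication"},
        {"name": "profile-api",        "type": "internal", "sla": 0.999,  "failover": None},  # nothing documented yet
    ],
}

# A dependency with no documented failover behaviour is a candidate
# single point of failure worth flagging during this step.
for dep in dependencies["sign-in"]:
    if not dep["failover"]:
        print(f"Potential single point of failure: {dep['name']}")
```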
Step 4: Analyse Failure Modes for Each Component
Examine each flow in detail, evaluating how each component and its dependencies may be impacted by different classes of failure. Document what fails, what degrades, and what continues to function with each failure mode. Notably, analyse read and write failures separately. For instance, a database that can still accept reads during a storage issue has a different impact than one that is entirely offline. Several failure modes can affect the same component at once.
Step 5: Define Mitigation and Detection
For each identified failure mode you decide to address, outline your mitigation strategy—whether that’s enhancing resilience (through redundancy, zone distribution, or regional failover) or implementing graceful degradation (like rerouting flows or disabling non-critical features). Also, establish how to detect these failures, specifying which metrics will breach thresholds, which alerts will be triggered, and how the on-call procedures will operate. Without detection, your mitigation plan is ineffective.
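As a sketch of pairing each mitigation with a detection rule, the snippet below maps a failure mode to a metric, a threshold, and an alert action. The metric names and thresholds are hypothetical, and get_metric is a stand-in for whatever monitoring query your platform exposes (for example an Azure Monitor metrics query).

```python
# Minimal sketch: every failure mode you decide to mitigate gets a paired
# detection rule. Metric names and thresholds are hypothetical.
detection_rules = [
    {"failure_mode": "component overload", "metric": "http_5xx_rate", "threshold": 0.05, "alert": "page on-call"},
    {"failure_mode": "zone outage", "metric": "healthy_instance_pct", "threshold": 0.50, "alert": "page on-call", "direction": "below"},
]

def get_metric(name: str) -> float:
    """Placeholder for a real monitoring query (e.g. an Azure Monitor metrics call)."""
    return 0.02

def evaluate(rule: dict) -> bool:
    value = get_metric(rule["metric"])
    # Breach direction defaults to "above threshold" unless the rule says otherwise.
    breached = value < rule["threshold"] if rule.get("direction") == "below" else value > rule["threshold"]
    if breached:
        print(f"ALERT ({rule['alert']}): {rule['failure_mode']} suspected, {rule['metric']}={value}")
    return breached

for rule in detection_rules:
    evaluate(rule)
```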
The Catalogue of Failure Modes
The WAF highlights eight categories of failure modes that each component should be assessed against. Gaining insight into the profile of each failure helps you determine which mitigations are worth implementing.
Regional Outage
An entire Azure region becomes inaccessible. This typically necessitates cross-region architecture (whether active-active or active-passive) to withstand such an event. This mitigation is the most expensive to implement, and a full regional outage is also the least likely of the failure scenarios. For most workloads, simply documenting this exposure and accepting it is adequate unless your SLA stipulates otherwise.
Availability Zone Outage
One Availability Zone within a region fails. Zone-redundant deployment across the compute, data, and networking tiers is the standard mitigation. This is less costly than cross-region strategies and addresses a more probable failure scenario. If your services lack zone redundancy, your FMA documentation should include either a clear mitigation or an explicit acceptance of the risk.
Service Outage
One or more Azure services become unavailable. Redundancy at the same or an alternate tier is essential, or graceful degradation should be considered if the service is not critical to all flows. Document the RTO and RPO impacts and check if your monitoring picks this up before customers experience issues.
DDoS or Malicious Attack
Azure handles Layer 3/4 DDoS attacks automatically. However, Layer 7 attacks are your responsibility. Leveraging Azure Front Door alongside the Azure Web Application Firewall can mitigate most of these threats, but it’s crucial to validate the WAF policy settings, rate limiting procedures, and bot protection configurations in your FMA. Don’t assume protection is in place without verification.
Misconfiguration
This is one of the more common and preventable failure modes. Changes, such as a routing rule modification or a certificate renewal, can inadvertently cause outages. Mitigation strategies include implementing infrastructure-as-code practices with automated validation, deployment gating, and rollback capabilities. For configurations managed externally, develop a process to catch issues before they reach production.
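As one illustration of automated validation, a pre-deployment gate can assert simple invariants about a proposed configuration before it is applied. The file format, field names, and rules below are entirely hypothetical; the point is that the check runs in the pipeline, before the change reaches production.

```python
import json
import sys
from datetime import datetime, timezone

# Hypothetical pre-deployment gate: fail the pipeline if the proposed routing
# configuration violates simple invariants.
def validate_routing_config(path: str) -> list[str]:
    errors = []
    with open(path) as f:
        config = json.load(f)

    if not config.get("routes"):
        errors.append("no routes defined; deployment would drop all traffic")

    for route in config.get("routes", []):
        if route.get("backend_pool") is None:
            errors.append(f"route '{route.get('name')}' has no backend pool")

    # Expects an ISO 8601 timestamp with offset, e.g. "2026-01-01T00:00:00+00:00".
    cert_expiry = datetime.fromisoformat(config["tls"]["certificate_expiry"])
    if cert_expiry < datetime.now(timezone.utc):
        errors.append("TLS certificate in config is already expired")

    return errors

if __name__ == "__main__":
    problems = validate_routing_config(sys.argv[1])
    if problems:
        print("\n".join(problems))
        sys.exit(1)  # gate the deployment
```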
Operator Error
Human errors during routine operations or maintenance can lead to outages. Mitigation approaches include using Privileged Identity Management with temporary access, ensuring RBAC is scoped to essential permissions, adopting change management practices, and providing runbooks to help prevent common mistakes. Additionally, consider what your runbook instructs an engineer to do if they misinterpret an alert.
Planned Maintenance Outage
Scheduled maintenance windows that necessitate downtime are expected. Azure does have maintenance windows for some services. For your own components, employing blue-green deployments, rolling updates, and zero-downtime release strategies is advisable. This failure mode should be entirely preventable with the right deployment architecture.
Component Overload
When a component hits its scaling threshold or resource limit, it can lead to failure. Mitigation strategies include configuring autoscaling, performing load testing to ascertain limits before production, implementing circuit breakers within application code, and applying throttling policies on downstream dependencies. Overload failures often cascade; one slow component can cause timeouts that can overwhelm others.
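One concrete form of throttling towards a downstream dependency is capping the number of in-flight calls, so a saturated component sheds load instead of exhausting the caller's threads or connection pool. The sketch below assumes the requests library and a hypothetical concurrency limit; it illustrates the idea, not a production-ready policy.

```python
import threading

import requests  # assumed available; any HTTP client with timeouts works

# Cap in-flight calls to the downstream dependency. Limit and URL are hypothetical.
MAX_IN_FLIGHT = 20
_slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)

def call_downstream(url: str, timeout_seconds: float = 2.0):
    # Fail fast instead of queueing when the dependency is already saturated.
    if not _slots.acquire(blocking=False):
        raise RuntimeError("downstream dependency saturated; shedding load")
    try:
        return requests.get(url, timeout=timeout_seconds)
    finally:
        _slots.release()
```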
Separately Examine Read and Write Failures
A data service capable of handling reads during a storage issue has a markedly different impact profile than one that is completely down. Some workflows necessitate write capabilities (like checkout or form submission), while others may only require reads (for search, dashboards, and content display). Distinguishing these in your analysis often reveals that the impact radius is either smaller or larger than initially assumed.
Strong vs. Weak Dependencies: A Decision with Lasting Impact on Your Architecture
Once your dependencies are catalogued, you need to classify them as strong or weak. This categorisation isn’t just academic; it influences the mitigation budget for that dependency and determines whether its SLA must align with yours.
Strong Dependencies
These components are essential for the workload’s operation. If any strong dependency is absent or degraded, the workflow fails or the workload becomes unavailable.
- Microsoft Entra ID for authentication flows
- Azure SQL for transaction processing
- Azure Key Vault for accessing secrets during startup
- Internal APIs involved in every user-facing call
Implication: The availability and recovery expectations of strong dependencies need to correspond with the workload’s targets. For instance, if your workload’s SLA is 99.9% and a strong dependency operates at 99.5%, the weaker dependency caps your composite availability below your own target, as the sketch below illustrates.
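Because serial dependencies compose multiplicatively, the composite ceiling is the product of the individual SLAs. The figures below are placeholders rather than published SLA values.

```python
# Placeholder SLA figures; substitute the published SLAs of your actual strong dependencies.
workload_target = 0.999          # 99.9% promised to customers
strong_dependency_slas = [0.9995, 0.995, 0.9999]

# For dependencies in series, the composite ceiling is the product of their SLAs.
composite_ceiling = 1.0
for sla in strong_dependency_slas:
    composite_ceiling *= sla

print(f"Composite ceiling: {composite_ceiling:.4%}")                     # ~99.44%
print(f"Meets workload target? {composite_ceiling >= workload_target}")  # False
```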
Weak Dependencies
These components may degrade certain features but do not disrupt primary workflows or render the workload unavailable.
- A recommendation engine—purchases can still occur without it
- An analytics event pipeline—transactions can be completed even if the event is lost
- A third-party enrichment API—data can revert to defaults if it times out
- Non-essential notification systems
Implication: Weak dependencies should be stabilised with timeouts and circuit breakers, and they should incorporate graceful fallback behaviour. Aim to minimise coupling so that a weak dependency’s failure does not spill over into a strong dependency.
The classification of a dependency can vary between workflows. For instance, Microsoft Entra ID acts as a strong dependency for the sign-in process but may be a weak dependency for an anonymous product search. Document dependencies specifically for each flow, rather than trying to apply a one-size-fits-all approach.
Be Aware of Accidental Strong Dependencies
A weak dependency that lacks a timeout can easily morph into a strong dependency under load. If your application waits indefinitely for a recommendation engine that is down, it effectively becomes a strong dependency in that instance, regardless of your initial design documentation. Common failure paths include thread exhaustion and connection pool depletion. Implementing circuit breakers is crucial for any component that could be slow to respond.
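A minimal sketch of keeping a weak dependency weak: bound every call with a short timeout and return a fallback instead of waiting. The recommendation endpoint and default content are hypothetical, and a fuller version would add a circuit breaker that stops calling the dependency altogether after repeated failures.

```python
import requests  # assumed available; any HTTP client with timeouts works

FALLBACK_RECOMMENDATIONS = ["best-sellers"]  # hypothetical default content

def get_recommendations(user_id: str) -> list[str]:
    """Weak dependency: never let it block the purchase flow."""
    try:
        # The hard timeout is what prevents this from becoming an accidental
        # strong dependency when the recommendation engine is slow or down.
        response = requests.get(
            f"https://recs.example.internal/users/{user_id}",  # hypothetical endpoint
            timeout=0.5,
        )
        response.raise_for_status()
        return response.json()["items"]
    except (requests.RequestException, KeyError, ValueError):
        # Graceful degradation: fall back to static content and carry on.
        return FALLBACK_RECOMMENDATIONS
```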
The FMA Document: What You Will Ultimately Produce
The outcome of FMA is a table encapsulating each component, the failure mode being assessed, the likelihood of that failure, the effect on each workflow, the mitigation strategies in place, and the classification of any potential outage. This document is dynamic—it begins as hypothetical planning and evolves over time through chaos testing and real-world incidents.
Below is a sample table based on an e-commerce architecture as outlined in the Azure WAF documentation. This example features an application operating on Azure App Service with Azure SQL databases, fronted by Azure Front Door, and employing Microsoft Entra ID for authentication.
| Component | Failure Mode | Likelihood | Effect and Mitigation | Outage Scope |
|---|---|---|---|---|
| Microsoft Entra ID | Service outage | Low | Complete workload downtime for authenticated users. No mitigation other than Microsoft’s response. Ensure RTO expectations align with Entra SLA. | Full |
| Microsoft Entra ID | Misconfiguration | Medium | Users unable to log in, but no downstream data implications. Application manages auth exceptions and presents a clear error. Help desk escalation triggers a review by development. | External only |
| Azure Front Door | Service outage | Low | Complete outage for external users, with no internal bypass. Reliant on Microsoft for remediation. Ensure Azure Service Health alerts are set up for AFD degradation. | External only |
| Azure Front Door | Regional outage | Very low | Minimal impact. AFD is a global service; traffic is rerouted automatically to unaffected regions. No action required from the workload team. | None |
| Azure Front Door | DDoS attack (L7) | Medium | L3/L4 DDoS managed by Microsoft; L7 attacks are mitigated through WAF policy settings—rate limits, bot protection, and custom rules must be regularly reviewed. Potential for brief degradation if WAF rules are not current. | Potential partial |
| Azure SQL | Service outage | Low | Complete outage for all transactional flows. Read-only flows may function if a read replica is in place. Reliant on Microsoft for resolution. | Full |
| Azure SQL | Regional outage | Very low | Auto-failover group set up for the secondary region. Brief outage anticipated during failover. RTO and RPO must be validated through controlled failover tests. Failover is automated; manual intervention is not necessary. | Potential full |
| Azure SQL | Availability zone outage | Low | No impact expected. Zone-redundant configuration in place. Automatic failover occurs within the region. No action required. | None |
| App Service | Regional outage | Very low | Minimal impact. Azure Front Door reroutes traffic to non-affected instances. Increased latency for users in the impacted region. No data loss anticipated if the SQL failover is completed within the RPO timeframe. | None |
| App Service | Component overload | Medium | Autoscaling configured with scale-out rules triggered by CPU load and request queue depth. Load testing confirms scale-out completion within 3 minutes at double the peak load. Circuit breakers in application code prevent SQL connection pool depletion during overload. | Potential partial |
Prioritise Before Comprehensive Documentation
Completing an FMA for a complex workload generates extensive documentation. Before devoting resources to mitigation strategies for every item, prioritise based on severity and likelihood. Regional outages can often be documented and accepted as low-probability risks, while common issues like misconfiguration and operator error, which carry medium likelihood and are fully preventable, warrant much greater focus in most commercial environments.
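One lightweight way to triage the resulting rows is a likelihood-times-impact score. The scales below are arbitrary and exist only to force a ranking conversation, not to be scientific.

```python
# Arbitrary 1-3 scales; the point is to rank the work, not to be precise.
LIKELIHOOD = {"very low": 1, "low": 2, "medium": 3}
IMPACT = {"none": 0, "partial": 2, "full": 3}

rows = [
    {"component": "Azure SQL",   "failure_mode": "regional outage",    "likelihood": "very low", "impact": "full"},
    {"component": "App Service", "failure_mode": "component overload", "likelihood": "medium",   "impact": "partial"},
    {"component": "Entra ID",    "failure_mode": "misconfiguration",   "likelihood": "medium",   "impact": "partial"},
]

# Highest score first: the medium-likelihood, preventable items rise to the top.
for row in sorted(rows, key=lambda r: LIKELIHOOD[r["likelihood"]] * IMPACT[r["impact"]], reverse=True):
    score = LIKELIHOOD[row["likelihood"]] * IMPACT[row["impact"]]
    print(f"{score}: {row['component']} / {row['failure_mode']}")
```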
Tools in Azure Supporting FMA Work
FMA is a design-time initiative, but supporting tools span design, testing, and ongoing operations: Azure Chaos Studio can inject the failure modes you have documented to verify that mitigations hold, Azure Service Health surfaces platform-side outages and planned maintenance, and Azure Monitor provides the metrics and alerts that make detection possible.
A robust FMA practice not only shields your workloads but also protects your customers, your reputation, and your business continuity objectives.
FMA Is Not Just a One-Time Activity
A prevalent error regarding FMA is the belief that it is a gate that can be passed through before going live, then subsequently ignored. This document becomes outdated the moment any architectural alterations are made—an occurrence common in active workloads.
FMA should be revisited whenever significant architectural changes happen—whether that’s introducing a new service, onboarding a new external dependency, or deploying into a new region. It should also be reviewed after any incident, checking if the failure mode was previously documented (and if the mitigation held) or if it revealed a gap that needs addressing. Regular Chaos Studio experiments transform the FMA from a theoretical approach into a continually verified commitment.
Initial Steps for Those New to FMA
Select your most critical user flow. Map every relevant component it interacts with. For each component, explore the eight failure modes and pose questions: what breaks, who detects it first, and what existing mitigation is in place? Document your findings thoroughly. It’s likely that you will uncover at least one single point of failure that nobody had considered. This discovery alone justifies the FMA process and provides a solid basis for expanding the analysis to the rest of the workload.
Final Thoughts
FMA is one of those methodologies that can appear burdensome until its benefits become clear. When an incident does occur, the payoff is substantial: the team is not scrambling for answers, the impact radius has already been established, the mitigations are in place, and detection happens before customers are affected. That is the difference between a reactive scramble and a structured operational response.
The WAF perspective is spot on: failures crop up regardless of apparent system resilience. FMA equips you with the capability to pre-emptively decide which failures you’ve accounted for, which are deemed tolerable risks, and which workflows are permitted to degrade gracefully against those that must remain fully operational. This is not merely busywork; it is foundational architecture.
Teams that forgo this process often find themselves in discussions post-incident, recalling, “We should have identified this.” Conversely, teams proficient in FMA deftly sidestep such meetings entirely.