Autonomous Self-Healing for Azure VMware Solution Private Clouds
Azure VMware Solution operates a large fleet of private clouds worldwide, each with a complete VMware NSX and vCenter Server management stack. When a VMware NSX Manager cluster loses quorum, for instance, it can trigger many related alerts. The documented consequences are not a single cascading failure: management and control-plane updates can halt, cluster health can degrade, and Edge or transport-node issues can appear, while existing Tier-0 dynamic routing typically keeps working. Many alerts may therefore stem from one underlying issue and must be checked against cluster status, service health, storage, Compute Manager state, and transport-node connectivity. Without a clear model of the relationships among these layers, the alerts look like multiple independent failures. An operator who handles each alarm separately retraces the same steps repeatedly, which extends the outage.
In a large-scale environment, NSX consistently surfaces faults faster than manual troubleshooting can work through them. To address this, the Azure VMware Solution Private Cloud Autonomous Self-Healing system is designed as a closed-loop control framework. It correlates control-plane signals against a live runtime dependency graph, applies a complete policy gate stack before any automated action, enforces mutual exclusion before execution starts, and independently verifies recovery before closing an incident. This article describes the architecture of the system and the design decisions that shaped it.
Azure VMware Solution is a VMware-validated first-party service from Microsoft that lets users create private clouds containing VMware vSphere clusters built on dedicated Azure infrastructure. Customers can keep using their existing VMware skills and tools and concentrate on developing and running their VMware-based workloads on Azure.
Below is a diagram that illustrates the architectural elements of the Azure VMware Solution.
Figure 1 – Architectural Components of Azure VMware Solution
Each component of the Azure VMware Solution has a specific role:
- Azure Subscription: Facilitates controlled access and manages budgets and quotas for the Azure VMware Solution.
- Azure Region: Physical locations worldwide where data centres are grouped into Availability Zones (AZs), and AZs are in turn grouped into regions.
- Azure Resource Group: A container that logically organizes Azure services and resources.
- Azure VMware Solution Private Cloud: Utilises VMware software, including vCenter Server, NSX software-defined networking, vSAN software-defined storage, and Azure bare-metal ESXi hosts to supply compute, networking, and storage resources. Additionally, Azure NetApp Files, Azure Elastic SAN, and Pure Cloud Block Store are supported.
- Azure VMware Solution Resource Cluster: Like the private cloud, it uses VMware software and Azure bare-metal ESXi hosts to provide compute, networking, and storage resources for customer workloads, scaling out the private cloud. Azure NetApp Files, Azure Elastic SAN, and Pure Cloud Block Store are also supported.
- VMware HCX: Offers mobility, migration, and network extension services.
- VMware Site Recovery: Automates disaster recovery and provides storage replication services via VMware vSphere Replication. Other solutions such as Zerto DR and JetStream DR are also supported.
- Dedicated Microsoft Enterprise Edge (D-MSEE): A router that connects the Azure cloud and the Azure VMware Solution private cloud instance.
- Azure Virtual Network (VNet): A private network that connects Azure services and resources together.
- Azure Route Server: Lets network appliances dynamically exchange route information with Azure networks.
- Azure Virtual Network Gateway: A gateway that securely connects Azure services and resources to other private networks using IPsec VPN, ExpressRoute, and VNet-to-VNet.
- Azure ExpressRoute: Facilitates high-speed private connections between Azure data centres and on-premises or colocation infrastructure.
- Azure Virtual WAN (vWAN): Unifies networking, security, and routing functions into a single cohesive Wide Area Network (WAN).
Table I – System-Guaranteed Properties Introduced by Autonomous Self-Healing
Capability | What Autonomous Self-Healing Does | Prior State |
Bounded, verifiable recovery time | Measures the time from the first detected signal to stable recovery being confirmed. | Incidents were closed based on action completion rather than actual recovery. |
Signal integrity at ingestion | Standardizes events, removes duplicates, and suppresses flapping before correlation. | No normalization processes existed. Engineers dealt with raw alarm streams and identified causes through pattern recognition. |
Policy-gated execution | Checks freeze windows, risk budgets, blast radius, rate limits, and approvals simultaneously before execution. | No unified gating mechanism was consistently applied to enforce limits or approvals. |
Append-only incident evidence | Keeps signals, topology, decisions, workflow traces, and verification results in a structured record. | Evidence was housed across various logs, making it challenging to replay. |
Progressive trust model | Allows for notify-only mode, giving operators the chance to review detections and proposed actions before they are enabled. | Automation was binary with no mechanism to oversee system behaviour before granting execution authority. |
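The policy-gated execution row in the table can be illustrated with a minimal sketch. It assumes each gate is a simple predicate over a proposed action and that execution is authorised only when every gate passes. The `ActionRequest` fields, the gate names, and the thresholds are hypothetical and are not taken from the actual system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionRequest:
    """Hypothetical description of a proposed automated action."""
    in_freeze_window: bool
    risk_cost: int            # risk units this action would consume
    risk_budget_left: int     # risk units still available
    blast_radius: int         # number of objects the action could affect
    max_blast_radius: int
    actions_last_hour: int
    rate_limit_per_hour: int
    approved: bool


# Each gate is a (name, predicate) pair; the predicate returns True when the gate passes.
GATES = [
    ("freeze_window", lambda r: not r.in_freeze_window),
    ("risk_budget", lambda r: r.risk_cost <= r.risk_budget_left),
    ("blast_radius", lambda r: r.blast_radius <= r.max_blast_radius),
    ("rate_limit", lambda r: r.actions_last_hour < r.rate_limit_per_hour),
    ("approval", lambda r: r.approved),
]


def evaluate_gates(request):
    """Evaluate every gate and return (authorised, names_of_failed_gates)."""
    failed = [name for name, passes in GATES if not passes(request)]
    return (not failed, failed)
```

Evaluating all gates rather than stopping at the first failure means the list of failed gates can be attached to the escalation evidence.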
The Autonomous Self-Healing system introduces seven pivotal design elements to improve operations in the Azure VMware Solution private cloud’s control plane:
- Separation of detection, decision-making, and execution into distinct planes to isolate failure areas in the control loop.
- A live runtime dependency graph that continuously updates from VMware NSX and vCenter Server event streams, replacing outdated static rule sets.
- A three-input causal correlation model, assessing evidence strength, temporal order, and dependency directionality to differentiate cause-and-effect relationships from mere coincidences.
- A pre-execution blast-radius computation as a gate input, allowing proportional limits before taking any action.
- A phase boundary model transforming event-driven fluctuations into a smooth feedback loop with hysteresis.
- Enforced execution contracts that include triggers, gate declarations, step specifications, and verification contracts to maintain valid scopes and up-to-date topologies.
- A unified append-only ledger that documents identical records for both automated and human-driven resolutions, ensuring governance and post-incident reviews.
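The hysteresis idea in the phase boundary element above can be sketched generically. The article does not specify how phase boundaries are computed, so this example assumes a single health score in [0, 1] and two hypothetical thresholds: the state degrades only below `low` and recovers only above `high`, so scores that hover between the two do not flip the state.

```python
class HysteresisState:
    """Two-threshold state tracker: degrade below `low`, recover only above `high`."""

    def __init__(self, low=0.4, high=0.7, healthy=True):
        if low >= high:
            raise ValueError("low must be strictly below high")
        self.low = low
        self.high = high
        self.healthy = healthy

    def update(self, score):
        """Feed one health score and return the resulting state (True = healthy)."""
        if self.healthy and score < self.low:
            self.healthy = False
        elif not self.healthy and score > self.high:
            self.healthy = True
        # Scores between the thresholds leave the state unchanged.
        return self.healthy
```

For example, feeding the scores 0.8, 0.5, 0.3, 0.5, 0.65, 0.75 yields one transition to degraded at 0.3 and one transition back to healthy at 0.75, rather than a flip at every crossing of a single threshold.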
For the failures within its scope, the system guarantees timely, verifiable recovery without operator intervention. Where an automated fix cannot be authorised, it generates a comprehensive evidence bundle, replacing reliance on operator memory with an organised, replayable handoff for engineers.
Autonomous Self-Healing separates detection, decision-making, and execution into planes that run independently, with testable contracts connecting them. If these functions were coupled, a bug in the execution engine could compromise evidence reliability, an alarm spike could disrupt the gate evaluator, and a policy-gate misconfiguration could impede signal normalization. Separating them keeps a failure in one plane from spreading to the others.
Detection Plane: Converts raw VMware NSX and vCenter Server alarm signals into reliable incident candidates. It standardizes event formats, collapses duplicate signals, and uses a dwell window to filter out transient state changes. Only confirmed, stable candidates pass through to the correlation model.
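The detection steps above can be sketched as follows. This is a simplified illustration, not the system's implementation: the `Event` fields are hypothetical, normalization is reduced to lower-casing the key fields, and the dwell window is modelled as "continuously active for at least `dwell_seconds`".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical raw alarm event."""
    source: str      # e.g. "nsx" or "vcenter"
    entity: str
    condition: str
    active: bool     # True = raised, False = cleared
    ts: float        # seconds


def stable_candidates(events, dwell_seconds, now):
    """Return normalized alarm keys that have stayed active for at least dwell_seconds."""
    raised_at = {}
    for ev in sorted(events, key=lambda e: e.ts):
        # Normalize so the same alarm from differently formatted events shares one key.
        key = (ev.source.lower(), ev.entity.lower(), ev.condition.lower())
        if ev.active:
            # Duplicates collapse: only the first raise time is kept.
            raised_at.setdefault(key, ev.ts)
        else:
            # A clear resets the dwell timer, which suppresses flapping alarms.
            raised_at.pop(key, None)
    return sorted(k for k, t in raised_at.items() if now - t >= dwell_seconds)
```

An alarm that clears and re-raises inside the window restarts its timer, so it becomes a candidate only once it has stayed raised for the whole dwell period.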
Decisioning Plane: Performs causal correlation against the live private cloud dependency graph before gate evaluation and generates a ranked list of root-cause hypotheses, each with a confidence score and a blast-radius estimate. It produces one of two outputs: a gated authorization to execute, or an escalation carrying a full evidence package.
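One way to picture the ranking step is the sketch below. It combines the three inputs named earlier (evidence strength, temporal order, and dependency direction) in a deliberately simple way: a candidate scores higher for every other alarmed component that depends on it and alarmed no earlier than it did. The scoring formula, the graph encoding, and the component names are assumptions made for illustration; the article does not describe the actual model.

```python
def depends_on(graph, node, target):
    """True if `node` transitively depends on `target`.

    `graph` maps each component to the list of components it depends on.
    """
    stack, seen = [node], set()
    while stack:
        current = stack.pop()
        for dep in graph.get(current, []):
            if dep == target:
                return True
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return False


def rank_root_causes(graph, alarms):
    """Rank alarmed components as root-cause candidates.

    `alarms` maps component -> (first_seen_ts, evidence_strength in [0, 1]).
    Returns (component, score) pairs, highest score first.
    """
    scores = {}
    for cand, (cand_ts, strength) in alarms.items():
        # Count alarms this candidate could explain: later (or simultaneous)
        # alarms on components that depend on the candidate.
        explained = sum(
            1
            for other, (other_ts, _) in alarms.items()
            if other != cand and cand_ts <= other_ts and depends_on(graph, other, cand)
        )
        scores[cand] = strength * (1 + explained)
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

With a hypothetical graph in which `edge-1` and `transport-node-3` both depend on `nsx-manager`, an earlier alarm on `nsx-manager` ranks first because it can explain the other two, while neither downstream alarm can explain it.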
Execution Plane: Acquires a fencing token scoped to the smallest failure domain, runs a versioned playbook, and does not close the incident until independent verification confirms stable recovery over a dwell period. Each state change is appended to the incident ledger.
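The fencing-token idea can be sketched in a few lines. This in-memory example assumes tokens are per-domain counters and that a step runs only when it presents the most recently issued token for its failure domain. The class and domain names are hypothetical, and a real deployment would need a durable, shared coordination store rather than a Python dictionary.

```python
class FailureDomainFence:
    """Issues per-domain fencing tokens and rejects steps that hold a stale one."""

    def __init__(self):
        self._latest = {}  # failure domain -> most recently issued token

    def acquire(self, domain):
        """Issue a new token for the domain; any earlier token becomes stale."""
        token = self._latest.get(domain, 0) + 1
        self._latest[domain] = token
        return token

    def execute(self, domain, token, step):
        """Run `step` only if `token` is the newest one issued for `domain`."""
        if self._latest.get(domain) != token:
            return False  # stale or unknown token: refuse to act
        step()
        return True
```

A workflow that stalled and later resumes with an old token is refused, so two remediation attempts cannot act on the same failure domain at once, while unrelated domains proceed independently.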
Figure 2 – Autonomous Self-Healing Control Loop
Autonomous Self-Healing keeps a comprehensive, append-only record for each incident, regardless of how it is resolved. The record includes five categories: raw and normalized signals with suppression results; a topology snapshot taken at detection time; a complete decision record including correlation findings, root-cause ranking, blast-radius estimates, and gate evaluations; the workflow trace with step metadata; and the verification outcomes detailing post-condition findings. Automated and human-led resolutions generate identical record structures, a consistency that governance requires.
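A minimal sketch of such a ledger follows. The five category names mirror the list above; everything else is an assumption made for illustration, in particular the hash chaining used to make tampering detectable and the `resolved_by` field used to show that automated and human-led entries share one structure.

```python
import hashlib
import json

CATEGORIES = ("signals", "topology", "decision", "workflow", "verification")
_BODY_KEYS = ("category", "payload", "resolved_by", "prev_hash")


def _digest(body):
    """Stable SHA-256 over a JSON-serializable entry body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


class IncidentLedger:
    """Append-only record; each entry is chained to the hash of the previous one."""

    def __init__(self):
        self._entries = []

    def append(self, category, payload, resolved_by):
        """Add one entry; `resolved_by` is e.g. "automation" or "operator"."""
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        prev_hash = self._entries[-1]["hash"] if self._entries else ""
        body = {
            "category": category,
            "payload": payload,
            "resolved_by": resolved_by,
            "prev_hash": prev_hash,
        }
        entry = dict(body, hash=_digest(body))
        self._entries.append(entry)
        return entry["hash"]

    def verify(self):
        """Recompute the chain; False means an entry was altered or reordered."""
        prev = ""
        for entry in self._entries:
            body = {key: entry[key] for key in _BODY_KEYS}
            if entry["prev_hash"] != prev or entry["hash"] != _digest(body):
                return False
            prev = entry["hash"]
        return True
```

Because every entry embeds the previous entry's hash, replaying the ledger in order also checks that nothing was edited after the fact.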
Figure 3 – The Incident Ledger: Audit, Replay, and Governance
Autonomous Self-Healing addresses specific NSX and vCenter control-plane failures within the Azure VMware Solution private cloud. It does not handle data-plane failures, storage issues, hypervisor crashes, hardware malfunctions, or control-plane problems outside its modelled dependency graph. Nor does it execute arbitrary scripts, bypass role-based access controls, or cross tenant boundaries. This limited scope is part of what makes the system trustworthy; attempting to manage everything would multiply its failure modes. When Autonomous Self-Healing cannot act, it produces a comprehensive evidence bundle that gives operators structured support for an effective response.
To discover more about the Azure VMware Solution, feel free to explore the link provided.
Rohan Bhosle is a Principal Software Engineering Manager at Microsoft Azure with over 19 years of experience in leading advanced technical work in hyperscale cloud networking, distributed control planes, and large-scale AI infrastructure. His expertise encompasses software-defined networking (SDN), multi-tenant isolation, policy enforcement, cloud and data centre architecture, routing, load balancing, telemetry, and reliability engineering. He has also worked extensively on the networking infrastructure needed to support next-gen AI systems.