Project Flash update: Advancing Azure Virtual Machine availability monitoring
Flash allows for the quick identification of issues stemming from the Azure platform, assisting teams in swiftly addressing infrastructure-related interruptions.
In a previous update about Project Flash as part of our Advancing Reliability blog series, we reaffirmed our dedication to supporting Azure customers in detecting and diagnosing virtual machine (VM) availability issues rapidly and accurately. This year, we’re thrilled to reveal innovative advancements that elevate VM availability monitoring—allowing our customers to manage their workloads on Azure with even more assurance. I’ve invited Yingqi (Halley) Ding, a Technical Program Manager from the Azure Core Compute team, to share the latest enhancements driving the next phase of Project Flash.
— Mark Russinovich, CTO, Deputy CISO, and Technical Fellow, Microsoft Azure.
Project Flash is a collaborative effort within Microsoft, aiming to provide precise telemetry, immediate alerts, and scalable monitoring—all packaged in a cohesive, user-friendly experience tailored to accommodate the varied observability needs concerning virtual machine (VM) availability.
Flash tackles both platform-level and user-level challenges. It facilitates the swift identification of issues that arise from the Azure platform, helping teams react promptly to infrastructure disruptions. Simultaneously, it provides you with actionable insights to diagnose and rectify issues in your own environment. This combined capability supports high availability and helps ensure your business’s Service-Level Agreements (SLAs) are consistently met. Our goal is to empower you to:
- Obtain clear visibility into disruptions, such as VM reboots, application freezes due to network driver updates, and host OS updates, complete with detailed insights about what happened, why it happened, and whether it was planned or unexpected.
- Track trends and set alerts to expedite debugging and monitor availability over time.
- Scale your monitoring efforts and create custom dashboards to maintain oversight of the health of all your resources.
- Receive automated root cause analyses (RCAs) detailing which VMs were impacted, what triggered the issue, the duration of the problem, and the actions taken to resolve it.
- Get real-time notifications for critical events—such as degraded nodes needing VM redeployment or service healing triggered by hardware problems—enabling your teams to act swiftly and reduce user impact.
- Dynamically adjust recovery policies to align with evolving workload demands and business priorities.
Over the course of the Flash journey, numerous leading companies across industries such as e-commerce, gaming, finance, and hedge funds have adopted it. This widespread use showcases Flash’s effectiveness in meeting the diverse requirements of high-profile organizations.
For BlackRock, the reliability of VMs is essential for our operations. If a VM is on deteriorating hardware, we need to be alerted swiftly, ensuring we can mitigate the problem before it affects users. With Project Flash, we are notified of a resource health event through our alerting process the moment a node in Azure is flagged as unallocatable, typically due to failing health. Our infrastructure team can then plan a migration of the affected resource to healthier hardware at the best time. This ability to proactively avert sudden VM failures has diminished our VM interruption rate and elevated the overall reliability of our investment platform.
— Eli Hamburger, Head of Infrastructure Hosting, BlackRock.
Available Solutions Today
The Flash initiative has developed into a powerful, scalable monitoring framework that addresses the varied needs of contemporary infrastructure, whether you manage just a few VMs or operate at a vast scale. Built with reliability at its foundation, Flash enables you to monitor what really matters, using the tools and telemetry that fit your architecture and operational model.
Flash publishes VM availability states and resource health annotations for comprehensive failure attribution and downtime analysis. The table below outlines the various options available, aiding you in selecting the right Flash monitoring solution for your specific scenario.
| Solution | Description |
| --- | --- |
| Azure Resource Graph (general availability) | For comprehensive investigations, centralized resource archives, and historical lookups, you can access resource availability telemetry across all workloads simultaneously using Azure Resource Graph (ARG). |
| Event Grid system topic (public preview) | To initiate immediate and critical actions, such as redeploying or restarting VMs to avert end-user disruptions, you can receive alerts seconds after significant changes in resource availability via Event Handlers in Event Grid. |
| Azure Monitor – Metrics (public preview) | For monitoring trends, aggregating platform metrics (like CPU and disk), and setting precise threshold-based alerts, you can use a readily-available VM availability metric through Azure Monitor. |
| Resource Health (general availability) | For quick and easy health checks of individual resources in the Portal UI, you can easily view the Resource Health Check (RHC) blade and see a 30-day historical health overview for effective troubleshooting. |
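As a rough illustration of the Azure Resource Graph option, the Python sketch below builds a Kusto query that lists VMs whose availability state is anything other than Available. The `HealthResources` table, the `microsoft.resourcehealth/availabilitystatuses` type, and the `availabilityState` and `targetResourceId` properties reflect our understanding of the published Resource Graph samples and should be verified against current documentation. The helper name `build_unhealthy_vm_query` is hypothetical.

```python
def build_unhealthy_vm_query(excluded_state: str = "Available") -> str:
    """Return a KQL query listing resources whose state differs from `excluded_state`."""
    # Reject quotes so the value cannot break out of the KQL string literal.
    if "'" in excluded_state:
        raise ValueError("state must not contain single quotes")
    return "\n".join([
        "HealthResources",
        "| where type =~ 'microsoft.resourcehealth/availabilitystatuses'",
        f"| where tostring(properties.availabilityState) != '{excluded_state}'",
        "| project resourceId = tolower(tostring(properties.targetResourceId)),",
        "          availabilityState = tostring(properties.availabilityState)",
    ])


if __name__ == "__main__":
    # Paste the printed query into Resource Graph Explorer, or pass it to an ARG client.
    print(build_unhealthy_vm_query())
```

Keeping the query as plain text means the same string can be used in the portal, in a script, or in a dashboard without modification.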
What’s New?
New: User vs Platform Dimension for VM Availability Metric (Public Preview)
Several customers have asked for intuitive monitoring solutions that provide real-time, scalable access to compute resource availability data, so that they can act quickly when availability changes.
To meet this demand, the VM availability metric, available out of the box through Azure Monitor, is well suited to tracking trends, aggregating platform metrics (like CPU and disk usage), and setting precise threshold-based alerts.

You can now use the Context dimension to determine whether VM availability was impacted by Azure actions or user-initiated activities. This dimension reveals if disruptions or declines in the metric were triggered by platform actions or customer activities, with possible values including Platform, Customer, or Unknown.

This new dimension is also integrated into Azure Monitor alert rules, enhancing the filtering process.
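To show how the Context dimension might be used for downtime attribution, here is a minimal Python sketch that counts unavailable samples per Context value. The input shape (dictionaries with `value` and `context` keys) and the function name `downtime_by_context` are hypothetical, and the sketch assumes the metric reports 1 when the VM is available and 0 when it is not. Adapt it to whatever your metrics export actually returns.

```python
from collections import Counter
from typing import Any, Iterable, Mapping

# Dimension values named in this post.
KNOWN_CONTEXTS = ("Platform", "Customer", "Unknown")


def downtime_by_context(points: Iterable[Mapping[str, Any]]) -> dict[str, int]:
    """Count samples where the availability value is below 1, grouped by Context."""
    counts = Counter({ctx: 0 for ctx in KNOWN_CONTEXTS})
    for point in points:
        if float(point["value"]) < 1:
            ctx = str(point.get("context", "Unknown"))
            # Fold missing or unexpected dimension values into "Unknown".
            counts[ctx if ctx in KNOWN_CONTEXTS else "Unknown"] += 1
    return dict(counts)


if __name__ == "__main__":
    samples = [
        {"value": 1, "context": "Platform"},  # available, ignored
        {"value": 0, "context": "Platform"},
        {"value": 0, "context": "Customer"},
        {"value": 0},                          # no dimension reported
    ]
    print(downtime_by_context(samples))
```

A breakdown like this makes it easier to separate platform-attributed downtime from disruptions caused by your own operations, such as user-initiated restarts.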

New: Send Health Resource Events to Azure Monitor Alerts in Event Grid (Public Preview)
Azure Event Grid is a fully managed, highly scalable Pub/Sub messaging service that supports flexible message consumption patterns. Event Grid enables you to publish and subscribe to messages to support Internet of Things (IoT) applications. Over HTTP, Event Grid lets you build event-driven solutions in which a publisher service (like Project Flash) announces its state change events to subscriber applications.

With the new integration of Azure Monitor alerts as an event handler, you can now receive immediate notifications of VM availability changes, along with detailed annotations, via SMS, email, push notifications, and more. Combining Event Grid’s near real-time delivery with Azure Monitor’s direct alerting capabilities shortens the time between an availability change and your team’s response.

To start using this feature, simply follow the step-by-step guide to begin receiving real-time alerts with Flash’s new functionalities.
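If you also route these events to a subscriber application of your own, the handler logic can stay small. The Python sketch below turns an availability-change event into a one-line alert. The event type string and the `data.resourceInfo.properties` layout are modeled on our reading of the HealthResources event schema and are assumptions to confirm against the Event Grid documentation. The function name `summarize_availability_event` and the sample resource ID are hypothetical.

```python
from typing import Any, Mapping, Optional

# Assumed event type for VM availability changes; verify before relying on it.
AVAILABILITY_CHANGED = (
    "Microsoft.ResourceNotifications.HealthResources.AvailabilityStatusChanged"
)


def summarize_availability_event(event: Mapping[str, Any]) -> Optional[str]:
    """Return an alert line for a resource that left the Available state, else None."""
    # The Event Grid schema uses "eventType"; the CloudEvents schema uses "type".
    event_type = event.get("eventType") or event.get("type")
    if event_type != AVAILABILITY_CHANGED:
        return None
    props = event.get("data", {}).get("resourceInfo", {}).get("properties", {})
    state = props.get("availabilityState")
    if state is None or state == "Available":
        return None
    previous = props.get("previousAvailabilityState", "Unknown")
    target = props.get("targetResourceId", "<unknown resource>")
    return f"{target}: {previous} -> {state}"


if __name__ == "__main__":
    sample = {
        "type": AVAILABILITY_CHANGED,
        "data": {"resourceInfo": {"properties": {
            "targetResourceId": "/subscriptions/example/virtualMachines/vm1",
            "previousAvailabilityState": "Available",
            "availabilityState": "Unavailable",
        }}},
    }
    print(summarize_availability_event(sample))
```

Returning `None` for recoveries and unrelated event types keeps the handler quiet unless a resource actually becomes unhealthy. Whether recoveries should also notify is a policy choice for your team.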
What’s Next?
Looking forward, our focus will expand to include scenarios such as malfunctioning top-of-rack switches, failures in accelerated networking, and new forms of hardware failure prediction. We also aim to continually improve data quality and consistency across all Flash endpoints—helping enhance downtime attribution and providing deeper insight into VM availability.
For thorough VM availability monitoring—covering situations such as routine maintenance, live migration, service healing, and degradation—we recommend using both Flash Health events and Scheduled Events (SE).
- Flash Health events supply real-time insights into current and past availability disruptions, including VM degradation. This supports effective downtime management and automated mitigation strategies, and it strengthens root cause analysis.
- Scheduled Events provide up to 15 minutes of advance notice prior to planned maintenance, allowing proactive decision-making and preparation. You can choose to acknowledge the event or postpone actions based on your operational readiness during this period.
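To make the Scheduled Events half of this recommendation concrete, the Python sketch below reads pending events from the Instance Metadata Service inside a VM and builds the acknowledgement body for events the workload is ready to let proceed. The endpoint, the `Metadata: true` header, and the `StartRequests` body follow the Scheduled Events documentation as we understand it. The helper names and the choice to acknowledge only `Freeze` events by default are illustrative assumptions.

```python
import json
import urllib.request
from typing import Any, Mapping

# Instance Metadata Service endpoint; reachable only from inside an Azure VM.
SCHEDULED_EVENTS_URL = (
    "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"
)


def fetch_scheduled_events(timeout: float = 5.0) -> dict:
    """Fetch the current Scheduled Events document for this VM."""
    request = urllib.request.Request(
        SCHEDULED_EVENTS_URL, headers={"Metadata": "true"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def build_start_requests(
    document: Mapping[str, Any],
    ready_types: frozenset = frozenset({"Freeze"}),
) -> dict:
    """Build the body that acknowledges still-pending events of the given types."""
    event_ids = [
        event["EventId"]
        for event in document.get("Events", [])
        if event.get("EventStatus") == "Scheduled"
        and event.get("EventType") in ready_types
    ]
    return {"StartRequests": [{"EventId": event_id} for event_id in event_ids]}


def acknowledge(body: Mapping[str, Any], timeout: float = 5.0) -> None:
    """POST the acknowledgement so the platform may start the events early."""
    request = urllib.request.Request(
        SCHEDULED_EVENTS_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Metadata": "true"},
    )
    urllib.request.urlopen(request, timeout=timeout).close()
```

A polling loop would call `fetch_scheduled_events`, drain or checkpoint the workload, and then pass the result of `build_start_requests` to `acknowledge`. Events left unacknowledged simply proceed once their notice period ends.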
For future updates regarding the Flash initiative, we invite you to follow our Advancing Reliability series!