How we build Azure SRE Agent with agentic workflows
Microsoft runs mission-critical systems that are always on and incredibly vast. With thousands of services, millions of deployments, and continuous changes, modern cloud engineering is a reality. These massive systems support organisations worldwide—including our own—and there’s virtually no tolerance for downtime. While tasks like investigating incidents, responding to alerts, and fixing issues are crucial, they can be disruptive to innovation.
For engineers, much of this operational work means stepping back from developing new features to handle alerts, analyse logs, connect metrics from various systems, or respond to issues at any time. Being on-call and conducting manual investigations can slow down teams and lead to burnout. In today’s AI-driven world, the demand for operational excellence has surged. It’s become evident that conventional human-only methods can’t keep up with the scale and complexity required for system upkeep, especially when the speed of code deployment has skyrocketed.
Simultaneously, we faced the challenge of integrating with the rapidly changing AI landscape. New models, tools, and best practices are constantly emerging, which can fragment ecosystems across various platforms for observability, DevOps, incident management, and security. Rather than just automating tasks, we aimed to create a flexible method that seamlessly integrates with existing systems and evolves over time.
To tackle these issues, Microsoft developed the Azure SRE Agent, an AI-driven operations agent designed to be an always-available partner for engineers. In practice, the Azure SRE Agent continuously monitors production environments to identify and explore incidents. It analyses signals like logs, metrics, code changes, and deployment records to perform root cause analysis. It supports engineers from triage to resolution and operates at various levels of autonomy—from assisting with investigations to automating solutions. All actions are conducted within governance frameworks and human approval processes, supported by role-based access controls and clear escalation paths. Importantly, the Azure SRE Agent learns from past incidents, outcomes, and human feedback to enhance its capabilities over time.
Azure SRE Agent was developed using the agentic workflow model—creating agents that work alongside other agents. Instead of viewing AI as an add-on tool, Microsoft embedded specialised agents throughout the software development life cycle (SDLC) to collaborate with developers from planning to operations.
The diagram above illustrates the agents used throughout the development process, coming together to create an integrated lifecycle:
- Plan & Code: Agents facilitate spec-driven development, enabling quicker cycles for developers and product managers. With AI, we can draft specification documents that outline requirements for user experience and software development agents, create prototypes, and check code into staging, allowing teams to rapidly iterate and refine code even in early stages.
- Verify, Test & Deploy: Agents for code quality, security evaluations, and deployment work together to proactively address quality and security concerns. They also consistently monitor reliability, ensure performance, and maintain best practices for releases.
- Operate & Optimize: The Azure SRE Agent manages ongoing operational tasks—from investigating alerts to aiding in remediation and resolving certain issues autonomously. Moreover, it continually learns and has its own specialised instance of the Azure SRE Agent to support maintenance and encourage feedback loops.
While agents provide insights, suggest actions, and autonomously mitigate issues, humans remain involved for oversight, approval, and decision-making when needed. This blend of autonomy and governance is vital for safe operations at scale. Additionally, we designed the Azure SRE Agent to integrate seamlessly with existing systems, employing custom agents, the Model Context Protocol (MCP), Python tools, telemetry connections, incident management platforms, code repositories, knowledge sources, and operational tools to enhance workflows without replacing them.
In this way, the Azure SRE Agent isn’t merely a new tool, but a breakthrough operational system. At Microsoft’s scale, transformative systems can lead to remarkable outcomes.
The tangible benefits of the Azure SRE Agent are evident in everyday operations. By automating incident investigations and supporting remediation efforts, the agent eases the load on on-call engineers and speeds up resolution times.
Internally at Microsoft over the past nine months, we’ve seen:
- 35,000+ incidents managed autonomously by the Azure SRE Agent.
- 50,000+ developer hours saved by reducing manual investigations and response time.
- Teams have experienced a lower on-call burden and quicker mitigation times during incidents.
Two noteworthy examples of success with the Azure SRE Agent come from the Azure Container Apps and Azure App Service product teams. Engineers from Azure Container Apps reported an exceptional 89% positive feedback on root cause analyses generated by the agent, which addressed over 90% of incidents. Meanwhile, the Azure App Service team has decreased their time-to-mitigation for live-site incidents to just 3 minutes, a remarkable drop from the previous average of 40.5 hours when relying solely on human efforts.
When asked about the changes brought by the agent, one of our engineers remarked:
“The agent has been a tremendous help with quota requests, which were initially handled manually. I’m confident that there have been numerous recommended corrective actions that were spot-on and provided helpful clues, streamlining my initial investigation. Instead of exhausting time exploring various possibilities, the agent has already narrowed it down, allowing me to pick up efficiently and save countless hours of log checking.”
– Software Engineer II, Microsoft Engineering
Beyond the advantages of the agent itself, the agentic workflow has redefined our entire development approach.
It’s easy to simplify agents as just another layered automation, but it’s crucial to recognise that the Azure SRE Agent also serves as a collaborative tool. Engineers can utilise the agent during investigations to reveal pertinent context (logs, metrics, and related code changes), allowing for quicker and easier troubleshooting than traditional methods. Furthermore, they extend it for data analysis and dashboard creation, enabling engineers to focus on the agent’s insights, whether to approve actions or intervene as needed. This results in a partnership between humans and AI that enhances operational expertise without losing control.
Though this process required time and experimentation to perfect, the results have been remarkable; our team is producing high-quality features more swiftly than ever since the introduction of specialised agents at each stage of the SDLC. While these achievements occurred within Microsoft, the underlying principles apply broadly.
Firstly, building agents with agents is vital for scaling, as manual development quickly creates bottlenecks; agents significantly fast-track inner loop iterations through code generation, review, debugging, and security fixes. Regarding agents, specialisation is key, as generic agents reach a plateau quickly. The notable impact arises from agents equipped with domain-specific skills, context, and the right tools and data access.
Secondly, Microsoft learned the importance of deep integration with existing systems, embedding agents within established telemetry, workflows, and platforms instead of attempting to replace them. Throughout this journey, maintaining close human governance has proven essential. Balancing autonomy with clear approval boundaries, role-based access, and safety checks is crucial for building trust.
Lastly, teams have discovered the value of investing in continuous feedback and assessment, using ongoing metrics to refine agents and identify where automation provides real benefits versus where human judgement remains indispensable.
The Azure SRE Agent exemplifies how agentic workflows can revolutionise both product development and large-scale operations. Teams at Microsoft are committed to leading the industry by example and encourage you to apply these practical lessons in your own environments.
FAQs
- What is the Azure SRE Agent? It is an AI-powered operational agent that monitors production environments and helps engineers quickly address incidents.
- How does the Azure SRE Agent improve operational efficiency? By automating incident investigations and remediation, it reduces the burden on on-call engineers and accelerates resolution times.
- What are the benefits of using agentic workflows? They enhance collaboration, streamline troubleshooting, and allow for a more efficient development process.
- Can the Azure SRE Agent integrate with existing systems? Yes, it is designed to work seamlessly with established workflows and systems without replacing them.
Share this content:
Discover more from Qureshi
Subscribe to get the latest posts sent to your email.