From Copilots to Coworkers: How AI Agents Are Transforming Azure Networking Operations
Microsoft’s Customer Zero blog series offers a behind-the-scenes look at how Microsoft uses its own enterprise-grade IQ platform. Discover best practices from our engineering teams, along with real-world insights, architectural designs, and operational strategies for building, running, and scaling AI applications and agent fleets across our organisation.
Azure boasts one of the largest physical networks globally. This vast scale not only influences infrastructure choices but also transforms how operational tasks are organised.
With hundreds of thousands of kilometres of external fibre and over a million optical devices, we connect data centres and regions while supporting Microsoft’s global services. Every interaction a customer has with Azure hinges on this physical network functioning reliably and quickly.
As the network expands, the nature of operational challenges has evolved. While detection, monitoring, and traffic rerouting have become highly automated and efficient, the real difficulty lies in what follows: coordinating physical repairs, tracking progress across various systems and vendors, validating outcomes, and ensuring workflows continue until issues are fully resolved.
In earlier operational models, the demand for coordination outpaced teams’ ability to adapt effectively. The constraint was no longer about routing or processing signals; it became about the amount of human focus required to maintain alignment across distributed tasks over extended periods.
This is where much of the operational effort accumulates. When issues arise, the overhead increases as more specialists need to come together:
- Coordinating field operations, hardware replacements, and incident resolution can become more time-consuming, especially when multiple companies and regions are involved.
- Engineers often find themselves waiting for updates, following up on tasks, validating fixes, and interpreting information across various systems.
Unlike past scale challenges that could be tackled through automation as code, the “messy middle” of operations is inherently unpredictable. It’s characterised by subjective decision-making, incomplete information, and asynchronous dependencies. At Azure’s scale, effective coordination becomes the critical bottleneck.
Rather than simply adding more scripts or expanding fragile automation, we rethought how coordination is approached by integrating AI agents as essential participants in our daily operations. Initially, we viewed agents purely as tools, but we’ve advanced to embed them within the system itself.
This transformation wasn’t instantaneous; we had to adapt our approach gradually:
- We kicked off with conversational copilots that enabled engineers and technicians to check device status and telemetry using natural language, making troubleshooting smoother.
- Over time, we introduced autonomous workflow agents that can take initiative and manage specific operational processes all the way through.
Autonomous workflow agents function like digital teammates, working alongside more than 10,000 employees such as data centre technicians, network engineers, and hardware engineers. These agents are assigned specific goals and can carry context over hours or even days to ensure tasks, from fibre repairs to orchestrating data centre deployment, reach completion. Essentially, they’re execution engines designed to reduce the cognitive load on humans, deferring to humans only for high-risk decisions.
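As a rough sketch of this idea (the names and structure here are illustrative, not Azure's internal API), such an agent can be modelled as a goal plus a durable context log that survives across hours or days of asynchronous updates:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class WorkflowAgent:
    """A long-lived agent bound to a single operational goal."""
    goal: str                                     # e.g. "restore fibre span SEA-01"
    context: list = field(default_factory=list)   # durable event log

    def observe(self, event: str) -> None:
        # Persist every update so the agent can resume after hours or days.
        self.context.append((datetime.now(timezone.utc), event))

    def is_done(self, success_signal: str) -> bool:
        # The goal is complete once a success signal appears in the log.
        return any(success_signal in event for _, event in self.context)

agent = WorkflowAgent(goal="restore fibre span SEA-01")
agent.observe("vendor dispatched technician")
agent.observe("telemetry: light levels restored")
print(agent.is_done("light levels restored"))  # True
```

The key property is that the context outlives any single interaction, so the agent can pick a task back up without a human re-establishing state.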
Agents operate alongside engineers and technicians in operational channels, including ticket queues, telemetry systems, Teams, and email. This integration keeps them closely engaged in the same workflows. As we refine processes and collect feedback, we continuously manage and curate knowledge banks for the agents to rely on during daily operations. We organise operational data, runbooks, and institutional knowledge into coherent segments so agents can act on them with greater consistency. Coupled with Work IQ and Fabric IQ, these knowledge banks ground agent responses in organisational context as tasks progress.
Our autonomous workflow agents are structured within an agent organisation, functioning akin to digital colleagues, and governed by an internal control system based on defined identity, roles, skills, policies, and auditability.
Roles and permissions for agents vary by class and risk level, rather than being uniform. However, agent permissions never override human accountability and oversight, which remain paramount. Humans continue to define goals, policies, and success criteria. Any high-risk or irreversible changes require explicit authorisation from a human expert. If agents encounter ambiguity or edge cases, they escalate the issue for a human decision rather than guessing. Ultimately, agents operate within the framework we set, ensuring humans maintain control over critical actions affecting vital components. Our experts still play a crucial role in policy decisions and system design, now with a sharper focus and reduced distractions.
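A minimal sketch of this gating logic (the risk tiers and function names are assumptions for illustration): low-risk actions proceed autonomously, while high-risk or irreversible ones block on explicit human authorisation rather than guessing.

```python
from enum import Enum

class Risk(Enum):
    LOW = 1    # e.g. requesting a vendor status update
    HIGH = 2   # e.g. anything irreversible or production-affecting

def execute(action: str, risk: Risk, human_approved: bool = False) -> str:
    """Apply low-risk actions autonomously; gate high-risk ones on a human."""
    if risk is Risk.HIGH and not human_approved:
        # Escalate instead of acting: the agent never overrides human oversight.
        return f"ESCALATED: {action!r} awaiting human authorisation"
    return f"DONE: {action!r}"

print(execute("request vendor status update", Risk.LOW))
print(execute("reroute production traffic", Risk.HIGH))
print(execute("reroute production traffic", Risk.HIGH, human_approved=True))
```

The point of the pattern is that the default path for anything risky is escalation, so agent autonomy grows only within limits humans have already defined.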
This strategy of establishing safeguards also helps us find the right balance between agent scalability and cost. Consider the agent organisation as an auditable, policy-driven inventory. Some agents remain active for extended periods to perform scheduled checks and detect anomalies, while others are activated on demand as issues arise, so the number of active agents at any given time scales directly with incident volume.
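The on-demand half of that model can be sketched as a simple pool that spins an agent up per incident and retires it on resolution (names here are hypothetical, not Azure's internal system):

```python
class AgentPool:
    """Activate one agent per incident; retire it when the incident resolves."""
    def __init__(self) -> None:
        self.active: dict[str, str] = {}   # incident_id -> agent handle

    def on_incident(self, incident_id: str) -> None:
        # Idempotent: a repeated alert for the same incident reuses its agent.
        self.active.setdefault(incident_id, f"agent-for-{incident_id}")

    def on_resolved(self, incident_id: str) -> None:
        self.active.pop(incident_id, None)

pool = AgentPool()
pool.on_incident("fibre-cut-sea-01")
pool.on_incident("optics-fault-eu-07")
pool.on_resolved("fibre-cut-sea-01")
print(len(pool.active))  # 1: active agent count tracks open incidents
```

Because agents only exist while their incident is open, compute cost tracks operational load rather than fleet size.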
For instance, let’s examine a real situation involving a fibre break in Azure’s infrastructure in Southeast Asia. Once detected, an autonomous agent was created with complete context surrounding the incident. The agent communicated via email and Teams with a regional fibre supplier and field technicians in multiple languages, requesting updates at specific intervals and validating repair efforts against real-time telemetry. When the first technician fix failed, the agent promptly escalated the matter with detailed feedback. Technicians re-attempted the fix informed by that feedback, and once testing confirmed success, the agent announced restoration to all involved. This entire process unfolded within the same systems and communication channels used by our employees, drastically reducing response times and making record-keeping straightforward.
This workflow comprised around 14 interactions over roughly 9.5 hours, without requiring a human engineer to manage each step. While engineers retained accountability for decisions and outcomes, the coordination progressed steadily without manual follow-ups or handovers. This model maintains ownership but alters how our workforce manages and supervises operations.
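The control loop behind an incident like this can be sketched roughly as follows (a simplified illustration under assumed interfaces, not Azure's implementation): each reported fix is validated against telemetry, unvalidated fixes trigger a re-work request with feedback, and the loop falls back to a human engineer if it cannot converge.

```python
def repair_workflow(get_vendor_update, telemetry_ok, notify, max_attempts=5):
    """Drive a repair to completion: validate each reported fix against
    telemetry, and request re-work with feedback when a fix does not hold."""
    for attempt in range(1, max_attempts + 1):
        update = get_vendor_update()              # email/Teams update from the field
        if update == "fix reported" and telemetry_ok():
            notify(f"restoration confirmed after {attempt} attempt(s)")
            return True
        notify(f"attempt {attempt}: fix not validated, requesting re-work")
    notify("escalating to human engineer")        # human remains the backstop
    return False

# Simulated run: the first reported fix fails telemetry, the second holds.
updates = iter(["fix reported", "fix reported"])
checks = iter([False, True])
log: list[str] = []
ok = repair_workflow(lambda: next(updates), lambda: next(checks), log.append)
print(ok, log[-1])  # True restoration confirmed after 2 attempt(s)
```

Note that "trust but verify" is the core of the loop: a vendor's "fixed" claim is never accepted without an independent telemetry check.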
Several notable changes occur when agents become the primary coordination layer for incidents, repairs, and workflows:
- Coordinating efforts across vendors, regions, and systems becomes seamless and continuous.
- Updates are validated against real-time telemetry, rather than being relied upon blindly.
- Failures in actions are identified promptly and retried until success criteria are achieved.
- Long-term incidents caused by delayed handoffs are significantly diminished.
This leads to transformational outcomes that empower agents to assist our teams in scaling work to unprecedented levels:
- Mitigation roughly twice as fast for issues such as fibre repair workflows.
- Up to 78% fewer manual tasks, as operational burden shifts to agents.
Human and agent efforts occur in tandem within the same channels, ensuring information parity and expedient handovers. Engineers remain part of the process, but they no longer need to micromanage every task. Instead, they focus on guiding outcomes, intervening in complex scenarios, and developing system responses over time.
Ultimately, embedding agents into our daily operations is about developing learning systems. Beyond just responding reactively, agents function as a second set of eyes to identify recurring issues and subtle signals that can shape both day-to-day operations and future network and datacentre designs. Each cycle through feedback loops strengthens the network while enhancing the agents’ ability to resolve and prevent problems.
From our experiences, we’ve gained several insights into designing, operating, and governing global-scale systems:
Begin with successful conversational agents, then evolve them to execute actions. Agents become highly effective when they retain context over time, persist with a task, and close the loop without constant prompting. This minimises waiting and handoffs while maintaining clear ownership.
Engineers should design agents to operate in commonplace, accessible channels. At a large scale, simple coordination is essential for achieving timely, correct outcomes. Integrating agents within the same communication and record-keeping systems ensures actions are faster, smoother, and well-documented.
Clearly define guardrails and organisational policies for agents from the start. Well-defined approval processes and role-specific permissions empower agents to act confidently within known limits. Humans continue to be accountable for decisions that carry long-term repercussions.
Evaluate impact where it counts in operations. The most telling signals are those we’ve always aimed for: quicker mitigation, fewer stalled incidents, and reduced repair times. To assess the impact of agents, monitor how much of the process they autonomously handle en route to achieving these results.
We’re continuously refining this model with a commitment to responsible scaling, which includes governance, building trust, and maintaining cost-effectiveness.
Over time, this approach is nurturing systems that not only recover more rapidly but also learn from operational insights, feeding those lessons back into how we design and operate our networks. Ultimately, we’re creating a system that possesses a new level of autonomy in managing and repairing itself.
The overarching takeaway isn’t just about a specific platform or product. It’s about the possibilities that arise when AI agents work alongside humans in real-world operational systems. By leveraging agents to alleviate operational burdens, humans can coordinate their expertise more swiftly across broader areas while still maintaining control over direction and outcomes.