What’s new with Microsoft in open-source and Kubernetes at KubeCon + CloudNativeCon Europe 2026
Technology often undergoes a familiar evolution. Initially, teams can experiment freely, choosing various tools, frameworks, and problem-solving approaches. This flexibility may seem beneficial at first, but as the project scales, it often leads to fragmentation.
Simply adding capabilities isn’t the solution; what’s truly needed is a unified operational philosophy. Kubernetes highlighted this principle. It didn’t only address the question of “how do we run containers?” but also tackled “how do we implement changes safely in active systems?” The community developed these essential patterns, strengthened them, and established them as standard practices.
Currently, AI infrastructure is still in a somewhat chaotic phase. The transition from distinguishing between “working” and “broken” systems to evaluating “good answers” versus “bad answers” presents a unique operational challenge. This won’t be resolved just through more tools; instead, it requires an open-source collaborative effort to create shared interfaces and harness community momentum, leading to well-documented and repeatable practices.
That’s our direction. Since my last update during KubeCon + CloudNativeCon North America 2025, we’ve continued to invest in open-source AI infrastructure, networking, observability, storage, multi-cluster operations, and cluster lifecycle management. At the upcoming KubeCon + CloudNativeCon Europe 2026 in Amsterdam, we’re excited to share our progress in bringing Kubernetes’ operational maturity to today’s AI workloads.
Creating an Open Source Foundation for AI on Kubernetes
The merging of AI with Kubernetes infrastructure reveals that the challenges in both areas are increasingly interconnected. Much of our recent work has focused on establishing foundational tools that make GPU-supported workloads a vital part of the cloud-native landscape.
On the scheduling front, Microsoft has partnered with industry leaders to promote open standards for managing hardware resources. Some key achievements this cycle include:
- Dynamic Resource Allocation (DRA) is now widely available, including the release of the DRA example driver and DRA Admin Access.
- Workload Aware Scheduling for Kubernetes 1.36 incorporates DRA support within the Workload API and integrates with KubeRay, simplifying how developers can request and manage high-performance infrastructure for training and inference.
- DRANet has gained compatibility for Azure RDMA Network Interface Cards (NICs), enhancing network resource management for high-performance hardware, thereby optimizing training performance.
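To make the DRA model concrete: instead of requesting opaque extended resources, a workload asks for a device through a claim. The sketch below is a minimal, hypothetical example — the device class name and image are placeholders, and the field names follow the `resource.k8s.io/v1beta1` API, which may differ from the version in your cluster:

```yaml
# Minimal DRA sketch: request one device of a GPU device class via a claim template.
# "gpu.example.com" is a hypothetical DeviceClass installed by a DRA driver.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.example.com
---
apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  restartPolicy: Never
  containers:
  - name: trainer
    image: registry.example.com/trainer:latest   # placeholder image
    resources:
      claims:
      - name: gpu          # bind the claim to this container
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu
```

The scheduler then places the pod only on a node where a driver can satisfy the claim, which is what makes workload-aware scheduling of accelerators possible.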
Beyond scheduling, we’ve been improving the tools essential for deploying, managing, and securing AI workloads on Kubernetes:
- AI Runway is a new open-source project that offers a shared Kubernetes API for inference workloads, streamlining model deployment management. It includes a user-friendly web interface so that those unfamiliar with Kubernetes can deploy a model easily, along with built-in HuggingFace model discovery, GPU memory fit indicators, real-time cost estimates, and compatibility with NVIDIA Dynamo, KubeRay, llm-d, and KAITO.
- HolmesGPT has joined the Cloud Native Computing Foundation (CNCF) as a Sandbox project, adding automatic troubleshooting features to the shared cloud-native tooling ecosystem.
- Dalec, a recently accepted CNCF project, defines declarative specifications for building system packages and producing minimal container images. It supports SBOM generation and provenance verification, helping organizations reduce vulnerabilities when managing AI workloads at scale.
- Cilium received significant contributions from Microsoft this cycle, including native mTLS ztunnel support for secure workload communication, Hubble metrics cardinality controls for managing observability costs, and flow log aggregation to conserve storage volume, alongside two merged Cluster Mesh Cilium Feature Proposals (CFPs) aimed at advancing cross-cluster networking.
What’s New in Azure Kubernetes Service
In addition to our upstream efforts, I’m pleased to share new enhancements in Azure Kubernetes Service (AKS) across networking and security, observability, multi-cluster operations, storage, and cluster lifecycle management.
Transitioning from IP-based Controls to Identity-aware Networking
As Kubernetes deployments become more distributed, relying on IP-based networking becomes increasingly complicated: visibility diminishes, security policies become cumbersome to audit, and ensuring encrypted workload communication historically required either a complete service mesh or considerable custom development. This cycle, our networking updates aim to streamline this by shifting security and traffic intelligence to the application layer, where they have real impact and are easier to manage.
Azure Kubernetes Application Network provides teams with mutual TLS, application-specific authorization, and detailed traffic analysis for both ingress and in-cluster communications, complete with built-in multi-region connectivity. The benefit is identity-aware security and genuine traffic insight without the complexity of a full service mesh. For teams navigating the transition away from ingress-nginx, Application Routing with Meshless Istio offers a standards-friendly path: support for the Kubernetes Gateway API without sidecars, compatibility with existing ingress-nginx setups, and contributions to ingress2gateway for incremental migration.
At the data plane level, WireGuard encryption within the Cilium data plane secures node-to-node communications efficiently without the need for adjustments to applications. Additionally, Cilium mTLS in Advanced Container Networking Services extends security to pod-to-pod communication using X.509 certificates and SPIRE for identity management: ensuring authenticated, encrypted communication without sidecars. Finally, Pod CIDR expansion lifts a long-standing operational limitation, allowing clusters to grow their pod IP ranges on the fly instead of needing a complete rebuild, and administrators can now disable HTTP proxy variables for nodes and pods without altering control plane settings.
Visibility That Matches the Complexity of Modern Clusters
To manage Kubernetes effectively at scale, teams need clear visibility into infrastructure, networking, and workloads. This cycle, we focused on two persistent visibility gaps: GPU telemetry and network traffic observability, both of which become critical as AI workloads enter production.
Historically, teams managing GPU workloads faced a major monitoring gap: GPU utilization data wasn’t integrated with standard Kubernetes metrics without manual exporter setups. Now, AKS surfaces GPU performance and utilization directly in managed Prometheus and Grafana, placing GPU metrics in the same monitoring stack teams already use for capacity planning and alerting. On the network side, per-flow L3/L4 and supported L7 visibility now covers HTTP, gRPC, and Kafka traffic, with details like IP addresses, ports, workloads, flow direction, and policy decisions available through a new Azure Monitor experience featuring built-in dashboards and streamlined onboarding. For teams facing excessive metric volume, operators can now selectively control the collection of container-level metrics using Kubernetes custom resources, keeping dashboards focused on actionable signals. Agentic container networking now includes a web interface that translates natural-language questions into read-only diagnostics over real-time telemetry, shortening the path from “there’s an issue” to “here’s how to address it.”
Simplified Operations Across Clusters and Workloads
For organizations running workloads across multiple clusters, cross-cluster networking has usually meant custom configurations, inconsistent service discovery, and limited visibility across cluster boundaries. Azure Kubernetes Fleet Manager now addresses this by providing cross-cluster networking via a managed Cilium cluster mesh: unified connectivity across AKS clusters, a global service registry for cross-cluster service discovery, and intelligent routing managed centrally rather than configured per cluster.
On the storage front, clusters can now leverage storage from a shared Elastic SAN pool instead of managing individual disks for each workload. This simplifies capacity planning for stateful workloads with fluctuating demands and lessens provisioning challenges at scale.
For those who want an easier way to interact with Kubernetes, AKS desktop is now generally available. It brings a complete AKS experience to your desktop, letting developers run, test, and refine Kubernetes workloads locally with the same configurations they’ll use in production.
Safer Upgrades and Quicker Recovery
The stakes are high when an upgrade goes wrong in production, and recovering from such issues can often be lengthy and stressful. This cycle has seen several updates aimed specifically at enhancing cluster changes—making them safer, more trackable, and easier to revert.
With blue-green agent pool upgrades, a parallel pool with the new configuration is created rather than changes being applied in place; teams can validate functionality before redirecting traffic and keep a clear fallback if something goes wrong. Similarly, agent pool rollback lets you revert a node pool to its earlier version and node image if issues surface after an upgrade, without a full rebuild. Together, these capabilities give operators finer control over the upgrade lifecycle, replacing the old trade-off between “upgrade and hope” and “stay on the old version.” To speed provisioning during scale-out, prepared image specifications let teams define custom node images that preload container images, OS settings, and initialization scripts, ensuring faster startup and consistent, repeatable environments.
Meet the Microsoft Azure Team in Amsterdam
The Azure team is thrilled to participate in KubeCon + CloudNativeCon Europe 2026. Here are some highlights for connecting with the Azure team on-site:
- Rules of the Road for Shared GPUs: AI Inference Scheduling at Wayve—Customer keynote, Tuesday, March 24, 2026, 9:37 AM CET.
- Scaling Platform Ops with AI Agents: Troubleshooting to Remediation—Tuesday, March 24, 2026, 10:13 AM CET with Jorge Palma, Principal PDM Manager, Microsoft.
- Building Cross-Cloud AI Inference on Kubernetes with OSS—Wednesday, March 25, 2026, 1:15 PM CET with Jorge Palma and Anson Qian, Principal Software Engineer, Microsoft.
- Stop by our booth #200 for live demonstrations and chats with the Azure and AKS teams.
- Or check out the full schedule of sessions featuring Microsoft speakers throughout the event.
Enjoy your time at KubeCon + CloudNativeCon!