
Infinite scale: The architecture behind the Azure AI superfactory

We’re excited to announce the opening of our newest Azure AI datacenter in Atlanta, Georgia, a significant addition to our Fairwater initiative. This state-of-the-art datacenter connects seamlessly with our first Fairwater site in Wisconsin and with other AI supercomputing facilities, forming the world’s first planet-scale AI superfactory. By maximizing compute density, each Fairwater site is built to meet the soaring demand for AI capability, advance model intelligence, and empower people and organizations around the world to achieve more.

To meet this demand, we’ve fundamentally rethought how we design AI datacenters and the systems inside them. Fairwater departs from the conventional cloud datacenter model, using a flat network that can integrate hundreds of thousands of the latest NVIDIA GB200 and GB300 GPUs into one massive supercomputer. These advances draw on years of experience designing datacenters and networks, combined with lessons learned from supporting some of the world’s largest AI training jobs.

The Fairwater design is built not only for the next wave of frontier models but also for versatility. Training has evolved from a single monolithic job into a mix of workloads with distinct requirements, including pre-training, fine-tuning, reinforcement learning, and synthetic data generation. Microsoft has deployed a dedicated AI WAN backbone to connect the Fairwater sites, creating a broader, elastic system that dynamically places diverse AI workloads and maximizes GPU utilization across the combined fleet.

Let’s explore some of the exciting technical advancements behind Fairwater, encompassing our datacenter construction methods and networking capabilities both within and between sites.

Maximizing Compute Density

Modern AI infrastructure runs up against hard physical limits. Most notably, the speed of light puts a floor on communication latency between accelerators, compute, and storage. Fairwater is designed to maximize compute density, reducing latency both within racks and between them for better overall performance.
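
To make that constraint concrete, here is a back-of-the-envelope calculation of our own (typical fiber figures, not numbers from the Fairwater design): signals in optical fiber propagate at roughly two-thirds the speed of light in vacuum, so every meter of cable adds about 5 ns of one-way delay.

```python
# Back-of-the-envelope propagation delay for cable runs (illustrative
# figures, not from the post). Signals in fiber travel at ~2/3 c.
C = 299_792_458            # speed of light in vacuum, m/s
V_FIBER = 0.67 * C         # typical propagation speed in optical fiber

def one_way_latency_ns(cable_m: float) -> float:
    """One-way propagation delay in nanoseconds for a cable run."""
    return cable_m / V_FIBER * 1e9

for meters in (5, 50, 500):
    print(f"{meters:>4} m -> {one_way_latency_ns(meters):7.1f} ns one way")
# Shortening a 50 m run to 5 m saves roughly 225 ns per hop, which adds
# up when collective operations cross the network millions of times.
```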

An essential factor in achieving higher density is our approach to cooling. The AI servers in Fairwater datacenters are connected to a facility-wide, closed-loop cooling system built to last: after the initial fill, the liquid recirculates continuously with minimal evaporation. That initial fill uses about as much water as 20 households consume in a year, and it is replaced only if water chemistry requires it, making the approach both efficient and sustainable.

Liquid cooling dramatically improves heat transfer and lets us raise rack- and row-level power (~140 kW per rack, 1,360 kW per row), packing maximum compute density into the datacenter. The cooling design also conserves resources during steady-state operation, allowing large training jobs to run smoothly at impressive scale. Heat from the GPUs is rejected through one of the largest chiller plants in the world.

Rack-level direct liquid cooling.
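
As a quick sanity check on those power figures (our own arithmetic, not additional disclosed detail), the quoted rack and row numbers imply nine to ten racks per row:

```python
# Sanity check on the quoted power figures (illustrative arithmetic only).
RACK_KW = 140      # approximate power draw per liquid-cooled rack
ROW_KW = 1_360     # quoted row-level power

racks_per_row = ROW_KW / RACK_KW
print(f"~{racks_per_row:.1f} racks per row")   # ~9.7, i.e. 9-10 racks
# At 72 GPUs per rack, a row holds roughly 650-700 GPUs and draws about
# 1.4 MW, a density that air cooling cannot realistically serve.
```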

Another strategy for increasing compute density is a two-story datacenter layout. Many AI workloads are latency-sensitive, so longer cable runs measurably hurt cluster performance. In Fairwater, every GPU is interconnected with every other GPU, and the vertical design enables three-dimensional rack placement that shortens cable runs, reducing latency while improving bandwidth, reliability, and cost.

Two-story networking architecture.

Reliable, Low-Cost Power

We’re also dedicated to providing efficient, dependable power for this compute. The Atlanta site was chosen for its resilient utility power, delivering 4×9 (99.99%) availability at 3×9 (99.9%) cost. Because grid power is reliable enough to draw on continuously, we can forgo expensive traditional backup systems for the GPU fleet, lowering costs for our customers while accelerating Microsoft’s time-to-market.
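
Standard availability arithmetic shows what "4×9 at 3×9 cost" buys (our illustration, not from the announcement):

```python
# Allowed downtime per year at each availability level (standard
# "nines" arithmetic, included here for illustration).
HOURS_PER_YEAR = 24 * 365

for name, avail in (("3x9", 0.999), ("4x9", 0.9999)):
    downtime_min = HOURS_PER_YEAR * (1 - avail) * 60
    print(f"{name}: {avail:.2%} uptime -> {downtime_min:.0f} min/year down")
# 3x9 allows ~526 minutes (~8.8 hours) of downtime a year; 4x9 only
# ~53 minutes. Grid power that reliable removes the need for per-rack
# UPS and generator backup for the GPU fleet.
```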

We’ve also collaborated with industry leaders on power-management strategies that dampen the power swings caused by large synchronous jobs, which is crucial for grid stability as demand for AI services grows. Our solutions include software that schedules supplementary workloads during quiet periods, hardware that lets GPUs enforce their own power caps, and on-site energy storage that smooths fluctuations without wasting power.
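
A minimal sketch of how such a smoothing policy could be structured (the function, thresholds, and action names are hypothetical illustrations, not Microsoft's implementation):

```python
# Hypothetical power-smoothing policy in the spirit described above.
# All names and thresholds are invented for illustration.

def smooth_power(measured_kw: float, floor_kw: float, ceiling_kw: float):
    """Return an action that keeps a site's grid draw inside a band."""
    if measured_kw < floor_kw:
        # Idle valleys (e.g. synchronization phases of a large job):
        # backfill with low-priority work so grid draw stays level.
        return ("run_filler_jobs", floor_kw - measured_kw)
    if measured_kw > ceiling_kw:
        # Spikes: tighten GPU power caps and let on-site storage absorb
        # the remainder instead of passing the swing to the grid.
        return ("reduce_gpu_power_caps", measured_kw - ceiling_kw)
    return ("steady", 0.0)

print(smooth_power(900.0, floor_kw=1_000.0, ceiling_kw=1_300.0))
# -> ('run_filler_jobs', 100.0)
```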

State-of-the-Art Accelerators and Networking Technologies

The Fairwater datacenters are built from purpose-designed servers, cutting-edge AI accelerators, and novel networking systems. Each facility operates as a single coherent cluster of interconnected NVIDIA Blackwell GPUs, with a network architecture that scales beyond the limits of traditional Clos designs, allowing hundreds of thousands of GPUs to communicate on one flat network. This required innovation in scale-up networking, scale-out networking, and the protocols that run over them.

For scale-up, each AI accelerator rack houses up to 72 NVIDIA Blackwell GPUs interconnected via NVLink for ultra-low-latency communication within the rack. The Blackwell accelerators deliver exceptional compute density, including support for low-precision number formats such as FP4 that raise effective FLOPS and stretch memory capacity. Each rack also provides 1.8 TB/s of GPU-to-GPU bandwidth and makes over 14 TB of pooled memory visible to every GPU.

Densely populated GPU racks with app-driven networking.
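
A rough cross-check on those rack-level numbers (the per-GPU memory figure is our assumption from public Blackwell specifications, not a number from the post):

```python
# Cross-check on the scale-up figures. The per-GPU HBM capacity is an
# assumption based on public Blackwell specs, not a disclosed number.
GPUS_PER_RACK = 72
HBM_PER_GPU_GB = 192          # approximate HBM per Blackwell GPU

pooled_tb = GPUS_PER_RACK * HBM_PER_GPU_GB / 1000
print(f"~{pooled_tb:.1f} TB of HBM pooled across the NVLink domain")
# ~13.8 TB from HBM alone; CPU-attached memory reachable over NVLink
# pushes the pool past 14 TB. FP4 also halves bytes per parameter
# versus FP8, so the same pool holds roughly twice the model weights.
```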

These racks integrate with scale-out networking to form pods and clusters, allowing all GPUs to operate as a single supercomputer with minimal latency. We achieve this with a two-tier, Ethernet-based backend that provides 800 Gbps GPU-to-GPU connectivity across vast clusters. Building on the broad Ethernet ecosystem and SONiC (our open-source operating system for network switches) keeps us flexible and reduces cost by using commodity hardware rather than vendor-locked solutions.
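
To see why two switching tiers can reach that scale, here is the textbook leaf-spine (two-tier Clos) sizing formula, with assumed switch radixes; Microsoft has not disclosed its actual fabric parameters:

```python
# Textbook two-tier (leaf-spine) Clos sizing. Radix values below are
# illustrative assumptions, not Microsoft's fabric parameters.

def max_endpoints(radix: int) -> int:
    """Non-blocking leaf-spine fabric built from radix-R switches.

    Each leaf uses R/2 ports down (to GPUs) and R/2 up (to spines);
    each spine port serves one leaf, so at most R leaves fit.
    """
    return radix * radix // 2

for r in (64, 128, 512):
    print(f"radix {r:>3}: up to {max_endpoints(r):,} endpoints")
# radix 512 -> 131,072. Running several such planes in parallel is one
# way a flat two-tier design can reach hundreds of thousands of GPUs.
```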

Key enhancements in packet trimming, packet spraying, and high-frequency telemetry underpin our optimized AI network, and we continue to advance how network paths are controlled and tuned. Together, these upgrades deliver improved congestion management, rapid loss detection and retransmission, and effective load balancing, providing the ultra-reliable, low-latency behavior modern AI workloads require.
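
A simplified model of why packet spraying helps (our illustration; production switches implement this in hardware with far more sophistication):

```python
# Per-flow ECMP vs. packet spraying over equal-cost paths (toy model).
import random

PATHS = 4  # equal-cost paths between two leaf switches

def ecmp_path(flow_id: int) -> int:
    # Classic ECMP hashes the flow once, so every packet of one large
    # "elephant" flow pins to a single path and can congest it.
    return hash(flow_id) % PATHS

def sprayed_path() -> int:
    # Packet spraying picks a path per packet, spreading one flow's
    # load evenly; trimming and fast retransmit handle reordering/loss.
    return random.randrange(PATHS)

loads = [0] * PATHS
for _ in range(10_000):
    loads[sprayed_path()] += 1
print(loads)  # roughly equal, vs. all 10,000 packets on one ECMP path
```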

Global Scale

Even with these advances, the compute appetite of the largest training jobs, now at trillions of parameters, exceeds the power and space limits of any single facility. To address this, we built an AI WAN optical network that extends Fairwater’s scalable designs across sites. Drawing on our footprint and years of hyperscale experience, we lit over 120,000 new fiber miles across the United States last year, improving AI network reach and reliability nationwide.

This robust backbone lets us interconnect supercomputers of different generations into an AI superfactory whose capability exceeds that of any single facility, spanning diverse geographic regions. AI developers can tap our broad network of Azure AI datacenters and segment traffic by need between the scale-up and scale-out networks within a site, and across sites via the continent-spanning AI WAN.
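
The reason traffic is segmented this way comes down to physics; as a rough illustration (the route length below is an assumption, since actual inter-site fiber paths are not public):

```python
# Long-haul propagation delay vs. in-building latency. The route
# length is an illustrative assumption; real fiber paths are not public.
C = 299_792_458               # speed of light in vacuum, m/s
V_FIBER = 0.67 * C            # typical propagation speed in fiber

def one_way_ms(route_km: float) -> float:
    """One-way propagation delay in milliseconds over a fiber route."""
    return route_km * 1_000 / V_FIBER * 1_000

print(f"{one_way_ms(1_100):.1f} ms one way")   # ~5.5 ms for ~1,100 km
# Thousands of times the sub-microsecond latency inside a rack, so the
# AI WAN suits stages that tolerate delay (e.g. data-parallel or
# asynchronous work) while latency-critical traffic stays within a site.
```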

This marks a significant shift from the previous model, in which all traffic traversed the scale-out network regardless of workload requirements. The change gives customers networking tailored to each workload and makes our infrastructure more flexible and efficient.

Integrating the Innovations

The new Fairwater site in Atlanta is a pivotal advancement for Azure AI infrastructure, built on our experience running the world’s largest AI training jobs. It combines innovations in compute density, sustainability, and networking to meet the massive demand for compute we’re seeing, and it integrates seamlessly with other AI datacenters and the broader Azure platform to create the world’s first AI superfactory. The result is flexible, fit-for-purpose infrastructure that supports the full spectrum of modern AI workloads and empowers every person and organization on the planet to achieve more. For our customers, that means it’s easier to infuse AI into every workflow and to build pioneering AI solutions that were previously out of reach.

If you want to learn more about how Microsoft Azure can assist you in incorporating AI to enhance and streamline your development processes, click here.

Scott Guthrie oversees hyperscale cloud computing solutions and services, including Azure, Microsoft’s cloud platform, generative AI solutions, data services, and information security. These platforms and services empower organisations around the globe to conquer pressing challenges and drive long-lasting transformation.

Editor’s note: An update has been made to clarify our network optimization techniques.
