Accelerating open-source infrastructure development for frontier AI at scale
Microsoft is contributing new standards in power, cooling, sustainability, security, networking, and fleet resilience to drive industry innovation forward.
As we shift from building cloud-scale computing infrastructure to designing AI and cloud systems for frontier-scale use, we have seen enormous innovation. Throughout this evolution, Microsoft has shared its insights and best practices, strengthening cloud infrastructure through collaborative initiatives like the Open Compute Project (OCP) Global Foundation.
We are now entering a crucial phase of cloud infrastructure innovation. In the past year alone, Microsoft has added more than 2 gigawatts of capacity and brought online the world’s most powerful AI datacentre, delivering ten times the performance of today’s fastest supercomputer. And this is just the beginning.
Delivering AI infrastructure that achieves top performance at the lowest cost requires a systematic approach: optimising every layer of our stack ensures quality, speed, and reliability for our customers. As we work to deliver resilient, sustainable, secure, and scalable technology for diverse AI workloads, we are embarking on an ambitious journey to redefine infrastructure innovation at every level, from silicon to systems, and to establish tightly integrated industry standards that pave the way for global interoperability.
At this year’s OCP Global Summit, Microsoft is proud to contribute new standards in power, cooling, sustainability, security, networking, and fleet resilience to further propel industry innovation.
Rethinking Power Distribution for the AI Era
As AI workloads expand worldwide, hyperscale datacentres are facing unmatched challenges in power density and distribution.
Last year at the OCP Global Summit, we joined forces with Meta and Google to develop a groundbreaking power architecture known as Mt. Diablo. This year, we’re advancing this innovation through solid-state transformers. These devices simplify the power chain by employing new conversion technologies and protective measures that can meet future rack voltage demands.
Training large models across thousands of GPUs brings variable and intense power consumption patterns that can put a strain on both the grid and traditional power delivery systems. These fluctuations not only threaten hardware reliability and operational efficiency, but they also complicate capacity planning and sustainability goals.
In collaboration with key industry partners, Microsoft is spearheading a power stabilization initiative to tackle these challenges. A recent paper co-authored with OpenAI and NVIDIA, Power Stabilization for AI Training Datacenters, details how innovations spanning hardware, firmware orchestration, telemetry, and facility integration can dampen power spikes, decrease power overshoot by 40%, and reduce operational risk. This paves the way for predictable, scalable power delivery for AI training clusters.
At this year’s OCP Global Summit, we’ll launch a dedicated power stabilization workgroup alongside industry partners. Our goal is open collaboration among hyperscalers and hardware partners: sharing insights gained from our system-wide innovations and inviting the community to co-create new methodologies for the distinct power challenges of AI training datacentres. Building on the findings of our recent white paper, we aim to accelerate industry-wide adoption of resilient, scalable power delivery for the next generation of AI infrastructure. Find out more about our power stabilization efforts.
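To make the idea concrete, here is a minimal sketch of the kind of ramp-rate limiting a power stabilization layer might apply at rack level. When thousands of GPUs start or finish a training step together, requested power can jump sharply; clamping the per-interval change smooths the load the facility sees. All names and numbers are illustrative assumptions, not details of Microsoft’s design.

```python
# Hypothetical ramp-rate limiter: clamp per-interval power changes so a
# sudden step in requested power becomes a gradual ramp at the facility.

def limit_ramp(requested_watts, prev_watts, max_step_watts):
    """Clamp the change in delivered power to +/- max_step_watts per interval."""
    delta = requested_watts - prev_watts
    delta = max(-max_step_watts, min(max_step_watts, delta))
    return prev_watts + delta

def smooth_profile(requested, start_watts=0.0, max_step_watts=50.0):
    """Apply the limiter across a sequence of per-interval power requests."""
    out, prev = [], start_watts
    for req in requested:
        prev = limit_ramp(req, prev, max_step_watts)
        out.append(prev)
    return out

# A step from 100 W to 500 W is spread over several intervals:
profile = smooth_profile([100, 100, 500, 500, 500], start_watts=100)
print(profile)  # [100, 100, 150, 200, 250]
```

In a real system this shaping would be coordinated across hardware, firmware, and facility layers rather than done in software alone, but the slew-limiting principle is the same.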
Cooling Innovations for Enhanced Resilience
As AI infrastructure power demands grow, we’re rethinking our cooling systems to address changing energy consumption, space optimisation, and overall sustainability. We’re deploying a range of cooling solutions to support our growth: while building new AI-scale datacentres, we’re also employing Heat Exchanger Unit (HXU)-based liquid cooling to rapidly expand AI capacity within our existing air-cooled datacentre footprint.
Microsoft’s next-generation HXU will soon debut as part of OCP, allowing liquid cooling for high-performance AI systems within air-cooled datacentres, aiding global scalability and rapid deployment. The modular HXU design offers double the performance of current models and maintains over 99.9% cooling service availability for AI workloads. No modifications to the datacentre are needed, ensuring seamless integration and expansion. Discover more about the next generation HXU here.
Additionally, we continue to innovate at multiple layers to manage power and heat dissipation: implementing facility water cooling at datacentre scale, circulating liquid in closed loops from servers to chillers, and exploring on-chip cooling technologies such as microfluidics to remove heat directly from the silicon.
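The sizing of such closed loops follows from basic thermodynamics. Below is an illustrative back-of-the-envelope calculation, using the standard relation Q = ṁ·cp·ΔT, of the coolant flow needed to carry a given heat load; the rack power and temperature rise are invented for the example and are not specifications of Microsoft’s HXU.

```python
# Generic coolant-flow sizing for a closed liquid-cooling loop.

def required_flow_lpm(heat_load_w, delta_t_c, cp_j_per_kg_k=4186.0, density_kg_per_l=1.0):
    """Litres per minute of coolant needed to absorb heat_load_w at a
    delta_t_c temperature rise. Defaults assume water (~4186 J/kg.K, ~1 kg/L)."""
    mass_flow_kg_s = heat_load_w / (cp_j_per_kg_k * delta_t_c)  # kg/s
    return mass_flow_kg_s / density_kg_per_l * 60.0             # L/min

# A hypothetical 120 kW rack with a 10 C coolant temperature rise:
print(round(required_flow_lpm(120_000, 10.0), 1))  # 172.0
```

The same arithmetic, run in reverse, shows why higher allowable temperature rises reduce pumping requirements, one of the levers liquid cooling offers over air.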
Unified Networking Solutions for Increasing Infrastructure Demands
As we scale hundreds of thousands of GPUs to operate as one system, significant challenges arise in building rack-scale interconnects and low-latency, high-bandwidth networks that are efficient and interoperable. With the rapid growth in AI workloads and infrastructure needs, we are pursuing networking optimisations that span scale-up, scale-out, and Wide Area Network (WAN) technologies for large-scale distributed training.
We maintain close partnerships with standards organizations like the Ultra Ethernet Consortium (UEC) and UALink, focusing on networking innovation for these critical elements of AI systems. We’re also promoting the adoption of Ethernet for scale-up networking across the sector and are thrilled to announce the launch of the Ethernet for Scale-up Networking (ESUN) workstream within the OCP Networking Project. We look forward to advancing adoption of cutting-edge networking technologies and enabling a multi-vendor ecosystem based on open standards.
Security, Sustainability, and Quality: Essential Foundations for Robust AI Operations
Defense in Depth: Trust in Every Layer
Our holistic approach to responsibly scaling AI systems places trust and security at the core of our platform. This year, we’re introducing new security contributions that enhance our established hardware security work and introduce new protocols tailored to support the scientific breakthroughs accelerated by AI:
- Building on our previous collaborations with AMD, Google, and NVIDIA, we have further enhanced Caliptra, our open-source silicon root of trust. The newly released Caliptra 2.1 expands the hardware root of trust into a fully-fledged security subsystem. Learn more about Caliptra 2.1 here.
- We’ve also integrated Adams Bridge 2.0 into Caliptra to extend support for quantum-resilient cryptographic algorithms within the root of trust.
- Additionally, we are contributing OCP Layered Open-source Cryptographic Key Management (L.O.C.K)—a key management framework for storage devices that secures media encryption keys at the hardware level. L.O.C.K was developed collaboratively by Google, Kioxia, Microsoft, Samsung, and Solidigm.
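To illustrate the layered-key idea behind hardware key management, here is a minimal sketch in which a media encryption key is never stored directly but derived through a chain of keys, so rotating an upper layer cryptographically erases data below it. This uses stdlib HMAC as a stand-in key-derivation function; it is an assumption-laden illustration, not the L.O.C.K protocol, and all labels are invented.

```python
# Layered key derivation: device root -> per-namespace key -> media key.
import hmac
import hashlib

def derive(parent_key: bytes, label: bytes) -> bytes:
    """Derive a child key from a parent key and a context label (HMAC-SHA256)."""
    return hmac.new(parent_key, label, hashlib.sha256).digest()

root_key = b"\x01" * 32                      # would live in hardware, never exported
ns_key = derive(root_key, b"namespace-7")    # hypothetical label scheme
media_key = derive(ns_key, b"media-encryption-key")

# Rotating the namespace key yields a different media key, so data encrypted
# under the old key becomes unrecoverable (crypto-erase):
new_media_key = derive(derive(root_key, b"namespace-7-v2"), b"media-encryption-key")
print(media_key != new_media_key)  # True
```

A real scheme would use authenticated key wrapping inside the storage device’s hardware; the point here is only the hierarchy, where no key below the root ever needs to be persisted in the clear.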
Promoting Sustainability at Datacentre Scale
Sustainability remains a significant opportunity for industry collaboration and standardization, particularly through initiatives like the Open Compute Project. Working together as a community of hyperscalers and hardware partners is vital in meeting the demand for sustainable datacentre infrastructure capable of adapting as computing needs evolve. This year, we’re excited to continue our collaborations as part of OCP’s Sustainability workgroup, focusing on areas like carbon measurement, reporting, and circularity:
- Announced at this year’s Global Summit, we’re partnering with AWS, Google, and Meta to fund the Product Category Rule initiative within the OCP Sustainability workgroup, aiming to standardise carbon measurement methodologies for devices and datacentre equipment.
- Together with Google, Meta, OCP, Schneider Electric, and the iMasons Climate Accord, we are establishing the Embodied Carbon Disclosure Base Specification to create a common framework for reporting the carbon impact of datacentre equipment.
- Microsoft is advancing waste heat reuse (WHR). In partnership with the NetZero Innovation Hub, NREL, and collaborators in Europe and the US, we’ve released heat reuse reference designs and developed an economic modelling tool that helps datacentre operators evaluate the costs of building waste heat reuse infrastructure, accounting for factors such as size, capacity, seasonal variation, location, and existing regulations. These region-specific solutions help operators convert surplus heat into usable energy, supporting regulatory compliance and unlocking additional capacity, especially in regions like Europe where heat reuse is becoming mandatory.
- We have created an open methodology for Life Cycle Assessment (LCA) at scale across extensive IT hardware fleets, aiming for a “gold standard” in sustainable cloud infrastructure.
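The value of standardised carbon measurement is easiest to see in the arithmetic it makes comparable. Below is an illustrative sketch of lifecycle accounting for one device: embodied carbon plus operational carbon (energy use times grid intensity) over the service life. Every figure is invented for the example; real LCA methodologies are far more detailed.

```python
# Toy lifecycle carbon accounting: embodied + operational emissions.

def lifetime_carbon_kg(embodied_kg, avg_power_w, grid_kg_per_kwh, years, pue=1.2):
    """Total kg CO2e for one device over its service life.
    PUE factors in facility overhead (cooling, power conversion)."""
    hours = years * 8760
    energy_kwh = avg_power_w / 1000 * hours * pue
    return embodied_kg + energy_kwh * grid_kg_per_kwh

# A hypothetical server: 1,500 kg embodied, 400 W average draw,
# 0.3 kg CO2e/kWh grid intensity, 6-year service life:
total = lifetime_carbon_kg(1500, 400, 0.3, 6)
print(round(total))  # 9069
```

Without a shared Product Category Rule, two vendors could report "embodied_kg" with different system boundaries, which is exactly the inconsistency a common methodology removes.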
Rethinking Node Management: Enhancing Fleet Resilience for the Frontier Era
As AI infrastructure scales rapidly, Microsoft is dedicating resources to standardising how diverse compute nodes are deployed, updated, monitored, and serviced in hyperscale datacentres. Collaborating with AMD, Arm, Google, Intel, Meta, and NVIDIA, we are rolling out a series of contributions to the Open Compute Project (OCP) focused on streamlining fleet operations, unifying firmware management, improving manageability interfaces, and enhancing diagnostics and reliability. This standardised approach lays a solid foundation for consistent and scalable node operations during this period of rapid growth. Discover more about our approach to resilient fleet operations.
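As a concrete illustration of health-gated fleet operations, here is a minimal sketch of a canary-style firmware rollout: update a small wave of nodes, check their health, and halt before touching the rest if failures exceed a threshold. The function names and thresholds are invented for illustration and do not describe Microsoft’s actual tooling.

```python
# Hypothetical health-gated firmware rollout across a node fleet.

def rollout(nodes, update_fn, health_fn, wave_size=10, max_failure_rate=0.05):
    """Update nodes in waves; return (updated_ok, skipped) lists.
    Halts the rollout if a wave's failure rate exceeds max_failure_rate."""
    updated_ok, skipped = [], []
    for i in range(0, len(nodes), wave_size):
        wave = nodes[i:i + wave_size]
        for node in wave:
            update_fn(node)                      # apply the firmware update
        failures = [n for n in wave if not health_fn(n)]
        updated_ok.extend(n for n in wave if n not in failures)
        if len(failures) / len(wave) > max_failure_rate:
            skipped = nodes[i + wave_size:]      # stop before the next wave
            return updated_ok, skipped
    return updated_ok, skipped

# Toy run: node 3 fails its post-update health check, halting the rollout.
nodes = list(range(25))
done, skipped = rollout(nodes, update_fn=lambda n: None, health_fn=lambda n: n != 3)
print(len(done), len(skipped))  # 9 15
```

Standardised manageability interfaces are what make a loop like this writable once and runnable across heterogeneous nodes from multiple vendors.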
Paving the Path for Frontier-Scale AI Computing
As we embark on a new era of frontier-scale AI development, Microsoft is proud to lead the charge toward establishing standards that will drive the future of globally deployable AI supercomputing. Our commitment is evident in our active role in shaping an ecosystem that enables scalable, secure, and reliable AI infrastructure across the globe. We invite attendees of this year’s OCP Global Summit to visit Microsoft at booth #B53 where they can explore our latest cloud hardware demonstrations. These showcases highlight our ongoing collaborations within the OCP community, spotlighting innovations that support the evolution of AI and cloud technologies.