Maia 200: The AI accelerator built for inference
We’re excited to unveil Maia 200, an inference accelerator designed to make AI token generation dramatically more cost-effective. Built on TSMC’s cutting-edge 3nm process, Maia 200 pairs native FP8/FP4 tensor cores with a revamped memory system: 216GB of HBM3e at 7 TB/s and 272MB of on-chip SRAM. Advanced data movement engines keep large models efficiently fed and running at high utilisation. The result is the most powerful proprietary silicon from any large cloud service provider, delivering three times the FP4 performance of the latest Amazon Trainium generation and exceeding the FP8 performance of Google’s seventh-generation TPU. It is also the most efficient inference system Microsoft has deployed to date, with a 30% improvement in performance per dollar over our most recent hardware.
Part of our heterogeneous AI infrastructure, Maia 200 supports multiple models, including the latest GPT-5.2 from OpenAI. This brings a performance advantage to Microsoft Foundry and Microsoft 365 Copilot. The Superintelligence team at Microsoft will utilise Maia 200 for generating synthetic data and reinforcement learning, enhancing the development of next-generation in-house models. For applications requiring synthetic data pipelines, Maia 200’s distinctive design accelerates the generation and filtering of high-quality, domain-specific data, providing fresh, targeted signals for downstream training.
Maia 200 is currently operational in our US Central datacentre located near Des Moines, Iowa, with another facility planned for US West 3 in Phoenix, Arizona, and more regions to follow. It integrates seamlessly with Azure, and we are currently previewing the Maia SDK, which provides a comprehensive suite of tools for building and optimising models for Maia 200. This includes PyTorch integration, a Triton compiler, an optimised kernel library, and access to Maia’s low-level programming language. Developers can enjoy detailed control when necessary while easily porting models across various hardware accelerators.
Designed for AI Inference
Produced using TSMC’s advanced 3-nanometer technology, each Maia 200 chip contains over 140 billion transistors. It is purpose-built for large-scale AI workloads with efficient performance per dollar. Maia 200 is optimised for the latest models using low-precision computing, delivering over 10 petaFLOPS of 4-bit (FP4) performance and more than 5 petaFLOPS at 8-bit (FP8), all within a 750W system-on-a-chip (SoC) thermal design power. In practical terms, Maia 200 can comfortably serve today’s largest models, with headroom to spare for even larger models in the future.
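To illustrate the narrow-precision arithmetic that FP4 tensor cores operate on, the sketch below rounds values onto the standard FP4 (E2M1) value grid. This is a generic illustration of the number format, not a description of Maia 200's hardware behaviour, and the per-block scaling factors used in practice are omitted for brevity.

```python
# Illustrative sketch: round-to-nearest quantisation onto the FP4 (E2M1)
# value grid. E2M1 has 2 exponent bits and 1 mantissa bit, giving the
# positive values {0, 0.5, 1, 1.5, 2, 3, 4, 6} plus their negatives.
# Real low-precision pipelines pair this grid with per-block scale factors,
# which are omitted here to keep the example short.

FP4_POSITIVE = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_VALUES = sorted({-v for v in FP4_POSITIVE} | set(FP4_POSITIVE))

def quantise_fp4(x: float) -> float:
    """Map x to the nearest representable FP4 (E2M1) value (saturating)."""
    # Magnitudes beyond 6.0 saturate to the largest representable value.
    return min(FP4_VALUES, key=lambda v: abs(v - x))

if __name__ == "__main__":
    for x in [0.7, -2.6, 10.0]:
        print(f"{x} -> {quantise_fp4(x)}")
```

The coarseness of this grid is why models must be trained or calibrated for low precision; the payoff is that each value occupies only 4 bits, doubling effective memory bandwidth relative to FP8.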
Performance in FLOPS isn’t the only factor for boosting AI capabilities. It’s also vital to efficiently manage data flow. Maia 200 tackles this issue head-on with a state-of-the-art memory subsystem. This system uses narrow-precision data types, a specialised DMA engine, on-chip SRAM, and an optimised NoC fabric for high-bandwidth data movement, thereby enhancing token throughput.
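The interplay between compute and data movement can be made concrete with a back-of-envelope roofline calculation using the figures quoted above (10 petaFLOPS FP4, 7 TB/s HBM3e). The calculation itself is generic, not Maia-specific: it estimates how many operations a kernel must perform per byte fetched from memory before the chip is compute-bound rather than memory-bound.

```python
# Back-of-envelope roofline check using the headline figures quoted above.
# A kernel needs roughly peak_flops / peak_bandwidth operations per byte
# moved from HBM before peak compute, rather than memory bandwidth,
# becomes the bottleneck.

PEAK_FP4_FLOPS = 10e15   # 10 petaFLOPS at 4-bit precision
HBM_BANDWIDTH = 7e12     # 7 TB/s HBM3e bandwidth

def ridge_point(flops: float, bandwidth: float) -> float:
    """Arithmetic intensity (ops/byte) at which compute and memory balance."""
    return flops / bandwidth

if __name__ == "__main__":
    print(f"~{ridge_point(PEAK_FP4_FLOPS, HBM_BANDWIDTH):.0f} ops per byte")
```

A ratio in the thousands of operations per byte is why on-chip SRAM, DMA engines, and narrow data types matter: memory-bound phases of inference, such as decoding, sit far below the ridge point, so every byte saved translates directly into token throughput.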

Optimised AI Systems
On the systems level, Maia 200 features an innovative two-tier scale-up network design built around standard Ethernet. Its custom transport layer and closely integrated NIC unlock superior performance, reliability, and significant cost benefits without the need for proprietary fabrics.
Every accelerator provides:
- 2.8 TB/s of dedicated, bidirectional scale-up bandwidth
- Reliable, high-performance collective operations across clusters with up to 6,144 accelerators
This architecture ensures scalable performance for dense inference clusters while decreasing energy consumption and total cost of ownership across Azure’s global infrastructure.
Within each tray, four Maia accelerators are interconnected using direct, non-switched links, keeping high-bandwidth communication local to improve inference efficiency. The same communication protocols, built on the Maia AI transport protocol, are used for both intra-rack and inter-rack networking, allowing seamless scaling across nodes, racks, and clusters with minimal network latency. This unified fabric simplifies programming, improves workload flexibility, and minimises idle capacity while delivering stable performance and cost-effectiveness at cloud scale.
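The collective operations mentioned above typically follow a ring pattern, in which each accelerator exchanges fixed-size chunks with its neighbours. The in-process simulation below sketches a ring all-reduce over N simulated ranks; it is a generic illustration of the algorithm, not the Maia AI transport protocol itself.

```python
# Minimal in-process simulation of a ring all-reduce, the pattern commonly
# used for collective operations over a scale-up fabric. Each rank's vector
# has exactly one element per rank (one chunk each) to keep the sketch short.
# Generic illustration only, not the Maia transport protocol.

def ring_allreduce(vectors):
    """Sum-reduce equal-length per-rank vectors via the ring algorithm."""
    n = len(vectors)                     # number of ranks == number of chunks
    chunks = [list(v) for v in vectors]  # chunks[r][c]: rank r's copy of chunk c
    # Reduce-scatter: after n-1 steps, each rank holds one fully reduced chunk.
    for step in range(n - 1):
        for r in range(n):
            c = (r - step) % n           # chunk rank r forwards this step
            chunks[(r + 1) % n][c] += chunks[r][c]
    # All-gather: circulate the reduced chunks the rest of the way round.
    for step in range(n - 1):
        for r in range(n):
            c = (r + 1 - step) % n       # reduced chunk rank r forwards
            chunks[(r + 1) % n][c] = chunks[r][c]
    return chunks                        # every rank now holds the full sum

if __name__ == "__main__":
    print(ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```

The appeal of the ring is that each rank transmits only 2·(N−1)/N of the data regardless of cluster size, which is what makes collectives tractable across thousands of accelerators, provided the fabric delivers the per-link bandwidth reliably.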

A Cloud-Native Development Approach
A key principle of Microsoft’s silicon development projects is to validate as much of the entire system as possible before the final silicon is available.
We utilised a sophisticated pre-silicon environment to shape the Maia 200 architecture from the beginning, accurately modelling the computation and communication patterns of large language models. This collaborative environment allowed us to optimise silicon, networking, and system software as a cohesive unit long before the first silicon was ready.
From the outset, Maia 200 was designed for rapid, seamless integration in the datacentre, and extensive validation was carried out on some of the most complex system elements, including the backend network and our second-generation liquid cooling Heat Exchanger Unit. Native integration with the Azure control plane provides security, telemetry, diagnostics, and management features at both the chip and rack levels, ensuring maximum reliability and uptime for mission-critical AI workloads.
Thanks to these investments, AI models were running on Maia 200 silicon within days of the first packaged parts arriving. The time from first silicon to first datacentre rack deployment was less than half that of comparable AI infrastructure programs. This end-to-end approach, spanning chip design, software integration, and datacentre deployment, yields higher utilisation, faster time to production, and sustained improvements in performance per dollar and per watt at cloud scale.

Join the Maia SDK Preview
The era of large-scale AI is just beginning, and the infrastructure we create will determine what’s achievable. Our Maia AI accelerator program is designed to be multi-generational. As we roll out Maia 200 across our global infrastructure, we’re already working on future iterations, with the expectation that each new generation will set fresh standards for effectiveness and efficiency in handling major AI workloads.
Today, we invite developers, AI startups, and researchers to start experimenting with model and workload optimisation using the new Maia 200 software development kit (SDK). The SDK includes a Triton compiler, PyTorch support, low-level programming in NPL, a Maia simulator, and a cost calculator that surfaces efficiency trade-offs earlier in the development lifecycle. To sign up for the preview, click here.
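To give a flavour of the kind of estimate a cost calculator produces, the sketch below converts a throughput figure and an hourly instance price into dollars per million generated tokens. The function name and all input numbers are placeholder assumptions for illustration, not published Maia 200 figures or the SDK's actual API.

```python
# Hypothetical sketch in the spirit of a cost calculator: convert throughput
# and hourly price into cost per million generated tokens. The inputs below
# are illustrative placeholders, not published Maia 200 figures.

def cost_per_million_tokens(tokens_per_second: float,
                            instance_cost_per_hour: float) -> float:
    """Dollars per one million generated tokens at a given throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return instance_cost_per_hour / tokens_per_hour * 1_000_000

if __name__ == "__main__":
    # e.g. an assumed 20,000 tokens/s on an assumed $10/hour instance
    print(f"${cost_per_million_tokens(20_000, 10.0):.3f} per 1M tokens")
```

Running this kind of arithmetic early, before a model is even ported, helps teams decide whether an optimisation such as moving from FP8 to FP4 is worth the engineering effort for their workload.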
Visit our Maia 200 site for more photos, videos, and resources, or read more details.
Scott Guthrie oversees hyperscale cloud computing solutions and services, including Azure, Microsoft’s cloud platform, alongside generative AI solutions, data platforms, and cybersecurity solutions. These offerings aid organisations around the globe in tackling urgent challenges and driving long-term changes.