
Microsoft’s New In‑House AI Models (MAI‑Transcribe, MAI‑Voice, MAI‑Image)

MAI‑Transcribe‑1 is Microsoft’s first in-house speech recognition model. It supports 25 languages and is built to transcribe noisy, real-world audio, which makes it a good fit for meetings and call centres.

Key Features

  • Top-tier transcription accuracy suitable for enterprise use
  • Built for understanding multilingual and accented speech
  • Lower GPU cost than older Azure speech models

MAI‑Voice‑1 is a high-quality voice generation model that produces natural, engaging speech and preserves the speaker’s identity, even across long-form audio.

Key Features

  • Creates up to 60 seconds of audio in about 1 second
  • Allows for custom voice creation
  • Optimised for use in voice agents and conversational interfaces
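
The first figure implies generation roughly 60 times faster than real time. As a back-of-the-envelope sketch, the snippet below estimates how long a longer clip might take if that ratio held linearly. That linearity is an assumption made here for illustration (the stated figure only covers clips of up to 60 seconds), and the function and constant names are hypothetical.

```python
# Assumed ratio taken from the stated figure: ~60 s of audio per ~1 s of compute.
ASSUMED_SPEEDUP = 60.0


def estimated_generation_seconds(audio_seconds: float,
                                 speedup: float = ASSUMED_SPEEDUP) -> float:
    """Estimate compute time for a clip, assuming linear scaling (an assumption)."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    return audio_seconds / speedup


# A 10-minute narration (600 s of audio) under the assumed ratio.
ten_minute_estimate = estimated_generation_seconds(600)
print(f"Estimated generation time: {ten_minute_estimate:.1f} s")
```

Real latency will also depend on network overhead, batching, and any per-request limits, none of which are modelled here.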

MAI‑Image‑2 is Microsoft’s most capable text-to-image model. It is already used in production Copilot experiences.

Key Features

  • Generates highly detailed, photorealistic images
  • Accurately renders text within images
  • Designed for production-ready speed and cost efficiency

If you’re an Azure developer, here is what this launch changes for you:

  1. First-party AI stack
    You can now build speech, voice, and image workloads without depending on outside AI providers.
  2. Enterprise-ready by default
    These models come with Azure RBAC, Managed Identity, compliance, and governance through Microsoft Foundry.
  3. Agent-first design
    MAI models are designed to be integrated into AI agents, rather than used only as standalone APIs that you call.
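
To make the agent-first idea concrete, here is a minimal sketch of an agent that treats each model as a named tool and chains them. The three tool functions are stubs that return strings; they stand in for real model calls and do not reflect any actual MAI or Foundry SDK. All names (`TOOLS`, `run_plan`, and the tool functions) are hypothetical.

```python
from typing import Callable, Dict, List


# Stub tools standing in for model calls (hypothetical, no network access).
def transcribe(audio: str) -> str:
    return f"transcript of {audio}"


def speak(text: str) -> str:
    return f"audio for '{text}'"


def draw(prompt: str) -> str:
    return f"image of {prompt}"


# Registry the agent uses to look tools up by name.
TOOLS: Dict[str, Callable[[str], str]] = {
    "transcribe": transcribe,
    "speak": speak,
    "draw": draw,
}


def run_plan(plan: List[str], first_input: str) -> str:
    """Run tools in order, feeding each tool's output into the next."""
    value = first_input
    for name in plan:
        if name not in TOOLS:
            raise KeyError(f"unknown tool: {name}")
        value = TOOLS[name](value)
    return value


result = run_plan(["transcribe", "speak"], "meeting.wav")
print(result)
```

In a real agent the plan would typically be chosen by a language model at run time rather than hard-coded, and each stub would be replaced by a call to a deployed model.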

Here’s a typical enterprise architecture that employs MAI models.

Example Code for MAI‑Transcribe‑1:
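
A minimal sketch of what a transcription call might look like from Python, using only the standard library. It builds the HTTP request but does not send it. The endpoint host, URL path, deployment name, query parameter, and header layout are placeholders assumed for illustration; they are not the documented MAI‑Transcribe‑1 API, so check the Microsoft Foundry documentation for the actual contract.

```python
import urllib.request

# Hypothetical placeholders: replace with values from your own deployment.
ENDPOINT = "https://example-resource.example.invalid"
DEPLOYMENT = "mai-transcribe-1"


def build_transcription_request(audio_bytes: bytes,
                                token: str,
                                language: str = "en") -> urllib.request.Request:
    """Build (but do not send) a POST request carrying raw audio bytes."""
    # Assumed URL shape; the real path and parameters may differ.
    url = f"{ENDPOINT}/deployments/{DEPLOYMENT}/transcriptions?language={language}"
    return urllib.request.Request(
        url,
        data=audio_bytes,
        method="POST",
        headers={
            # Bearer token, e.g. one obtained through Managed Identity.
            "Authorization": f"Bearer {token}",
            "Content-Type": "audio/wav",
        },
    )


req = build_transcription_request(b"RIFF....", token="<token>")
print(req.get_method(), req.full_url)
```

Sending the request would be a matter of passing `req` to `urllib.request.urlopen` once the placeholders point at a real deployment.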


Microsoft’s MAI models are more than new endpoints: they mark a shift in how Azure developers can build multimodal and agent-based AI solutions.

| Aspect | Before MAI (Azure & External Models) | After MAI (MAI‑Transcribe, Voice, Image) |
| --- | --- | --- |
| Model Ownership | Heavily reliant on third-party models (including OpenAI, external TTS/STT providers) | First-party Microsoft-built models optimised and managed by Microsoft |
| Enterprise Integration | AI models were integrated into Azure | AI models are now native to Microsoft Foundry |
| Governance & Compliance | Various controls depending on the model provider | Unified governance under Azure RBAC, Entra ID, Purview, and Managed Identity |
| Agent Readiness | Primarily single-request / single-response APIs | Designed for agent-oriented, long-running workflows |
| Cost Predictability | Token-based or varied pricing models | Enterprise-optimised pricing offering good value for performance |
| Operational Consistency | Different SDKs, APIs, and quotas | Single Foundry toolset and SDK interface |
