Microsoft launches MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2 on Microsoft Foundry

Microsoft's new in-house AI models for speech, voice, and image generation are now in public preview on Microsoft Foundry — here's what they do and who they're for.

Microsoft has put three new AI models into public preview on Microsoft Foundry: MAI-Transcribe-1 for speech recognition, MAI-Voice-1 for speech generation, and MAI-Image-2 for text-to-image generation. All three come from Microsoft’s internal MAI Superintelligence team, led by Microsoft AI CEO Mustafa Suleyman, which was formed just six months ago with an explicit goal of making Microsoft less reliant on third-party AI providers.

These are not research models or experimental prototypes. They are already running inside Microsoft’s own products — Copilot, Bing, Teams, and PowerPoint — and are now available for developers and enterprises to use directly via Foundry.

MAI-Transcribe-1: Speech recognition built for production

MAI-Transcribe-1 is Microsoft’s first-generation enterprise speech-to-text model. It uses a transformer-based text decoder paired with a bi-directional audio encoder, accepts MP3, WAV, and FLAC files up to 200MB, and covers 25 languages.
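Those input limits are easy to check before you upload anything. The sketch below is a hypothetical pre-flight helper, not part of any Microsoft SDK: the function name is made up for illustration, and it assumes "200MB" means 200 million bytes, which the announcement does not specify.

```python
from pathlib import PurePath

# Formats and size limit as quoted in the article.
ALLOWED_EXTENSIONS = {".mp3", ".wav", ".flac"}
# Assumption: decimal megabytes. The service may define the limit differently.
MAX_BYTES = 200 * 1000 * 1000


def check_audio_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    ext = PurePath(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or '(none)'}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file is {size_bytes} bytes, limit is {MAX_BYTES}")
    return problems
```

Checking the extension only catches obvious mistakes; it does not verify that the file contents are actually valid audio.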

The headline numbers: Microsoft reports the lowest Word Error Rate among current competing models on the FLEURS benchmark, with the model ranking first in 11 of the top 25 global languages. It beats Scribe v2, Whisper large-V3, GPT-Transcribe, and Gemini 3.1 Flash-Lite. Batch transcription runs at 2.5x the speed of the existing Microsoft Azure Fast offering, and GPU cost is roughly 50% lower than comparable alternatives.
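For readers unfamiliar with the metric, Word Error Rate is the number of word-level substitutions, deletions, and insertions needed to turn a transcript into the reference text, divided by the number of words in the reference. The sketch below is a minimal implementation of that standard definition, not Microsoft's evaluation code. Benchmark harnesses typically also normalise casing and punctuation first, which is left out here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution, or a match if cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For example, the reference "the cat sat on the mat" against the transcript "the cat sit on mat" has one substitution and one deletion over six reference words, giving a WER of 2/6.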

Pricing starts at $0.36 per hour.
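To put that figure in context, here is a rough cost estimator. It assumes billing is strictly linear in audio duration at the quoted starting rate, with no minimum charge, rounding increments, or volume tiers. None of those details are confirmed in the announcement, so treat the output as a ballpark only.

```python
# "Starts at" price quoted in the article; your actual rate may differ.
PRICE_PER_AUDIO_HOUR_USD = 0.36


def estimate_transcription_cost(audio_seconds: float) -> float:
    """Estimate cost in USD, assuming simple linear per-hour billing."""
    hours = audio_seconds / 3600
    return round(hours * PRICE_PER_AUDIO_HOUR_USD, 2)


# Example: 500 recorded meetings of 45 minutes each (375 hours of audio).
backlog_seconds = 500 * 45 * 60
backlog_cost = estimate_transcription_cost(backlog_seconds)
```

Under those assumptions, a 375-hour backlog works out to about $135.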

For developers, MAI-Transcribe-1 slots into workflows covering subtitle generation, podcast transcription, meeting archives, compliance recording, call centre QA, and legal discovery. Diarisation, contextual biasing, and streaming support are listed as coming soon, so if those are critical requirements for you, it’s worth keeping an eye on the roadmap before committing.

MAI-Voice-1: Fast, expressive text-to-speech

MAI-Voice-1 generates 60 seconds of natural-sounding audio in under one second on a single GPU. That is a meaningful engineering achievement for production workloads where latency and infrastructure cost matter.
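As a back-of-the-envelope illustration, 60 seconds of audio in under one second means generation runs at least 60 times faster than real time. The sketch below turns that into an upper bound on generation time for longer content. It assumes throughput stays constant as length grows, which the announcement does not claim, so the result is an extrapolation rather than a measured figure.

```python
# Figures quoted in the article: 60 s of audio in under 1 s on a single GPU.
AUDIO_SECONDS = 60.0
GENERATION_SECONDS_UPPER_BOUND = 1.0

# Lower bound on how much faster than real time the model runs.
MIN_SPEEDUP = AUDIO_SECONDS / GENERATION_SECONDS_UPPER_BOUND


def max_generation_seconds(audio_seconds: float) -> float:
    """Upper bound on generation time, assuming constant throughput."""
    return audio_seconds / MIN_SPEEDUP


# Example: a 10-hour audiobook.
audiobook_bound = max_generation_seconds(10 * 3600)
```

On that assumption, a 10-hour audiobook would take no more than 600 seconds, or 10 minutes, of single-GPU time.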

The model preserves speaker identity across long-form content, which makes it suitable for audiobooks, podcast production, and voice assistants that need consistency over extended sessions. It currently powers Copilot’s Audio Expressions and podcast features.

The more interesting developer capability is custom voice creation. Through the Personal Voice feature in Azure Speech, you can clone a voice from as little as a 10-second audio sample. This goes through an approval process aligned with Microsoft’s responsible AI policies, so it is not instant, but it is a low-friction path to a bespoke voice for your product.

Pricing starts at $22 per 1 million characters.

Combined with MAI-Transcribe-1 and a language model of your choice, MAI-Voice-1 gives you the building blocks for a full voice pipeline: speech in, text processed, speech out.
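The shape of that pipeline can be sketched in a few lines. The code below is a structural illustration only: the three stages are plain Python callables that you would back with real Foundry or language model calls, and the stubs used here are stand-ins, not actual client code. The class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoicePipeline:
    """Speech in, text processed, speech out."""
    transcribe: Callable[[bytes], str]   # e.g. a speech-to-text call
    respond: Callable[[str], str]        # e.g. a language model call
    synthesize: Callable[[str], bytes]   # e.g. a text-to-speech call

    def run(self, audio_in: bytes) -> bytes:
        text_in = self.transcribe(audio_in)
        text_out = self.respond(text_in)
        return self.synthesize(text_out)


# Stub stages so the sketch runs without any network access.
# They treat the "audio" bytes as UTF-8 text purely for demonstration.
pipeline = VoicePipeline(
    transcribe=lambda audio: audio.decode("utf-8"),
    respond=lambda text: text.upper(),
    synthesize=lambda text: text.encode("utf-8"),
)
reply_audio = pipeline.run(b"hello")
```

Keeping each stage behind a simple callable makes it straightforward to swap one provider for another, or to test the flow without calling a live service.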

MAI-Image-2: Text-to-image with a strong benchmark debut

MAI-Image-2 debuted at number three on the Arena.ai leaderboard when it was first released on MAI Playground in March 2026. It is now available on Microsoft Foundry, and Microsoft reports at least 2x faster generation times compared to the previous image generation offering, based on production traffic data.

The model focuses on areas that matter in real creative workflows: photorealistic image generation, accurate text rendering within images (a persistent weak point for many models), and handling of complex layouts and detailed scenes. It currently outputs images up to 1024x1024 pixels in a 1:1 format, with a parameter count in the 10B to 50B range and a maximum prompt context of 32K tokens.

Pricing starts at $5 per 1M tokens for text input and $33 per 1M tokens for image output.
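Because input and output are billed at different rates, a quick estimate needs both token counts. The sketch below applies the two quoted starting rates linearly. How many output tokens a single image consumes is not stated in the announcement, so the function takes that number as an input rather than guessing it.

```python
# "Starts at" rates quoted in the article, in USD per 1 million tokens.
TEXT_INPUT_USD_PER_MILLION = 5.0
IMAGE_OUTPUT_USD_PER_MILLION = 33.0


def estimate_image_cost(prompt_tokens: int, image_output_tokens: int) -> float:
    """Estimate cost in USD, assuming simple linear per-token billing."""
    input_cost = prompt_tokens / 1_000_000 * TEXT_INPUT_USD_PER_MILLION
    output_cost = image_output_tokens / 1_000_000 * IMAGE_OUTPUT_USD_PER_MILLION
    return round(input_cost + output_cost, 2)


# Example: a month with 2M prompt tokens and 1M image output tokens.
monthly_cost = estimate_image_cost(2_000_000, 1_000_000)
```

In that example the bill comes to $43, of which $33 is image output, so output volume is likely to dominate the cost for most workloads.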

MAI-Image-2 is already rolling out to Bing and PowerPoint, and is the model behind the improved image generation in Copilot.

The bigger picture: Microsoft building its own AI stack

The context here matters. In November 2025, Suleyman announced the formation of the MAI Superintelligence team with a stated goal of AI self-sufficiency. Microsoft’s renegotiated partnership with OpenAI, finalised in 2025, removed a contractual restriction that had previously prevented Microsoft from developing its own broadly capable models. Microsoft has since publicly stated its intention to build a frontier-class general-purpose LLM by 2027.

Suleyman has been careful to frame this as additive rather than a pivot away from OpenAI: “Nothing’s changing with the OpenAI partnership. We will be in partnership with them at least until 2032 and hopefully a lot longer.” The MAI models are specialist systems covering specific modalities, not a replacement for GPT-class models in the near term.

That said, MAI-Transcribe-1 does directly target the transcription workload that OpenAI’s Whisper models have long owned in the open-source space. The benchmark results also show it outperforming Google’s Gemini 3.1 Flash-Lite on 22 of 25 languages. Microsoft is clearly not building these models to sit quietly in the background.

What this means for you

If you are a developer building on Azure, these models are available now through Microsoft Foundry. You can also try them first in MAI Playground before committing to a full integration. The transcription and voice models are both available there; MAI-Image-2 has been live on Playground since March.

If you are running enterprise workloads, the Foundry deployment comes with the governance, guardrails, and compliance controls you would expect from Azure infrastructure. These models have been red-teamed and are running at scale in Microsoft’s own products, which is a reasonable proxy for production readiness.

If you are a Microsoft 365 or Copilot user, you may already be benefiting from these models without realising it. MAI-Transcribe-1 is rolling out to Copilot Voice mode and Teams for conversation transcription. MAI-Voice-1 is behind Copilot’s podcast and Audio Expressions features. MAI-Image-2 is the engine behind the improved image generation in Copilot and Bing.

For teams that have been running Whisper or third-party transcription APIs, MAI-Transcribe-1 is worth a serious evaluation. The accuracy numbers are competitive, the pricing is straightforward, and the Azure integration removes a layer of infrastructure management. The missing diarisation and streaming support will be blockers for some use cases, but for batch transcription at scale, it is ready to use today.