Particle News: Microsoft Launches In‑House Speech and Image AI Models on Foundry

Overview

Microsoft AI, which unveiled the models Thursday, made MAI‑Transcribe‑1, MAI‑Voice‑1 and MAI‑Image‑2 available to developers on the Foundry platform and in the US‑only MAI Playground.
Published prices start at $0.36 per hour for transcription, $22 per 1 million characters for voice, and $5 per 1 million input tokens plus $33 per 1 million image output tokens for image generation.
Microsoft says MAI‑Transcribe‑1 delivers the lowest average word error rate on the FLEURS benchmark across 25 languages and beats OpenAI’s Whisper‑large‑v3 and Google’s Gemini 3.1 Flash on many of those languages.
The company reports that the same models already power Copilot, Bing, PowerPoint and Azure Speech. It also highlights speed gains: MAI‑Voice‑1 generates 60 seconds of audio in under one second, and MAI‑Transcribe‑1 runs 2.5× faster than a prior Azure offering.
A 2025 contract change lets Microsoft build its own frontier models while keeping OpenAI access through 2032. Leaders tout small teams and about half the GPU cost of rivals, though some gaps remain, including no speaker diarization in transcription and MAI‑Image‑2’s square‑only output.
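
For a rough sense of what the published starting prices mean in practice, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the rates are the figures quoted above, while the function names and the sample workloads (100 hours of audio, 500,000 characters, and the token counts) are hypothetical and not taken from Microsoft. Actual billing may involve tiers, minimums or rounding that are not modeled here.

```python
# Published starting prices quoted in the article (USD).
# Assumption: simple linear pricing with no tiers, minimums or rounding.
TRANSCRIBE_USD_PER_HOUR = 0.36
VOICE_USD_PER_MILLION_CHARS = 22.0
IMAGE_USD_PER_MILLION_INPUT_TOKENS = 5.0
IMAGE_USD_PER_MILLION_OUTPUT_TOKENS = 33.0


def transcription_cost(audio_hours: float) -> float:
    """Estimated cost of transcribing the given hours of audio."""
    return audio_hours * TRANSCRIBE_USD_PER_HOUR


def voice_cost(characters: int) -> float:
    """Estimated cost of synthesizing speech for the given character count."""
    return characters / 1_000_000 * VOICE_USD_PER_MILLION_CHARS


def image_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost of image generation from input and output token counts."""
    return (
        input_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_INPUT_TOKENS
        + output_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_OUTPUT_TOKENS
    )


if __name__ == "__main__":
    # Hypothetical workloads, chosen only to illustrate the arithmetic.
    print(f"100 hours of transcription: ${transcription_cost(100):.2f}")
    print(f"500,000 characters of voice: ${voice_cost(500_000):.2f}")
    print(f"2,000 input + 4,000 output image tokens: ${image_cost(2_000, 4_000):.4f}")
```

At these rates, transcription works out to less than a cent per minute of audio ($0.36 / 60 = $0.006).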