Microsoft's voice AI that swallows hour-long podcasts whole
A research family of ASR and TTS models built on the bet that voice should be processed as long-form narrative, not chopped into seconds-long shards.

What it does
VibeVoice is a collection of speech models from Microsoft Research tackling both directions of the voice pipeline: transcription and synthesis. The ASR model ingests up to 60 minutes of audio in a single pass, outputting structured transcripts with speaker labels and timestamps. The TTS side can generate up to 90 minutes of multi-speaker audio (4 distinct voices), while a lightweight 0.5B streaming variant targets ~300ms first-audio latency for real-time use.
The interesting bit
The architecture sidesteps the usual chunk-and-stitch approach by running continuous speech tokenizers at a glacial 7.5 Hz frame rate. An LLM backbone handles context and dialogue flow; a diffusion head renders the acoustic details. The result is a model that treats a podcast or meeting as one coherent sequence rather than a bag of 30-second clips.
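Why the low frame rate matters is easy to check with back-of-the-envelope arithmetic. The sketch below assumes one token per tokenizer frame and reads the 64K-token context quoted for the ASR model as 65,536 tokens. Both are simplifying assumptions, and the 50 Hz comparison rate is a hypothetical figure chosen only for contrast, not a number from the VibeVoice release.

```python
# Rough sequence-length arithmetic for long-form audio.
# Assumptions (not from the VibeVoice release): one token per tokenizer
# frame, and "64K" interpreted as 64 * 1024 tokens.

FRAME_RATE_HZ = 7.5            # tokenizer frame rate quoted in the text
CONTEXT_TOKENS = 64 * 1024     # assumed reading of the 64K ASR context
COMPARISON_RATE_HZ = 50.0      # hypothetical higher frame rate, for contrast only


def speech_frames(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of tokenizer frames produced by `minutes` of audio."""
    return int(minutes * 60 * frame_rate_hz)


def fits_in_context(frames: int, context_tokens: int = CONTEXT_TOKENS) -> bool:
    """True if the frame count is within the assumed context budget."""
    return frames <= context_tokens


asr_frames = speech_frames(60)                       # 27000
tts_frames = speech_frames(90)                       # 40500
dense_frames = speech_frames(60, COMPARISON_RATE_HZ)  # 180000
clips_per_hour = (60 * 60) // 30                     # 120 clips of 30 seconds

print(f"60 min at 7.5 Hz: {asr_frames} frames, fits: {fits_in_context(asr_frames)}")
print(f"90 min at 7.5 Hz: {tts_frames} frames")
print(f"60 min at 50 Hz:  {dense_frames} frames, fits: {fits_in_context(dense_frames)}")
print(f"30-second clips in one hour: {clips_per_hour}")
```

Under those assumptions an hour of audio comes to about 27,000 frames, which would leave headroom in the context for the transcript text, while the same hour at the hypothetical 50 Hz rate would not fit at all. The chunked alternative would mean stitching together 120 separate 30-second clips.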
Key highlights
- ASR processes 60-minute audio within a 64K token context, doing speaker diarization and timestamping jointly rather than as post-processing
- Custom hotword injection for domain-specific vocabulary (names, technical terms)
- TTS supports 4-speaker conversations with turn-taking across 90-minute generations
- Realtime-0.5B model adds streaming text input and ~10-minute generation at a deployment-friendly size
- ASR now available via Hugging Face Transformers; vLLM inference supported for faster throughput
- 50+ languages supported in ASR; TTS covers English, Chinese, and others
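To make "speaker labels and timestamps" concrete, here is a minimal sketch of how a jointly produced transcript could be represented downstream. The `Segment` type, its field names, and the helper functions are hypothetical illustrations; they are not VibeVoice's actual output schema or API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Segment:
    """One transcript segment (hypothetical schema, for illustration only)."""
    speaker: str    # diarization label, e.g. "Speaker 1"
    start_s: float  # segment start, in seconds
    end_s: float    # segment end, in seconds
    text: str       # recognized words


def format_timestamp(seconds: float) -> str:
    """Render seconds as HH:MM:SS, truncating fractional seconds."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def render(segments: List[Segment]) -> str:
    """Return a readable transcript, ordered by segment start time."""
    ordered = sorted(segments, key=lambda seg: seg.start_s)
    return "\n".join(
        f"[{format_timestamp(seg.start_s)} - {format_timestamp(seg.end_s)}] "
        f"{seg.speaker}: {seg.text}"
        for seg in ordered
    )


def talk_time(segments: List[Segment]) -> Dict[str, float]:
    """Total speaking time in seconds for each speaker label."""
    totals: Dict[str, float] = {}
    for seg in segments:
        totals[seg.speaker] = totals.get(seg.speaker, 0.0) + (seg.end_s - seg.start_s)
    return totals


example = [
    Segment("Speaker 2", 12.5, 20.0, "Thanks for having me."),
    Segment("Speaker 1", 0.0, 12.5, "Welcome back to the show."),
]
print(render(example))
print(talk_time(example))
```

When each segment already carries who spoke, when, and what was said, per-speaker statistics such as `talk_time` are a simple fold over the list and need no separate alignment step between a transcript and a diarization track.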
Caveats
- The full VibeVoice-TTS code was removed from the repository in September 2025 after misuse was detected; model weights remain on Hugging Face but the training/inference code is gone
- Microsoft explicitly states this is “intended for research and development purposes only” and does not recommend commercial deployment without further testing
- Inherits biases and errors from its Qwen2.5 1.5B base model
Verdict
Worth exploring if you’re building meeting transcription, podcast generation, or long-form audio pipelines and want to experiment with end-to-end context rather than pipeline glue. Skip if you need production TTS code today: the source is offline, and the license posture is cautious.