microsoft/VibeVoice

Microsoft's voice AI that swallows hour-long podcasts whole

A research family of ASR and TTS models built on the bet that voice should be processed as long-form narrative, not chopped into seconds-long shards.

48.7k stars · Python · Image · Video · Audio · +170 ★/day over the last 7 days (trend: steady)

What it does

VibeVoice is a collection of speech models from Microsoft Research tackling both directions of the voice pipeline: transcription and synthesis. The ASR model ingests up to 60 minutes of audio in a single pass, outputting structured transcripts with speaker labels and timestamps. The TTS side can generate up to 90 minutes of multi-speaker audio (4 distinct voices), while a lightweight 0.5B streaming variant targets ~300 ms first-audio latency for real-time use.

The interesting bit

The architecture sidesteps the usual chunk-and-stitch approach by running continuous speech tokenizers at a glacial 7.5 Hz frame rate. An LLM backbone handles context and dialogue flow; a diffusion head renders the acoustic details. The result is a model that treats a podcast or meeting as one coherent sequence rather than a bag of 30-second clips.
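
To see why the low frame rate matters, here is a back-of-the-envelope sketch of how many tokenizer frames an hour-plus recording produces. It assumes one sequence position per frame and reads "64K" as 65,536 tokens; neither detail is confirmed by the project, and the 50 Hz comparison rate is a hypothetical reference point rather than a figure from VibeVoice.

```python
# Rough frame budget for long-form audio at a fixed tokenizer frame rate.
# Assumption: one sequence position per tokenizer frame. The model's real
# token accounting (text tokens, speaker tags, and so on) is not covered here.

FRAME_RATE_HZ = 7.5          # frame rate quoted in the write-up
CONTEXT_TOKENS = 64 * 1024   # assumption: "64K" means 65,536 tokens


def frames_for(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of tokenizer frames needed to cover `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


asr_frames = frames_for(60)            # 60-minute ASR input
tts_frames = frames_for(90)            # 90-minute TTS generation
dense_frames = frames_for(60, 50.0)    # hypothetical 50 Hz tokenizer

print(asr_frames, tts_frames, dense_frames)   # 27000 40500 180000
print(asr_frames <= CONTEXT_TOKENS)           # True
print(dense_frames <= CONTEXT_TOKENS)         # False
```

Under these assumptions an hour of audio costs 27,000 frames, which leaves room in a 64K window for text, while the same hour at the hypothetical 50 Hz rate would not fit at all.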

Key highlights

  • ASR processes 60-minute audio within a 64K token context, doing speaker diarization and timestamping jointly rather than as post-processing
  • Custom hotword injection for domain-specific vocabulary (names, technical terms)
  • TTS supports 4-speaker conversations with turn-taking across 90-minute generations
  • Realtime-0.5B model adds streaming text input and ~10-minute generation at a deployment-friendly size
  • ASR now available via Hugging Face Transformers; vLLM inference supported for faster throughput
  • 50+ languages supported in ASR; TTS covers English, Chinese, and others
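
A transcript with jointly produced speaker labels and timestamps can be thought of as a list of timed segments. The sketch below is illustrative only: the `Segment` class, its field names, and the sample lines are hypothetical and do not reflect VibeVoice's actual output schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Segment:
    """One diarized, timestamped span of speech (hypothetical schema)."""
    speaker: str
    start_s: float
    end_s: float
    text: str


def format_timestamp(seconds: float) -> str:
    """Render a second offset as HH:MM:SS, truncating fractions."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def render(segments: list[Segment]) -> str:
    """Produce a readable transcript, one segment per line."""
    return "\n".join(
        f"[{format_timestamp(s.start_s)} - {format_timestamp(s.end_s)}] "
        f"{s.speaker}: {s.text}"
        for s in segments
    )


def talk_time(segments: list[Segment]) -> dict[str, float]:
    """Total speaking time in seconds per speaker."""
    totals: dict[str, float] = defaultdict(float)
    for s in segments:
        totals[s.speaker] += s.end_s - s.start_s
    return dict(totals)


# Made-up sample data for illustration; not real model output.
segments = [
    Segment("Speaker 1", 0.0, 4.5, "Welcome back to the show."),
    Segment("Speaker 2", 4.5, 9.0, "Thanks for having me."),
    Segment("Speaker 1", 3540.0, 3545.0, "That's all for today."),
]

print(render(segments))
print(talk_time(segments))
```

Because speaker and timing arrive together in each segment, downstream steps such as per-speaker talk time need no separate alignment pass.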

Caveats

  • The full VibeVoice-TTS code was removed from the repository in September 2025 after misuse was detected; model weights remain on Hugging Face but the training/inference code is gone
  • Microsoft explicitly states this is “intended for research and development purposes only” and does not recommend commercial deployment without further testing
  • Inherits biases and errors from its Qwen2.5 1.5B base model

Verdict

Worth exploring if you’re building meeting transcription, podcast generation, or long-form audio pipelines and want to experiment with end-to-end context rather than pipeline glue. Skip if you need production TTS code today: the TTS code has been removed from the repository, and the license posture is cautious.
