OpenBMB/VoxCPM

A TTS model that skips the tokenization step entirely

VoxCPM2 generates speech directly from text using continuous diffusion, with no discrete audio tokens required.

Star velocity (7d): +104 ★ / day · Trend: steady

What it does

VoxCPM2 is a 2B-parameter text-to-speech model that takes a different path from most modern TTS systems: it never tokenizes audio into discrete codes. Instead, it uses a diffusion autoregressive architecture to generate continuous speech representations end-to-end. The result is a single model, built on the MiniCPM-4 backbone, that handles 30 languages, voice cloning, and voice design.

The interesting bit

The “tokenizer-free” approach is the real architectural bet here. Most neural TTS pipelines (and audio LLMs generally) rely on some form of discrete vocabulary, such as VQ-VAE tokens or SoundStream codes. VoxCPM2 bypasses that entirely, which the team claims enables more natural prosody and expressiveness. The asymmetric AudioVAE V2 also takes 16 kHz reference inputs and produces 48 kHz output natively, so no external super-resolution model is needed.
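To make the contrast concrete, here is a toy sketch of what "discrete tokens" means in a codec-style pipeline: each continuous latent value is snapped to its nearest codebook entry, and the gap between the two is information the token no longer carries. This is not VoxCPM2 code. The latent values and the four-entry scalar codebook are invented for illustration, and real codecs quantize vectors against far larger codebooks.

```python
# Illustrative only: a toy 1-D codebook quantizer, not VoxCPM2 code.

def quantize(value, codebook):
    """Return (index, code) for the codebook entry nearest to `value`."""
    index = min(range(len(codebook)), key=lambda i: abs(codebook[i] - value))
    return index, codebook[index]

# Hypothetical continuous latent values, one per audio frame.
latents = [0.12, 0.48, 0.91, 0.33]
# Hypothetical 4-entry codebook (assumption for this sketch).
codebook = [0.0, 0.25, 0.5, 1.0]

# Token-based path: keep only the index of the nearest code.
tokens = [quantize(x, codebook)[0] for x in latents]
reconstructed = [codebook[t] for t in tokens]

# Quantization error: what the discrete token cannot represent.
errors = [abs(x - r) for x, r in zip(latents, reconstructed)]

if __name__ == "__main__":
    print("tokens:", tokens)
    print("largest quantization error:", round(max(errors), 3))
```

A tokenizer-free model works with values like `latents` directly, so this rounding step never happens. Whether that translates into audibly better prosody is the team's claim, not something this sketch demonstrates.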

Key highlights

  • 30 languages with no language tag required, plus Chinese dialect support (Cantonese, Sichuanese, Shanghainese/Wu, and others)
  • Voice Design: generate a speaker from a text description alone (gender, age, emotion, pace), with no reference audio
  • Controllable Cloning: clone from a short clip, then steer style with natural-language instructions
  • Ultimate Cloning: provide reference audio plus its transcript for continuation-style cloning that preserves timbre, rhythm, and emotion
  • Production paths: standard PyTorch (~0.3 real-time factor, RTF, on an RTX 4090), Nano-vLLM (~0.13 RTF), or vLLM-Omni with PagedAttention and an OpenAI-compatible /v1/audio/speech endpoint
  • Apache-2.0 license, weights and code fully open for commercial use
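For readers unfamiliar with the RTF figures above: real-time factor is synthesis time divided by the duration of the audio produced, so values below 1 mean faster than real time. The sketch below works through that arithmetic and also builds (without sending) a request in the shape OpenAI's speech API uses: a JSON body with `model`, `input`, and `voice`. The base URL, model id, and voice name are placeholders assumed for illustration, and whether vLLM-Omni accepts exactly these fields should be checked against its own documentation.

```python
import json
import urllib.request


def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def synthesis_time(rtf, audio_seconds):
    """Estimated wall-clock seconds needed to produce `audio_seconds` of speech."""
    return rtf * audio_seconds


def build_speech_request(base_url, model, text, voice):
    """Build a POST request shaped like OpenAI's /v1/audio/speech call.

    Nothing is sent here; pass the result to urllib.request.urlopen to send it.
    """
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # RTF figures quoted above; the 10-second clip length is an arbitrary example.
    for name, rtf in [("PyTorch", 0.3), ("Nano-vLLM", 0.13)]:
        print(f"{name}: about {synthesis_time(rtf, 10.0):.1f} s for 10 s of audio")

    # Placeholder server address, model id, and voice name (assumptions).
    request = build_speech_request(
        "http://localhost:8000", "voxcpm-placeholder", "Hello there.", "default"
    )
    print(request.get_method(), request.full_url)
```

By this arithmetic, going from roughly 0.3 to 0.13 RTF cuts synthesis time to well under half for the same clip length.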

Caveats

  • Requires Python 3.10–3.12, PyTorch ≥2.5.0, and CUDA ≥12.0, which is a fairly modern stack
  • vLLM-Omni integration is “rapidly evolving” and currently requires building from source
  • The “over 2 million hours” training data claim and the specific quality benchmarks come from the README and aren’t independently verified
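The version requirements above are easy to check before installing anything heavy. The following is a minimal sketch, not part of the VoxCPM repository: the helper names are made up for this example, and it relies only on `sys.version_info`, `torch.__version__`, and `torch.version.cuda`, treating a missing PyTorch install as a failed check.

```python
import re
import sys


def parse_version(text):
    """Turn a string like '2.5.1+cu121' into (2, 5, 1), ignoring build suffixes."""
    numbers = re.findall(r"\d+", text.split("+")[0])
    return tuple(int(n) for n in numbers[:3])


def python_ok(version_info=sys.version_info):
    """True for Python 3.10 through 3.12 inclusive."""
    return (3, 10) <= tuple(version_info[:2]) <= (3, 12)


def torch_ok(version_text):
    """True for PyTorch 2.5.0 or newer."""
    return parse_version(version_text) >= (2, 5, 0)


def cuda_ok(version_text):
    """True for CUDA 12.0 or newer; None means a CPU-only PyTorch build."""
    return version_text is not None and parse_version(version_text) >= (12, 0)


def check_environment():
    """Return a dict of pass/fail flags for the current interpreter."""
    report = {"python": python_ok()}
    try:
        import torch  # imported lazily so the script runs without PyTorch
    except ImportError:
        report["torch"] = False
        report["cuda"] = False
        return report
    report["torch"] = torch_ok(torch.__version__)
    report["cuda"] = cuda_ok(torch.version.cuda)
    return report


if __name__ == "__main__":
    for name, passed in check_environment().items():
        print(f"{name}: {'ok' if passed else 'does not meet the stated requirement'}")
```

Note that `torch.version.cuda` reports the CUDA version PyTorch was built against, which is not necessarily the same as the driver or toolkit installed on the machine.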

Verdict

Worth a look if you’re building TTS into a product and want one model that covers multilingual synthesis, voice cloning, and speaker design without juggling tokenizers or upsamplers. Less interesting if you’re already committed to an existing token-based pipeline and don’t need the voice-design features.
