QwenLM/Qwen3-TTS

Alibaba's TTS engine talks fast, clones faster, and actually listens

A 1.7B-parameter speech model that streams its first audio packet after a single character and takes voice design instructions in plain English, Chinese, or eight other languages.

11.8k stars · Python · Image · Video · Audio · +85 ★/day (7-day velocity) · trend: steady

What it does

Qwen3-TTS is a family of text-to-speech models (0.6B and 1.7B parameters) that generates speech from text, clones voices from a 3-second audio clip, and designs entirely new voices from natural-language descriptions. It supports ten languages and comes with a Python package, vLLM integration, and a local web UI.

The interesting bit

The architecture skips the now-typical diffusion-transformer (DiT) stage entirely. Instead it uses a discrete multi-codebook language model fed by a custom 12Hz tokenizer, which the team claims eliminates cascading errors and lets the same model handle both streaming and batch generation. The advertised end-to-end streaming latency is 97ms, low enough that you could plausibly use it in a real-time voice agent without a separate “fast” model.
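To get a feel for what a 12Hz multi-codebook tokenizer implies, the arithmetic can be sketched as below. This is a back-of-the-envelope illustration only: it takes the nominal "12Hz" at face value as 12 frames per second (the exact rate may differ), and `num_codebooks` is a hypothetical parameter, since the number of codebooks is not stated here.

```python
import math

# Assumption: the nominal "12Hz" is read as 12 token frames per second of audio.
FRAME_RATE_HZ = 12


def frame_duration_ms(frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Milliseconds of audio covered by one token frame."""
    return 1000.0 / frame_rate_hz


def frames_for(duration_s: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of token frames needed to cover `duration_s` seconds of audio."""
    return math.ceil(duration_s * frame_rate_hz)


def codes_for(duration_s: float, num_codebooks: int,
              frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Total discrete codes, assuming one code per codebook per frame.

    `num_codebooks` is a placeholder; substitute the real value from the
    model documentation.
    """
    return frames_for(duration_s, frame_rate_hz) * num_codebooks


if __name__ == "__main__":
    print(f"one frame covers about {frame_duration_ms():.1f} ms of audio")
    print(f"a 3-second reference clip spans {frames_for(3)} frames")
    print(f"a 10-second utterance spans {frames_for(10)} frames")
```

At this nominal rate a 3-second cloning reference is only a few dozen frames, which gives some intuition for why a language model can process it cheaply.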

Key highlights

  • Dual-mode generation: one set of model weights handles both streaming and non-streaming generation; the first audio packet is emitted after a single input character
  • Voice design via prompt: describe age, gender, dialect, or emotion in natural language; the 1.7B variants support instruction control, the 0.6B variants do not
  • Tokenizer encode/decode: the 12Hz tokenizer is exposed directly, so you can compress speech to discrete codes and reconstruct it—useful for storage or downstream manipulation
  • Packaging is sane: pip install qwen-tts, standard HuggingFace from_pretrained loading, optional FlashAttention 2, and explicit vLLM serving instructions
  • Commercial API available: DashScope endpoint documented alongside the open weights
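If you want to check a first-packet latency claim like the one above on your own hardware, a small harness along these lines may help. It is a sketch, not the project's benchmark: `fake_streaming_tts` is a hypothetical stand-in that yields dummy bytes, and you would replace it with whatever iterator of audio chunks your actual streaming setup exposes.

```python
import time
from typing import Iterable, Iterator, Tuple


def fake_streaming_tts(text: str, chunk_delay_s: float = 0.01) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS call.

    Yields one dummy audio chunk per input character after a fixed delay.
    It does not call Qwen3-TTS or any real model.
    """
    for _ in text:
        time.sleep(chunk_delay_s)
        yield b"\x00" * 320


def time_to_first_packet(stream: Iterable[bytes]) -> Tuple[float, int]:
    """Consume a chunk stream and return (first-chunk latency in ms, chunk count).

    Timing starts just before the first chunk is requested, so any lazy
    setup done by the generator is included in the measurement.
    """
    start = time.perf_counter()
    first_latency_s = None
    chunks = 0
    for _ in stream:
        if first_latency_s is None:
            first_latency_s = time.perf_counter() - start
        chunks += 1
    if first_latency_s is None:
        raise ValueError("stream produced no audio chunks")
    return first_latency_s * 1000.0, chunks


if __name__ == "__main__":
    latency_ms, n = time_to_first_packet(fake_streaming_tts("hello"))
    print(f"first packet after {latency_ms:.1f} ms, {n} chunks total")
```

Measured this way, the number includes model warm-up and any network hop, so run it several times and discard the first result before comparing it with an advertised figure.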

Caveats

  • The README recommends a fresh Python 3.12 environment and warns about dependency conflicts; FlashAttention 2 compilation can be memory-hungry (they suggest setting MAX_JOBS=4 on machines with less than 96GB of RAM)
  • “Other models mentioned in the technical report will be released in the near future”—so the current lineup may not be the full picture
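The environment caveats above can be turned into a quick pre-install check. The sketch below only encodes the two rules of thumb quoted in this write-up (Python 3.12, and MAX_JOBS=4 below 96GB of RAM); the function names are made up for illustration, and you supply the machine's RAM yourself.

```python
import sys
from typing import Optional, Sequence

RAM_THRESHOLD_GB = 96   # threshold quoted above for capping build parallelism
CAPPED_JOBS = 4         # MAX_JOBS value suggested below that threshold


def is_recommended_python(version: Sequence[int] = sys.version_info) -> bool:
    """True if the interpreter is the recommended Python 3.12 series."""
    return tuple(version[:2]) == (3, 12)


def suggest_max_jobs(total_ram_gb: float) -> Optional[int]:
    """Return the MAX_JOBS cap to use for the FlashAttention 2 build.

    Returns None when RAM is at or above the threshold, meaning no cap
    is suggested.
    """
    return CAPPED_JOBS if total_ram_gb < RAM_THRESHOLD_GB else None


if __name__ == "__main__":
    if not is_recommended_python():
        print("warning: a fresh Python 3.12 environment is recommended")
    jobs = suggest_max_jobs(total_ram_gb=64)  # example value; use your own
    if jobs is not None:
        print(f"build FlashAttention 2 with MAX_JOBS={jobs}")
```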

Verdict

Worth a look if you’re building voice agents, audiobook pipelines, or anything that needs low-latency TTS with controllable prosody. Skip it if you need a battle-tested alternative that a large community has already debugged; the project is fresh enough that rough edges are still being discovered.
