Alibaba's TTS engine talks fast, clones faster, and actually listens
A 1.7B-parameter speech model that streams its first audio packet after a single character and takes voice design instructions in plain English, Chinese, or any of the eight other supported languages.

What it does
Qwen3-TTS is a family of text-to-speech models (0.6B and 1.7B parameters) that generates speech from text, clones voices from a 3-second audio clip, and designs entirely new voices from natural-language descriptions. It supports ten languages and comes with a Python package, vLLM integration, and a local web UI.
The interesting bit
The architecture skips the now-typical diffusion-transformer (DiT) bottleneck entirely. Instead it uses a discrete multi-codebook language model fed by a custom 12Hz tokenizer, which the team claims eliminates cascading errors and lets the same model handle both streaming and batch generation. The advertised end-to-end streaming latency is 97 ms, low enough that you could plausibly use it in a real-time voice agent without a separate “fast” model.
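To put the tokenizer's frame rate in perspective, here is a back-of-the-envelope sketch. It takes the nominal 12 Hz in the tokenizer's name at face value, which is an assumption (the exact rate may differ slightly), and the helper names are made up for illustration rather than taken from the qwen-tts package.

```python
import math

# Assumption: the nominal frame rate from the tokenizer's name, taken at face value.
NOMINAL_FRAME_RATE_HZ = 12.0


def frame_duration_ms(frame_rate_hz: float = NOMINAL_FRAME_RATE_HZ) -> float:
    """Milliseconds of audio covered by one token frame."""
    return 1000.0 / frame_rate_hz


def frames_for(duration_s: float, frame_rate_hz: float = NOMINAL_FRAME_RATE_HZ) -> int:
    """Number of token frames needed to cover `duration_s` seconds of audio."""
    return math.ceil(duration_s * frame_rate_hz)


if __name__ == "__main__":
    print(f"one frame covers about {frame_duration_ms():.1f} ms of audio")
    print(f"a 3-second clip spans {frames_for(3.0)} frames")
```

At that rate a single frame covers roughly 83 ms of audio, so the advertised 97 ms figure is in the same ballpark as one frame. How the project actually decomposes that latency is not something this sketch tries to model.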
Key highlights
- Dual-mode generation: a single set of model weights handles both streaming and non-streaming generation; the first audio packet is emitted after a single input character
- Voice design via prompt: describe age, gender, dialect, or emotion in natural language; the 1.7B variants support instruction control, while the 0.6B variants do not
- Tokenizer encode/decode: the 12Hz tokenizer is exposed directly, so you can compress speech to discrete codes and reconstruct it—useful for storage or downstream manipulation
- Packaging is sane: `pip install qwen-tts`, standard Hugging Face `from_pretrained` loading, optional FlashAttention 2, and explicit vLLM serving instructions
- Commercial API available: DashScope endpoint documented alongside the open weights
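The dual-mode idea from the first highlight can be pictured with a toy sketch. Nothing here calls the real qwen-tts API: `synthesize_stream`, `synthesize_batch`, and `typed_input` are hypothetical stand-ins, and the “audio packets” are placeholder integers. The point is only the calling pattern, where one generator serves both modes and the first packet is available after a single input character.

```python
from typing import Iterable, Iterator, List


def synthesize_stream(chars: Iterable[str]) -> Iterator[List[int]]:
    """Hypothetical stand-in for a streaming TTS model.

    Yields one placeholder "audio packet" per input character, as soon as
    that character arrives.
    """
    for ch in chars:
        yield [ord(ch) % 256]  # fake codes, not real audio


def synthesize_batch(text: str) -> List[int]:
    """Non-streaming mode: drain the same generator and concatenate the packets."""
    codes: List[int] = []
    for packet in synthesize_stream(text):
        codes.extend(packet)
    return codes


def typed_input(text: str, seen: List[str]) -> Iterator[str]:
    """Simulate text arriving one character at a time, recording what was consumed."""
    for ch in text:
        seen.append(ch)
        yield ch


if __name__ == "__main__":
    seen: List[str] = []
    stream = synthesize_stream(typed_input("hello", seen))
    first_packet = next(stream)
    # Only the first character has been pulled from the input at this point.
    print(first_packet, seen)
    print(synthesize_batch("hello"))
```

Because the batch path just drains the streaming generator, both modes share one code path, which is the property the project claims for its single set of weights.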
Caveats
- The README recommends a fresh Python 3.12 environment and warns about dependency conflicts; FlashAttention 2 compilation can be memory-hungry (they suggest `MAX_JOBS=4` on machines with less than 96 GB of RAM)
- “Other models mentioned in the technical report will be released in the near future”, so the current lineup may not be the full picture
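The FlashAttention caveat can be turned into a small helper. This is a sketch, not something from the README: `flash_attn_build_env` is a hypothetical function, the 96 GB threshold and the value 4 simply mirror the suggestion mentioned above, and the pip command in the comment assumes the standard `flash-attn` package installed with `--no-build-isolation`.

```python
import os
from typing import Dict


def flash_attn_build_env(total_ram_gb: float,
                         threshold_gb: float = 96.0,
                         max_jobs: int = 4) -> Dict[str, str]:
    """Return a copy of the environment, capping parallel compile jobs on smaller machines.

    Hypothetical helper: sets MAX_JOBS only when RAM is below the threshold.
    """
    env = dict(os.environ)
    if total_ram_gb < threshold_gb:
        env["MAX_JOBS"] = str(max_jobs)
    return env


if __name__ == "__main__":
    env = flash_attn_build_env(total_ram_gb=64)
    print("MAX_JOBS =", env.get("MAX_JOBS"))
    # To actually build, one could pass `env` to a subprocess, for example:
    # subprocess.run([sys.executable, "-m", "pip", "install", "flash-attn",
    #                 "--no-build-isolation"], env=env, check=True)
```

Returning a copied environment rather than mutating `os.environ` keeps the job cap scoped to the build step.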
Verdict
Worth a look if you’re building voice agents, audiobook pipelines, or anything that needs low-latency TTS with controllable prosody. Skip it if you need a battle-tested, widely community-debugged alternative today; the project is fresh enough that rough edges are still being discovered.