← all repositories
SparkAudio/Spark-TTS

A Qwen2.5 that speaks: LLM-native text-to-speech

Spark-TTS treats voice synthesis as a language modeling problem, skipping the usual acoustic feature pipelines.

Spark-TTS
Velocity · 7d
+23
★ / day
Trend
steady
star history

What it does Spark-TTS is a text-to-speech system built entirely on Qwen2.5. Instead of chaining separate models for acoustic features and audio generation, it predicts speech tokens directly through the LLM and reconstructs audio from there. The repo provides inference code, a Gradio web UI, and a Triton/TensorRT-LLM runtime path for production deployment.

The interesting bit The “single-stream decoupled speech tokens” approach collapses the traditional TTS pipeline into one model. No flow matching, no vocoder juggling—just the LLM spitting out tokens that become waveform. The project also leans hard into zero-shot voice cloning with a gallery of celebrity demos that doubles as a responsible-use warning.

Key highlights

  • Built on Qwen2.5; 0.5B parameter model available on Hugging Face
  • Zero-shot voice cloning with cross-lingual and code-switching support (Chinese/English)
  • Controllable generation: adjust gender, pitch, and speaking rate for virtual speakers
  • Triton/TensorRT-LLM serving reference included; ~876ms latency at concurrency=1 on L20 GPU
  • Gradio web UI for voice cloning (upload or record reference) and voice creation

Caveats

  • Training code and dataset (VoxBox) not yet released; this is inference-only for now
  • Windows support exists only via community-contributed instructions in an issue thread
  • The demo gallery includes public figures with an ethics disclaimer, but no technical guardrails are described

Verdict Worth a look if you’re building TTS into a product and want to test whether LLM-native synthesis beats your current pipeline. Skip it if you need training code today or run primarily on Windows without community patches.

heatdrop uses Google Analytics to see which pages get read — nothing else. Your call. How we handle data.