Is faster-qwen3-tts open source?

Yes — andimarafioti/faster-qwen3-tts is open source, released under the MIT license.

What language is faster-qwen3-tts written in?

andimarafioti/faster-qwen3-tts is primarily written in Python.

How popular is faster-qwen3-tts?

andimarafioti/faster-qwen3-tts has 1.2k stars on GitHub.

Where can I find faster-qwen3-tts?

andimarafioti/faster-qwen3-tts is on GitHub at https://github.com/andimarafioti/faster-qwen3-tts.


andimarafioti/faster-qwen3-tts

Qwen3-TTS hits real time with nothing but CUDA graphs

It drags Qwen3-TTS into real-time territory without touching Flash Attention, vLLM, or Triton.

★ 1.2k stars · Python · Image, Video, Audio · Inference, Serving




What it does

Wraps the Qwen3-TTS speech-synthesis models (0.6B and 1.7B) in a fast inference path built on torch.cuda.CUDAGraph. The result is a drop-in replacement that yields audio chunks as they are generated, supporting voice cloning, preset speakers, and instruction-based voice design. It also ships with an OpenAI-compatible API server and a minimal web UI.
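As a rough illustration of what "OpenAI-compatible" means for a client, the sketch below builds (but does not send) a request for a /v1/audio/speech route. The payload fields follow the public OpenAI speech API; the host, port, model name, and voice name are placeholders chosen for the example, not values taken from the repository.

```python
import json
import urllib.request


def build_speech_request(text, base_url="http://localhost:8000",
                         model="tts-1", voice="alloy", response_format="wav"):
    """Build a POST request for an OpenAI-style /v1/audio/speech endpoint.

    base_url, model, and voice are hypothetical defaults for illustration only.
    """
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_speech_request("Hello from a local TTS server.")
print(req.get_method(), req.full_url)

# Sending is left commented out because it needs a running server:
# with urllib.request.urlopen(req) as resp, open("speech.wav", "wb") as f:
#     f.write(resp.read())
```

Because the route mirrors the OpenAI shape, an existing OpenAI client pointed at a different base URL should in principle work as well; check the project's README for the actual port and accepted voice names.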

The interesting bit

The author deliberately skipped the standard acceleration toolkit (no Flash Attention, no vLLM, no Triton kernels) and still reports RTF numbers above 1.0, meaning audio is generated faster than it plays back, on hardware ranging from a Jetson AGX Orin to an RTX 4090. The trick is to capture the predictor and the talker as CUDA graphs and replay them at every decoding step, which turns out to be enough to outrun the baseline by roughly 2–10× on most cards.
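The capture-then-replay idea can be sketched with PyTorch's public CUDA graph API. This is a minimal illustration, not the project's code: a small nn.Linear stands in for the predictor or talker, and the function name and sizes are made up for the example. It follows the usual pattern of warming up on a side stream, capturing one forward pass into static buffers, then overwriting the input buffer and replaying.

```python
try:
    import torch
except ImportError:  # keep the sketch importable without PyTorch
    torch = None


def capture_and_replay(steps=3, width=64):
    """Capture one forward pass in a CUDA graph, then replay it once per step.

    Returns a list of CPU output tensors, or None when PyTorch or CUDA is
    unavailable. The Linear layer is a stand-in, not a Qwen3-TTS module.
    """
    if torch is None or not torch.cuda.is_available():
        return None

    device = torch.device("cuda")
    model = torch.nn.Linear(width, width).to(device).eval()
    static_in = torch.zeros(1, width, device=device)  # reused input buffer

    # Warm up on a side stream so lazy initialization happens before capture.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side), torch.no_grad():
        for _ in range(3):
            model(static_in)
    torch.cuda.current_stream().wait_stream(side)

    # Capture: the kernels launched here are recorded, not re-issued later.
    graph = torch.cuda.CUDAGraph()
    with torch.no_grad(), torch.cuda.graph(graph):
        static_out = model(static_in)

    outputs = []
    for step in range(steps):
        static_in.fill_(float(step))  # write new data into the same memory
        graph.replay()                # relaunch the recorded kernels
        outputs.append(static_out.clone().cpu())
    return outputs
```

The gain comes from replacing many small per-step kernel launches with a single graph launch, which matters most for autoregressive loops where each step does little work.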

Key highlights

Benchmarked speedups range from about 2× to almost 10× depending on GPU and model size, turning sub-real-time baselines into live-ready throughput.
Supports three generation modes: voice cloning from reference audio, predefined CustomVoice speakers, and instruction-driven VoiceDesign.
Streams audio via pull-based generators with configurable chunk sizes; the CUDA graphs themselves do not change between streaming and batch paths.
Includes an OpenAI-compatible /v1/audio/speech endpoint and a minimal web UI that displays live TTFA and RTF metrics.
Handles the 12 Hz codec’s causal decoder context automatically in voice-clone mode so the generated voice actually matches the reference.
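To make the streaming and metrics bullets concrete, here is a small sketch of pull-based chunking with TTFA and RTF measured around it. Everything in it is illustrative: fake_frames stands in for the model, the 12 frames-per-second rate is the nominal figure from the text, and RTF is assumed to mean seconds of audio produced per second of wall-clock time, so that values above 1.0 are faster than real time.

```python
import time

FRAME_RATE_HZ = 12  # nominal codec frame rate taken from the text (assumption)


def fake_frames(n):
    """Hypothetical stand-in for a model that emits one codec frame per step."""
    for i in range(n):
        yield i


def stream_chunks(frames, chunk_size):
    """Pull-based chunking: yield a chunk as soon as chunk_size frames exist."""
    chunk = []
    for frame in frames:
        chunk.append(frame)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # flush the final partial chunk
        yield chunk


def measure(chunks, clock=time.perf_counter):
    """Consume a chunk generator and report TTFA, audio length, and RTF."""
    start = clock()
    ttfa = None
    n_frames = 0
    for chunk in chunks:
        if ttfa is None:
            ttfa = clock() - start  # time to first audio chunk
        n_frames += len(chunk)
    elapsed = clock() - start
    audio_seconds = n_frames / FRAME_RATE_HZ
    rtf = audio_seconds / elapsed if elapsed > 0 else float("inf")
    return {"ttfa": ttfa, "audio_seconds": audio_seconds, "rtf": rtf}


stats = measure(stream_chunks(fake_frames(24), chunk_size=8))
print(stats)
```

A smaller chunk_size lowers TTFA because the first chunk fills sooner, at the cost of more frequent hand-offs to the consumer; the generator only does work when the caller asks for the next chunk.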

Caveats

PyTorch versions below 2.5.1 can fail during graph capture with an “operation not permitted when stream is capturing” error; the README explicitly sets 2.5.1+ as the floor.
RTX 50-series / Blackwell GPUs need a separate CUDA 12.8 PyTorch wheel, which is noted but not automatically handled.
The default voice-clone mode prepends the reference audio into the model context (ICL), which can cause a brief artifact at the start because the model literally continues the reference sentence.
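The PyTorch floor from the first caveat is easy to check before attempting graph capture. The helper below is a hypothetical pre-flight check, not something the project ships; it parses a version string such as "2.5.1+cu121" and compares it against 2.5.1.

```python
import re

MIN_TORCH = (2, 5, 1)  # floor stated in the caveat above


def parse_version(version):
    """Extract (major, minor, patch) from strings such as '2.5.1+cu121'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part or 0) for part in match.groups())


def meets_floor(version, floor=MIN_TORCH):
    """Return True when the version is at or above the required floor."""
    return parse_version(version) >= floor


for candidate in ("2.4.1", "2.5.1+cu121", "2.6.0"):
    print(candidate, meets_floor(candidate))
```

In a real environment the input would be str(torch.__version__); failing early with a clear message is friendlier than hitting the stream-capture error partway through model setup.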

Verdict

Worth a look if you are running Qwen3-TTS locally and need latency low enough for interactive use, especially on edge NVIDIA hardware. Skip it if you are married to a non-NVIDIA stack or need a fully polished, zero-config deployment.
