A 99M-parameter TTS that runs on your e-reader
Supertonic squeezes multilingual text-to-speech into edge devices by shipping everything as optimized ONNX models with a small memory footprint.

What it does
Supertonic is an on-device text-to-speech system that synthesizes 44.1kHz WAV audio locally using ONNX Runtime. It supports 31 languages, requires no GPU, and targets everything from desktops to Raspberry Pis and e-readers. A Python SDK (pip install supertonic) auto-downloads model assets on first run, and a supertonic serve command exposes both native and OpenAI-compatible HTTP endpoints for local integration.
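To make the OpenAI-compatible endpoint concrete, here is a minimal sketch of a client that uses only the Python standard library. The request shape follows OpenAI's public /v1/audio/speech API. Everything specific to Supertonic is an assumption for illustration and should be checked against the README: the host and port, the model and voice names, and whether supertonic serve mounts the route at that exact path.

```python
import json
import urllib.request

# Assumption: placeholder address; use whatever `supertonic serve` reports on startup.
BASE_URL = "http://localhost:8000"


def build_speech_request(text, voice="default", model="supertonic", base_url=BASE_URL):
    """Build (but do not send) a POST request in the OpenAI /v1/audio/speech shape.

    The `voice` and `model` defaults are hypothetical names, not documented values.
    """
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": "wav",
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text, out_path, **kwargs):
    """Send the request to a running local server and save the returned audio bytes."""
    req = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as f:
        f.write(resp.read())
```

With a server running, synthesize_to_file("Hello from the edge.", "hello.wav") would write the response to disk. Because the request follows the OpenAI shape, existing OpenAI client code should in principle only need its base URL changed.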
The interesting bit
The project bets on ONNX as the universal delivery mechanism: one 99M-parameter checkpoint, runtime examples for about a dozen languages and platforms (Python, Rust, Swift, Go, Java, C++, C#, Node.js, Browser/WebGPU, Flutter, iOS), and a lang="na" mode that skips language detection entirely. That is unusual in a field where most open TTS models have 0.7B–2B parameters and depend on the cloud.
Key highlights
- 31 languages with a single model, no separate language adapters
- 10 inline expression tags (<laugh>, <breath>, <sigh>) for prosodic control without reference audio
- Voice Builder for creating permanent custom voice profiles from your own audio
- Competitive WER/CER against much larger models on the Minimax-MLS-test benchmark
- Batch inference support and quality/speed tradeoffs via total_steps (5–12)
Caveats
- Model assets live on Hugging Face and require Git LFS; first-time setup involves cloning large files
- Per-language accuracy varies; the README shows some languages where Supertonic 3 lags behind larger competitors (e.g., Finnish CER at 5.40 vs OmniVoice’s 3.94)
- The README claims “lightning fast” synthesis but provides no concrete real-time factor (RTF) or latency numbers to back it up
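Since the README gives no speed figures, you can measure the real-time factor yourself. RTF is synthesis time divided by the duration of the audio produced, so values below 1.0 mean faster than real time. The sketch below is a generic harness built on the standard library. The synthesize callable is a hypothetical stand-in for whatever function you wrap around the SDK or the local server, not a Supertonic API.

```python
import time
import wave


def wav_duration_seconds(path):
    """Return the playback duration of a WAV file in seconds."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()


def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds


def measure_rtf(synthesize, text, out_path):
    """Time one synthesis call and return its RTF.

    `synthesize` is a hypothetical callable taking (text, out_path) that
    writes a WAV file to out_path.
    """
    start = time.perf_counter()
    synthesize(text, out_path)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, wav_duration_seconds(out_path))
```

Running measure_rtf at several total_steps settings on your target device, and discarding the first call so model loading is not counted, would show how much speed each quality step costs.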
Verdict
Worth a look if you need offline TTS in a resource-constrained or privacy-sensitive environment. Skip it if you need the absolute best quality for a single language and don’t mind cloud APIs or larger models.