A speech lab that fits in your backpack
OpenMOSS built five specialized TTS models instead of one jack-of-all-trades, then made them run on CPUs and 8GB GPUs.

What it does
The MOSS-TTS Family is a collection of five production-ready speech and sound generation models from MOSI.AI and the OpenMOSS team. Each model handles a specific slice of audio generation: flagship voice cloning, multi-speaker dialogue, voice design from text prompts, real-time streaming for voice agents, and environmental sound effects. They can be used independently or wired together into a full pipeline.
The interesting bit
Rather than chasing a single universal model, the team split the problem by capability and inference constraint. The “Delay” architecture accepts higher latency in exchange for long-context stability, the “Local” architecture targets streaming, and the “Realtime” variant keeps multi-turn conversational state across both text and acoustics. There’s even a ~100M-parameter “Nano” model that streams on four CPU cores. The PyTorch-free path (llama.cpp for the backbone, ONNX Runtime for the audio codec) is the kind of deployment pragmatism you rarely see in research releases.
Key highlights
- Five specialized models: MOSS-TTS (voice cloning), MOSS-TTSD (dialogue), MOSS-VoiceGenerator (prompt-to-voice), MOSS-TTS-Realtime (streaming agents), MOSS-SoundEffect (SFX generation)
- On-device deployment: PyTorch-free inference via llama.cpp + ONNX Runtime; 8B model fits on 8GB GPUs; Nano runs on 4 CPU cores
- Speed options: SGLang backend claims ~3× throughput for Delay architectures; MOSS-TTS-Realtime reports a time to first byte (TTFB) of 180 ms
- Fine-grained control: Pinyin, phoneme, and duration editing; explicit pause syntax [pause X.Ys]; multilingual/code-switched synthesis
- Sound effects v2.0: DiT + Flow Matching backbone, 48 kHz, up to 30 seconds
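
To make the explicit pause syntax from the list above concrete, here is a minimal Python sketch of how a script containing [pause X.Ys] markers could be split into text and silence segments before synthesis. It is not the MOSS-TTS parser: the regular expression assumes X.Y is a plain decimal number of seconds, and the names PAUSE_RE, split_pauses, and total_pause_seconds are hypothetical helpers invented for this illustration.

```python
import re

# Assumed marker shape: "[pause 1.5s]". The exact grammar accepted by MOSS-TTS
# is not given in this write-up, so treat this pattern as a guess.
PAUSE_RE = re.compile(r"\[pause (\d+(?:\.\d+)?)s\]")


def split_pauses(text):
    """Split text into ("text", str) and ("pause", seconds) segments."""
    segments = []
    pos = 0
    for match in PAUSE_RE.finditer(text):
        # Keep any non-empty text that precedes this marker.
        chunk = text[pos:match.start()].strip()
        if chunk:
            segments.append(("text", chunk))
        segments.append(("pause", float(match.group(1))))
        pos = match.end()
    # Keep any non-empty text after the last marker.
    tail = text[pos:].strip()
    if tail:
        segments.append(("text", tail))
    return segments


def total_pause_seconds(text):
    """Sum the durations of all pause markers in the text."""
    return sum(value for kind, value in split_pauses(text) if kind == "pause")
```

A helper like this is mainly useful for sanity-checking a script before sending it to the model, for example to estimate how much silence the markers add to a clip that has to fit a fixed length.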
Caveats
- The README makes strong benchmark claims (“industry-leading,” “outperformed top closed-source models”) but does not provide the actual numbers or evaluation protocols in the visible text
- Model architecture details are split across separate markdown files in subdirectories, so getting the full picture requires some digging
- The project is young and moving fast; v2.0 is already teased and the API surface may shift
Verdict
Worth a look if you’re building voice agents or audiobook pipelines, or if you need TTS that actually ships to edge devices. Skip if you want a single-model API with no assembly required.