A 100M-parameter voice cloner that runs on your laptop CPU
MOSS-TTS-Nano shrinks real-time speech synthesis down to something you can actually ship without a GPU farm.

What it does
MOSS-TTS-Nano is a 0.1B-parameter text-to-speech model that generates 48 kHz stereo audio and clones voices from a short reference clip. It speaks 20 languages, streams output with low latency, and targets deployment scenarios where “works on my MacBook Air” is the actual requirement. The project ships with PyTorch inference scripts, a FastAPI web demo, and a packaged CLI.
The interesting bit
The team went further than just “small model.” They released a fully standalone ONNX CPU version that drops the PyTorch dependency entirely at inference time, reportedly delivers nearly 2x the processing efficiency, and is said to run smoothly on a single CPU core of a MacBook Air M4. There’s also a browser extension (MOSS-TTS-Nano-Reader) that runs the ONNX model locally without a separate backend service. The architecture is a pure autoregressive pipeline: an audio tokenizer paired with a small LLM, where the LLM generates audio tokens and the tokenizer decodes them back into a waveform.
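To try the single-core claim on your own machine, you can pin ONNX Runtime to one thread. The following is a minimal sketch, assuming the `onnxruntime` package is installed. The helper names and the `model.onnx` path are placeholders for illustration, not names from the project, and the project's own loader may configure its sessions differently.

```python
def choose_providers(available, prefer_cuda=False):
    """Return an ordered provider list: CUDA first only if requested and present."""
    providers = []
    if prefer_cuda and "CUDAExecutionProvider" in available:
        providers.append("CUDAExecutionProvider")
    providers.append("CPUExecutionProvider")  # always keep CPU as the fallback
    return providers


def make_session(model_path, prefer_cuda=False, threads=1):
    """Open an ONNX Runtime session limited to `threads` intra-op CPU threads."""
    import onnxruntime as ort  # imported lazily so choose_providers has no dependency

    opts = ort.SessionOptions()
    opts.intra_op_num_threads = threads  # threads used inside a single operator
    opts.inter_op_num_threads = 1        # only matters in parallel execution mode
    providers = choose_providers(ort.get_available_providers(), prefer_cuda)
    return ort.InferenceSession(model_path, sess_options=opts, providers=providers)


# Hypothetical usage; "model.onnx" is a placeholder path:
# session = make_session("model.onnx", threads=1)
# print([inp.name for inp in session.get_inputs()])
```

Listing the CPU provider last means the session still loads on machines without CUDA, which matches the "optional GPU" framing of the project.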
Key highlights
- 0.1B parameters, 48 kHz 2-channel output, 20 languages including Chinese, English, Japanese, and Arabic
- Voice cloning from reference audio with automatic chunking for long inputs
- ONNX CPU version removes the PyTorch dependency; optional CUDA backend if you have a GPU
- CLI commands moss-tts-nano generate and moss-tts-nano serve for terminal and server use
- Finetuning code released; models auto-download from Hugging Face on first run
- Also supports mlx-audio for Apple Silicon users
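The automatic chunking for long inputs listed above can be pictured roughly as follows. This is an illustrative sketch, not the project's implementation: `chunk_text` and its `max_chars` limit are hypothetical, and the real splitter may work on tokens rather than characters.

```python
import re

# Split after Latin sentence punctuation followed by whitespace,
# or directly after CJK sentence punctuation (which has no trailing space).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])\s*")


def chunk_text(text, max_chars=200):
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    Sentences are rejoined with a single space. A sentence longer than
    max_chars is kept as its own oversized chunk rather than being cut.
    """
    sentences = [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)   # current chunk is full, start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    for piece in chunk_text("One. Two. Three.", max_chars=9):
        print(piece)
```

Splitting on sentence boundaries rather than at a fixed character offset keeps each chunk a complete utterance, which usually matters for prosody when the pieces are synthesized separately and concatenated.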
Caveats
- Setup can trip on WeTextProcessing/pynini dependencies; the README documents a workaround and points to a community-tested wheel (Issue #6)
- The “2x faster” and “single core” claims are self-reported benchmarks from the team, not independent verification
- MOSS-TTS 2.0 is teased as coming soon, so the current version may be a stepping stone
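Because the speed figures are self-reported, it is worth measuring on your own hardware. Below is a minimal sketch of a real-time-factor check. `synthesize` is a hypothetical stand-in for whatever call produces audio in your setup, assumed here to return the number of samples per channel it generated; `fake_synthesize` exists only so the script runs without the model.

```python
import time

SAMPLE_RATE = 48_000  # Hz, matching the model's stated output rate


def real_time_factor(synthesize, text, sample_rate=SAMPLE_RATE):
    """Return wall-clock seconds spent per second of audio produced.

    Values below 1.0 mean synthesis runs faster than playback.
    """
    start = time.perf_counter()
    num_samples = synthesize(text)  # samples per channel
    elapsed = time.perf_counter() - start
    audio_seconds = num_samples / sample_rate
    if audio_seconds <= 0:
        raise ValueError("synthesize() produced no audio")
    return elapsed / audio_seconds


def fake_synthesize(text):
    """Placeholder: sleeps briefly and pretends each character yields 0.05 s of audio."""
    time.sleep(0.01)
    return int(len(text) * 0.05 * SAMPLE_RATE)


if __name__ == "__main__":
    rtf = real_time_factor(fake_synthesize, "hello world, this is a test")
    print(f"RTF: {rtf:.3f}")
```

Running the same text through the PyTorch and ONNX paths with this kind of timer, with the runtime limited to one thread, gives a like-for-like check of both the "2x" and the "single core" figures.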
Verdict
Worth a look if you need voice cloning or multilingual TTS in a resource-constrained environment — edge devices, browser extensions, or cheap VPS hosting. Skip it if you need studio-grade prosody or are already invested in a larger GPU-backed pipeline that satisfies your latency requirements.