OpenMOSS/MOSS-TTS-Nano

A 100M-parameter voice cloner that runs on your laptop CPU

MOSS-TTS-Nano shrinks real-time speech synthesis down to something you can actually ship without a GPU farm.

3.4k stars · Python · Image · Video · Audio · +58 stars/day over the last 7 days (trend: steady)

What it does

MOSS-TTS-Nano is a 0.1B-parameter text-to-speech model that generates 48 kHz stereo audio and clones voices from a short reference clip. It speaks 20 languages, streams output with low latency, and targets deployment scenarios where “works on my MacBook Air” is the actual requirement. The project ships with PyTorch inference scripts, a FastAPI web demo, and a packaged CLI.
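To put the output format in concrete terms, here is a small back-of-the-envelope sketch of what 48 kHz stereo means in raw data. The 16-bit sample width and the 20 ms chunk size are assumptions made for illustration; the project description does not state either.

```python
SAMPLE_RATE_HZ = 48_000   # from the project description
CHANNELS = 2              # stereo, from the project description
BYTES_PER_SAMPLE = 2      # assumption: 16-bit PCM


def pcm_bytes(duration_s: float) -> int:
    """Raw (uncompressed) PCM size in bytes for a clip of the given length."""
    frames = round(duration_s * SAMPLE_RATE_HZ)
    return frames * CHANNELS * BYTES_PER_SAMPLE


def chunk_frames(chunk_ms: int) -> int:
    """Number of audio frames in one streaming chunk of chunk_ms milliseconds."""
    return SAMPLE_RATE_HZ * chunk_ms // 1000


if __name__ == "__main__":
    print(pcm_bytes(1.0))    # bytes in one second of audio
    print(chunk_frames(20))  # frames in a hypothetical 20 ms chunk
```

Under these assumptions one second of output is 192,000 bytes of PCM, so a streaming server has to sustain roughly 188 KiB/s per listener before any compression.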

The interesting bit

The team went further than just “small model.” They released a fully standalone ONNX CPU version that drops the PyTorch dependency entirely at inference time, reportedly delivers nearly 2x the processing efficiency, and is said to run smoothly on a single CPU core of a MacBook Air M4. There’s even a browser extension (MOSS-TTS-Nano-Reader) that runs the ONNX model locally without a separate backend service. The architecture is a pure autoregressive pipeline: a small LLM predicts audio tokens one step at a time, and an audio tokenizer decodes those tokens back into a waveform.
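The shape of an autoregressive token pipeline can be sketched with toy stand-ins. Nothing below is the project's real code or API: `toy_next_token`, `generate_audio_tokens`, `toy_decode`, the token range, and the stop rule are all hypothetical, and the "model" just draws random tokens. The point is only the control flow: generate one token at a time, feed the history back in, stop at an end marker, then decode.

```python
import random
from typing import Callable, List

EOS = 0  # hypothetical end-of-sequence token id


def toy_next_token(text_ids: List[int], audio_ids: List[int],
                   rng: random.Random) -> int:
    """Stand-in for the LLM: a real model would score every possible token
    given the text and the audio tokens generated so far."""
    if len(audio_ids) >= 4 * len(text_ids):  # arbitrary toy stop rule
        return EOS
    return rng.randint(1, 1023)


def generate_audio_tokens(text_ids: List[int],
                          next_token: Callable = toy_next_token,
                          max_tokens: int = 256,
                          seed: int = 0) -> List[int]:
    """Autoregressive loop: each step sees all previously generated tokens."""
    rng = random.Random(seed)
    audio_ids: List[int] = []
    for _ in range(max_tokens):
        tok = next_token(text_ids, audio_ids, rng)
        if tok == EOS:
            break
        audio_ids.append(tok)
    return audio_ids


def toy_decode(audio_ids: List[int], samples_per_token: int = 4) -> List[float]:
    """Stand-in for the tokenizer's decoder: maps each token to a few
    samples in [-1, 1]. A real decoder is a neural network."""
    wave: List[float] = []
    for tok in audio_ids:
        value = (tok / 1023.0) * 2.0 - 1.0
        wave.extend([value] * samples_per_token)
    return wave


if __name__ == "__main__":
    text_ids = [ord(c) for c in "hello"]
    tokens = generate_audio_tokens(text_ids)
    print(len(tokens), len(toy_decode(tokens)))
```

Because each step depends on the previous ones, generation is inherently sequential, which is one reason a small model matters so much for CPU latency.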

Key highlights

  • 0.1B parameters, 48 kHz 2-channel output, 20 languages including Chinese, English, Japanese, Arabic, and others
  • Voice cloning from reference audio with automatic chunking for long inputs
  • ONNX CPU version removes the PyTorch dependency; optional CUDA backend if you have a GPU
  • CLI commands moss-tts-nano generate and moss-tts-nano serve for terminal and server use
  • Finetuning code released; models auto-download from Hugging Face on first run
  • Also supports mlx-audio for Apple Silicon users
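For a sense of what chunking long inputs can look like, here is a minimal sketch that splits text at sentence boundaries and packs sentences into chunks under a character budget. This is an illustrative approach, not the project's actual chunking logic; `chunk_text` and `max_chars` are hypothetical names.

```python
import re
from typing import List

# Latin punctuation must be followed by whitespace (so "3.14" is not split);
# CJK full stops split even without a following space.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])")


def chunk_text(text: str, max_chars: int = 200) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is kept as its own oversized
    chunk rather than being cut mid-sentence.
    """
    sentences = [s.strip() for s in _SENTENCE_BOUNDARY.split(text) if s.strip()]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    for chunk in chunk_text("One. Two. Three.", max_chars=9):
        print(chunk)
```

Each chunk would then be synthesized separately and the audio concatenated. Splitting on sentence boundaries keeps the joins at natural pauses; note that this sketch rejoins sentences with a space, which is not ideal for CJK text.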

Caveats

  • Setup can trip on WeTextProcessing/pynini dependencies; the README documents a workaround and points to a community-tested wheel (Issue #6)
  • The “2x faster” and “single core” claims come from the team’s own benchmarks and have not been independently verified
  • MOSS-TTS 2.0 is teased as coming soon, so the current version may be a stepping stone
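Since the speed claims are self-reported, it may be worth measuring on your own hardware. A common metric is the real-time factor (RTF): synthesis time divided by the duration of the audio produced, where values below 1.0 mean faster than real time. The sketch below assumes a hypothetical `synthesize` callable that returns interleaved stereo samples; it is a placeholder for whatever inference entry point you end up using, not a function from the project.

```python
import time
from typing import Callable, Sequence


def real_time_factor(synthesize: Callable[[str], Sequence[float]],
                     text: str,
                     sample_rate_hz: int = 48_000,
                     channels: int = 2) -> float:
    """Wall-clock synthesis time divided by seconds of audio produced.

    Assumes synthesize() returns interleaved samples, i.e.
    len(samples) == frames * channels.
    """
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    audio_seconds = len(samples) / (sample_rate_hz * channels)
    if audio_seconds == 0:
        raise ValueError("synthesize() returned no audio")
    return elapsed / audio_seconds


def fake_synthesize(text: str) -> Sequence[float]:
    """Placeholder engine: sleeps briefly, then returns 1 s of silence."""
    time.sleep(0.05)
    return [0.0] * (48_000 * 2)


if __name__ == "__main__":
    rtf = real_time_factor(fake_synthesize, "hello")
    print(f"RTF: {rtf:.3f}")
```

Running the same text through both the PyTorch and ONNX paths, several times each and after a warm-up run, would give a rough local check of the reported speedup.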

Verdict

Worth a look if you need voice cloning or multilingual TTS in a resource-constrained environment: edge devices, browser extensions, or cheap VPS hosting. Skip it if you need studio-grade prosody or are already invested in a larger GPU-backed pipeline that satisfies your latency requirements.
