A 0.1B-parameter model that listens, sees, and speaks—trained on one GPU
MiniMind-O is a from-scratch Omni implementation small enough to train on its mini dataset in ~2 hours on a single RTX 3090. It is aimed at developers who want to understand the full pipeline rather than download a black box.

What it does
MiniMind-O is an end-to-end multimodal model that accepts text, speech, or images and outputs both text and streaming speech. The backbone is just ~0.1B parameters (115M), with an MoE variant at ~0.3B. It avoids the typical ASR→LLM→TTS cascade by routing speech and text through shared hidden states, then generating audio directly via multi-token prediction of Mimi codec layers.
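
To make the multi-token prediction idea concrete, here is a minimal, framework-free sketch in plain Python: one shared hidden state feeds several independent heads, and each head picks one token for one codec layer, so a full audio frame comes out of a single step. The layer count, codebook size, hidden width, and the names `make_heads` and `predict_frame` are illustrative assumptions, not values or identifiers from MiniMind-O, and the random weights stand in for trained ones.

```python
import random

# All three sizes are toy assumptions chosen for readability.
NUM_CODEBOOKS = 8    # codec layers predicted per audio frame
CODEBOOK_SIZE = 16   # entries per layer (real codecs use far more)
HIDDEN_DIM = 4       # width of the shared hidden state


def make_heads(seed=0):
    """Build one random weight matrix (CODEBOOK_SIZE x HIDDEN_DIM) per codec layer."""
    rng = random.Random(seed)
    return [
        [[rng.uniform(-1.0, 1.0) for _ in range(HIDDEN_DIM)]
         for _ in range(CODEBOOK_SIZE)]
        for _ in range(NUM_CODEBOOKS)
    ]


def predict_frame(hidden, heads):
    """Map one hidden state to one greedy token id per codec layer."""
    frame = []
    for head in heads:
        # Each row of the head scores one codebook entry against the hidden state.
        logits = [sum(w * h for w, h in zip(row, hidden)) for row in head]
        # Greedy choice: index of the highest-scoring entry.
        frame.append(max(range(len(logits)), key=logits.__getitem__))
    return frame
```

Calling `predict_frame([0.1, -0.2, 0.3, 0.4], make_heads())` returns a list of eight token ids, one per layer. The project does this with PyTorch tensors; the plain-list version only shows the shape of the computation.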
The interesting bit
The architecture splits into a “Thinker” path (understanding and text generation) and a “Talker” path (streaming speech synthesis), with VAD support for real-time barge-in and near-duplex interaction. The author claims this is the smallest fully open-source Omni implementation, and the educational intent is explicit: every core algorithm is written in raw PyTorch without high-level framework abstractions.
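
Barge-in can be illustrated with a simple energy-threshold detector: while speech is being played back, incoming microphone frames are checked, and a short run of loud frames signals that the user has started talking. This is a sketch under stated assumptions, not the project's VAD; the threshold, the run length, and the function names are hypothetical.

```python
import math


def rms(frame):
    """Root-mean-square energy of one frame of float samples in [-1, 1]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def barge_in_frame(frames, threshold=0.1, min_voiced=3):
    """Return the index of the frame where playback should stop, or None.

    A frame counts as voiced when its RMS energy reaches `threshold`.
    Playback is interrupted only after `min_voiced` voiced frames in a row.
    """
    run = 0
    for i, frame in enumerate(frames):
        run = run + 1 if rms(frame) >= threshold else 0
        if run >= min_voiced:
            return i
    return None
```

In a real loop the returned index would mark the point at which speech playback is cut. Requiring several consecutive voiced frames is a common way to avoid stopping on a click or a cough.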
Key highlights
- Single RTX 3090, ~2 hours to train on the mini dataset through the full Thinker–Talker pipeline
- Raw PyTorch implementation of projectors, MTP audio heads, and training loops; no dependency on trainer frameworks
- Ships with mini and full datasets, pretrained encoders (SenseVoice-Small, SigLIP2, Mimi), and a Gradio WebUI with voice-cloning and “phone mode”
- Compatible with both native PyTorch weights and HuggingFace transformers format
- Published technical report with architecture diagrams, training curves, and CER/WER evaluations
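
For readers unfamiliar with the CER/WER figures mentioned above, both are an edit distance divided by the reference length, computed over characters or words respectively. A minimal sketch follows; the function names are illustrative, and the report's own evaluation may normalize text differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance normalized by the number of reference units."""
    if not ref_units:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def cer(reference, hypothesis):
    """Character error rate."""
    return error_rate(list(reference), list(hypothesis))


def wer(reference, hypothesis):
    """Word error rate, splitting on whitespace."""
    return error_rate(reference.split(), hypothesis.split())
```

Both rates can exceed 1.0 when the hypothesis is much longer than the reference, since insertions count as errors.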
Caveats
- The 0.1B scale is explicitly educational; quality and capability tradeoffs against larger Omni models are expected
- WebUI setup requires manually copying the transformers-format model into ./scripts/ before launch
- Training data and several encoder weights are hosted on ModelScope with HuggingFace mirrors; access may vary by region
Verdict
Grab this if you want to trace every tensor in an Omni model from __init__ to forward(), or if you need a hackable baseline for voice+vision experiments. Skip it if you need production-grade speech quality or are looking for a drop-in API replacement for GPT-4o.