jingyaogong/minimind-o

A 0.1B-parameter model that listens, sees, and speaks—trained on one GPU

MiniMind-O is a from-scratch Omni implementation small enough to train in ~2 hours on a single RTX 3090, designed for developers who want to understand the full pipeline rather than download a black box.

Velocity (7d): +47 ★/day · Trend: steady

What it does

MiniMind-O is an end-to-end multimodal model that accepts text, speech, or images and outputs both text and streaming speech. The backbone is just ~0.1B parameters (115M), with an MoE variant at ~0.3B. It avoids the typical ASR→LLM→TTS cascade by routing speech and text through shared hidden states, then generating audio directly via multi-token prediction of Mimi codec layers.
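
To make "multi-token prediction of Mimi codec layers" more concrete, here is a minimal sketch in plain Python (the repository itself is written in PyTorch). It assumes one linear head per codec layer, all reading the same hidden state, so that a single decoding step yields one token per layer. The names `make_heads` and `predict_codec_tokens` and the toy sizes (8 layers, 16-dim hidden state, 32-entry codebooks) are illustrative assumptions, not values taken from the project.

```python
import random

def make_heads(num_layers, hidden_dim, codebook_size, seed=0):
    """Build one random weight matrix (codebook_size x hidden_dim) per codec layer."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(hidden_dim)]
         for _ in range(codebook_size)]
        for _ in range(num_layers)
    ]

def predict_codec_tokens(hidden, heads):
    """Greedy multi-token prediction: every head scores its own codebook
    from the same hidden state and picks the highest-scoring token id."""
    tokens = []
    for weights in heads:
        logits = [sum(w * h for w, h in zip(row, hidden)) for row in weights]
        tokens.append(max(range(len(logits)), key=logits.__getitem__))
    return tokens

# Toy sizes (assumed for illustration only).
heads = make_heads(num_layers=8, hidden_dim=16, codebook_size=32)
rng = random.Random(1)
hidden_state = [rng.gauss(0.0, 1.0) for _ in range(16)]

frame_tokens = predict_codec_tokens(hidden_state, heads)
print(frame_tokens)  # one token id per codec layer for this audio frame
```

A real model would sample rather than take the argmax and might condition later layers on earlier ones; the sketch only shows why one step can emit several codec tokens at once.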

The interesting bit

The architecture splits into a “Thinker” path (understanding and text generation) and a “Talker” path (streaming speech synthesis), with VAD support for real-time barge-in and near-duplex interaction. The author claims this is the smallest fully open-source Omni implementation, and the educational intent is explicit: every core algorithm is written in raw PyTorch without high-level framework abstractions.
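
Barge-in can be illustrated with a toy energy-based voice activity detector. This is a sketch under assumptions: the summary does not say which VAD the project uses, and `frame_energy`, `detect_barge_in`, the threshold, and the minimum frame count are all hypothetical. The idea is that while the Talker streams audio, sustained speech energy on the microphone signals that playback should stop.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame of float samples in [-1, 1]."""
    return sum(s * s for s in frame) / len(frame) if frame else 0.0

def detect_barge_in(frames, threshold=0.01, min_speech_frames=3):
    """Return the index of the frame at which barge-in is declared, or None.

    Barge-in fires once `min_speech_frames` consecutive frames exceed the
    energy threshold, which filters out isolated clicks.
    """
    consecutive = 0
    for index, frame in enumerate(frames):
        if frame_energy(frame) > threshold:
            consecutive += 1
            if consecutive >= min_speech_frames:
                return index
        else:
            consecutive = 0
    return None

# Synthetic microphone input: silence, a single loud click, then speech.
silence = [0.0] * 160
click = [0.5] * 160
speech = [0.3, -0.3] * 80

mic_frames = [silence, click, silence, speech, speech, speech, speech]
stop_at = detect_barge_in(mic_frames)
print(stop_at)  # → 5 (third consecutive speech frame)
```

Requiring several consecutive active frames trades a little latency for fewer false interruptions, which matters when the model's own speech may leak back into the microphone.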

Key highlights

  • Single RTX 3090, ~2 hours to train the mini dataset through the full Thinker–Talker pipeline
  • Raw PyTorch implementation of projectors, MTP audio heads, and training loops; no dependency on trainer frameworks
  • Ships with mini and full datasets, pretrained encoders (SenseVoice-Small, SigLIP2, Mimi), and a Gradio WebUI with voice-cloning and “phone mode”
  • Compatible with both native PyTorch weights and HuggingFace transformers format
  • Published technical report with architecture diagrams, training curves, and CER/WER evaluations
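
For readers unfamiliar with the CER/WER metrics mentioned in the last highlight: both are an edit distance divided by the reference length, computed over characters and words respectively. The sketch below is a generic implementation, not the evaluation script behind the technical report; `edit_distance`, `wer`, and `cer` are illustrative names.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from the reference
                curr[j - 1] + 1,     # insertion into the reference
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)

# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
print(wer("the cat sat on the mat", "the cat sit on mat"))
print(cer("abcd", "abed"))  # → 0.25
```

Published evaluations usually normalize case and punctuation before scoring, so numbers from this sketch are only comparable to a report's figures if the same normalization is applied.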

Caveats

  • The 0.1B scale is explicitly educational; expect lower quality and narrower capability than larger Omni models
  • WebUI setup requires manually copying the transformers-format model into ./scripts/ before launch
  • Training data and several encoder weights are hosted on ModelScope with HuggingFace mirrors; access may vary by region
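
The manual copy step in the second caveat could be scripted along these lines. This is a sketch only: `stage_model_for_webui` is a hypothetical helper, the model path in the usage comment is a placeholder, and the exact layout the WebUI expects should be checked against the repository's README.

```python
import shutil
from pathlib import Path

def stage_model_for_webui(model_dir, scripts_dir="./scripts"):
    """Copy a transformers-format model directory into scripts_dir.

    Returns the destination path. An existing copy is left untouched.
    """
    source = Path(model_dir)
    if not source.is_dir():
        raise FileNotFoundError(f"model directory not found: {source}")
    destination = Path(scripts_dir) / source.name
    if not destination.exists():
        # copytree creates scripts_dir as well if it does not exist yet
        shutil.copytree(source, destination)
    return destination

# Usage (placeholder path, replace with your downloaded model directory):
# stage_model_for_webui("path/to/transformers-format-model")
```

Skipping the copy when the destination already exists keeps repeated launches cheap, at the cost of not picking up a re-downloaded model unless the old copy is removed first.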

Verdict

Grab this if you want to trace every tensor in an Omni model from __init__ to forward(), or if you need a hackable baseline for voice+vision experiments. Skip it if you need production-grade speech quality or are looking for a drop-in API replacement for GPT-4o.
