A 64M-parameter LLM you can train from scratch for pocket change
MiniMind strips away framework magic so you actually see how transformers work, end to end.

What it does
MiniMind is a tiny language model (64M parameters, roughly 1/2700th the size of GPT-3) that you can train from scratch on a single GPU. The repo ships the full stack: tokenizer training, pretraining, supervised fine-tuning, LoRA, DPO, PPO/GRPO/CISPO reinforcement learning, tool use, even agentic multi-turn RL. Everything is implemented in raw PyTorch rather than through the transformers, trl, or peft abstractions.
The interesting bit
The author’s bet is that “using Lego to build a plane beats flying first class.” Most tutorials have you fine-tuning someone else’s giant model with ten lines of code; MiniMind makes you write the attention, the RoPE, the loss, and the rollout engine yourself. The “2 hours / 3 dollars” claim refers specifically to one SFT epoch on a rented RTX 3090—cheap enough to treat as a disposable science experiment.
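To give a feel for what "write the attention yourself" involves, here is a minimal sketch of single-head causal scaled dot-product attention. It is an illustration under stated assumptions, not MiniMind's code: the project works in PyTorch, while this version uses plain Python lists so it runs with the standard library alone. The names `softmax` and `causal_attention` are hypothetical, and a real implementation would add batching, multiple heads, and RoPE.

```python
import math

def softmax(xs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(q, k, v):
    """Single-head causal attention over lists of T vectors (hypothetical helper)."""
    d = len(q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for t in range(len(q)):
        # Causal mask: position t may only attend to positions 0..t.
        scores = [scale * sum(a * b for a, b in zip(q[t], k[s]))
                  for s in range(t + 1)]
        weights = softmax(scores)
        # Output is the attention-weighted average of the visible value vectors.
        out.append([sum(w * v[s][j] for s, w in enumerate(weights))
                    for j in range(len(v[0]))])
    return out

# Toy inputs: 3 positions, dimension 2 (made-up numbers).
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = causal_attention(q, k, v)
```

The first position can only see itself, so its output is exactly its own value vector; later positions return a weighted mix of everything up to and including themselves.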
Key highlights
- Dense and MoE architectures (current release aligns with Qwen3-style structure)
- Raw-PyTorch implementations of LoRA, DPO, PPO, GRPO, YaRN long-context extension, and model distillation
- Compatible with llama.cpp, vllm, ollama, and OpenAI-style API servers for inference
- Streamlit web UI with tool-calling and chain-of-thought display
- Spin-off repos add vision (MiniMind-V), omni (MiniMind-O), diffusion, and linear-attention variants
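As one example of how small these "raw" implementations can be, the standard DPO objective for a single preference pair fits in a few lines. The sketch below is not taken from the repo: `dpo_loss` is a hypothetical name, the inputs are assumed to be summed sequence log-probabilities of the chosen and rejected responses under the policy and a frozen reference model, and `beta=0.1` is an assumed illustrative value. It uses only the standard library rather than PyTorch.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * margin)). Hypothetical helper."""
    # How much more the policy favors the chosen answer than the reference does.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    x = beta * margin
    # -log(sigmoid(x)) == softplus(-x), written in a numerically stable form.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))

# Made-up log-probabilities for illustration.
neutral = dpo_loss(-12.0, -12.0, -12.0, -12.0)  # policy identical to reference
better = dpo_loss(-10.0, -14.0, -12.0, -12.0)   # policy leans toward chosen
worse = dpo_loss(-14.0, -10.0, -12.0, -12.0)    # policy leans toward rejected
```

When the policy matches the reference the margin is zero and the loss is log 2; shifting probability toward the chosen response pushes the loss below that, and shifting it toward the rejected response pushes it above.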
Caveats
- The “2 hours” headline is SFT-only; full pretraining from random weights takes longer (the README is vague on exactly how much)
- A breaking re-architecture in April 2025 abandoned the v1 model series; older checkpoints won’t load without weight remapping
- MoE and agentic RL code is present but the README offers no training cost or quality benchmarks for those modes
Verdict
Grab this if you want to touch the actual math behind LLMs without renting a GPU cluster. Skip it if you need a production model; 64M parameters is a toy, and the value is educational, not competitive.