A 1.3B vision model that outruns smaller rivals on your phone
MiniCPM-V shrinks multimodal understanding to pocket scale with aggressive visual token compression and full-duplex audio-video streaming.

What it does
MiniCPM-V is a family of multimodal language models built for edge deployment. The flagship MiniCPM-V 4.6 handles image, video, and text understanding in 1.3B parameters, while MiniCPM-o 4.5 adds real-time audio input, speech output, and full-duplex streaming conversation at 9B. Both target phones and low-power devices rather than data-center GPUs.
The interesting bit
The compression strategy is the real engineering story. MiniCPM-V 4.6 uses intra-ViT early compression from LLaVA-UHD v4 to cut visual encoding cost by over 50%, with mixed 4x/16x token compression rates chosen per-task. The team also ships open-source edge adaptation code for iOS, Android, and HarmonyOS—rarely seen from research labs.
Key highlights
- 1.3B-parameter MiniCPM-V 4.6 claims ~1.5x token throughput vs. Qwen3.5-0.8B despite being larger
- MiniCPM-o 4.5 supports full-duplex streaming: simultaneous video/audio input with speech/text output, plus proactive interactions like reminders
- Public free API key available for MiniCPM-V 4.6; hosted API also covers MiniCPM-o 4.5
- Official support in llama.cpp, vLLM, LLaMA-Factory; Ollama and SGLang integration in progress
- Real-time web demo deployable locally on Mac or GPU
Caveats
- Web demo may experience latency issues due to network conditions; Docker image for local deployment still pending as of the README date
- Some framework integrations (Ollama, SGLang for newer models) require using project forks until upstream PRs merge
Verdict
Mobile and embedded developers who need on-device vision understanding without cloud round-trips should evaluate this seriously. If you’re only running batch inference on A100s, the efficiency tricks are less relevant—though the full-duplex streaming in MiniCPM-o 4.5 may still interest real-time application builders.