← all repositories
OpenBMB/MiniCPM-V

A 1.3B vision model that outruns smaller rivals on your phone

MiniCPM-V shrinks multimodal understanding to pocket scale with aggressive visual token compression and full-duplex audio-video streaming.

MiniCPM-V
Velocity · 7d
+30
★ / day
Trend
steady
star history

What it does

MiniCPM-V is a family of multimodal language models built for edge deployment. The flagship MiniCPM-V 4.6 handles image, video, and text understanding in 1.3B parameters, while MiniCPM-o 4.5 adds real-time audio input, speech output, and full-duplex streaming conversation at 9B. Both target phones and low-power devices rather than data-center GPUs.

The interesting bit

The compression strategy is the real engineering story. MiniCPM-V 4.6 uses intra-ViT early compression from LLaVA-UHD v4 to cut visual encoding cost by over 50%, with mixed 4x/16x token compression rates chosen per-task. The team also ships open-source edge adaptation code for iOS, Android, and HarmonyOS—rarely seen from research labs.

Key highlights

  • 1.3B-parameter MiniCPM-V 4.6 claims ~1.5x token throughput vs. Qwen3.5-0.8B despite being larger
  • MiniCPM-o 4.5 supports full-duplex streaming: simultaneous video/audio input with speech/text output, plus proactive interactions like reminders
  • Public free API key available for MiniCPM-V 4.6; hosted API also covers MiniCPM-o 4.5
  • Official support in llama.cpp, vLLM, LLaMA-Factory; Ollama and SGLang integration in progress
  • Real-time web demo deployable locally on Mac or GPU

Caveats

  • Web demo may experience latency issues due to network conditions; Docker image for local deployment still pending as of the README date
  • Some framework integrations (Ollama, SGLang for newer models) require using project forks until upstream PRs merge

Verdict

Mobile and embedded developers who need on-device vision understanding without cloud round-trips should evaluate this seriously. If you’re only running batch inference on A100s, the efficiency tricks are less relevant—though the full-duplex streaming in MiniCPM-o 4.5 may still interest real-time application builders.

heatdrop uses Google Analytics to see which pages get read — nothing else. Your call. How we handle data.