unum-cloud/UForm

Multimodal AI that fits in your pocket, not a data center

UForm shrinks image-text embeddings and chat models down to 79M–1.5B parameters so they run on phones, edge devices, and cheap VPSes without torching your latency budget.


What it does

UForm is a family of compact multimodal models from Unum Cloud. The embedding lineup (79M to 365M params) maps images and text into a shared vector space for search and clustering across 21 languages. The generative side (1.2B–1.5B params) handles image captioning, visual Q&A, and chat. Everything ships with ONNX and CoreML exports, so you can bail on PyTorch at deployment time.
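Because images and text land in the same space, cross-modal search comes down to ranking vectors by similarity. The sketch below assumes you already have the embeddings. The vectors are made-up 4-dimensional stand-ins for real encoder outputs, and `search` is a hypothetical helper, not part of UForm's API.

```python
import math

# Made-up stand-ins for image embeddings; real ones would come from UForm's image encoder.
IMAGE_EMBEDDINGS = {
    "cat.jpg": [0.9, 0.1, 0.0, 0.1],
    "beach.jpg": [0.0, 0.2, 0.9, 0.3],
    "car.jpg": [0.1, 0.9, 0.1, 0.0],
}


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query: list[float], corpus: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the k corpus keys whose vectors are most similar to the query."""
    ranked = sorted(corpus, key=lambda name: cosine(query, corpus[name]), reverse=True)
    return ranked[:k]


# Hypothetical text embedding for a query such as "a sleeping cat".
text_query = [0.8, 0.2, 0.1, 0.0]
print(search(text_query, IMAGE_EMBEDDINGS, k=1))  # → ['cat.jpg']
```

A brute-force scan like this is fine for a few thousand items. Beyond that you would hand the vectors to an approximate index such as USearch.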

The interesting bit

The embedding models are built for aggressive quantization, from f32 through f16 and i8 down to single-bit binary, with Matryoshka-style slicing that lets you search 64-dimensional “tiny” embeddings first and drill down only when needed. The authors claim a 2–4× inference speedup over CLIP/LLaVA, and the quantization-aware design is meant to let you cash in on the smaller representations without your recall falling off a cliff.
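Here is a minimal sketch of the "tiny first, drill down later" idea. It shortlists candidates using only the leading 64 components, then re-ranks that shortlist with the full vector. It assumes the embeddings were trained Matryoshka-style, so the leading dimensions carry most of the signal. The random vectors in the demo have no such property and only exercise the mechanics. `coarse_to_fine`, `shortlist`, and `coarse_dims` are illustrative names, not UForm parameters.

```python
import math
import random


def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def truncate(v: list[float], dims: int) -> list[float]:
    """Matryoshka slice: keep the leading `dims` components, then re-normalize."""
    return normalize(v[:dims])


def coarse_to_fine(
    query: list[float],
    corpus: list[list[float]],
    coarse_dims: int = 64,
    shortlist: int = 10,
    k: int = 3,
) -> list[int]:
    """Shortlist candidates on sliced vectors, then re-rank only them on full vectors."""
    # Cheap pass in the small prefix space (a real index would precompute these).
    small_corpus = [truncate(v, coarse_dims) for v in corpus]
    q_small = truncate(query, coarse_dims)
    candidates = sorted(
        range(len(corpus)), key=lambda i: dot(q_small, small_corpus[i]), reverse=True
    )[:shortlist]
    # Expensive pass: full-width cosine similarity, but only for the shortlist.
    q_full = normalize(query)
    reranked = sorted(
        candidates, key=lambda i: dot(q_full, normalize(corpus[i])), reverse=True
    )
    return reranked[:k]


# Demo on random 768-dim vectors (not real UForm outputs): the query is a
# slightly perturbed copy of item 42, so item 42 should be ranked first.
random.seed(0)
corpus = [[random.gauss(0, 1) for _ in range(768)] for _ in range(200)]
query = [x + random.gauss(0, 0.01) for x in corpus[42]]
print(coarse_to_fine(query, corpus))
```

The design tradeoff is the shortlist size. If it is too small, the coarse pass can drop true neighbors. If it is too large, most of the savings from slicing disappear.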

Key highlights

  • Embedding models from 64 to 768 dims, generative chat models at 1.2B–1.5B params
  • Native ONNX/CoreML support; Swift, JS, and Python bindings
  • Matryoshka embeddings: slice a 768-dim vector down to 64 for coarse search
  • Quantization recipes provided (f16, i8, binary) with recall tradeoffs noted
  • Multilingual base model covers 21 languages; English variants for smaller footprint
  • Tight integration with USearch vector DB and SimSIMD distance kernels
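To see why the i8 and binary recipes matter for footprint, consider a 768-dim f32 vector. It takes 3072 bytes. As i8 it takes 768 bytes, and as packed bits only 96. The generic sketch below shows scalar i8 quantization plus sign-based binary quantization compared with Hamming distance. It is not UForm's or USearch's actual recipe. It assumes unit-normalized inputs, and the helper names are hypothetical.

```python
import math


def normalize(v: list[float]) -> list[float]:
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def to_int8(v: list[float]) -> list[int]:
    """Scalar quantization, assuming v is unit-normalized so components lie in [-1, 1]."""
    return [max(-127, min(127, round(x * 127))) for x in v]


def to_binary(v: list[float]) -> int:
    """Sign quantization: bit i is 1 when component i is positive, packed into an int."""
    bits = 0
    for i, x in enumerate(v):
        if x > 0:
            bits |= 1 << i
    return bits


def hamming(a: int, b: int) -> int:
    """Count differing bits; a smaller distance means more similar sign patterns."""
    return bin(a ^ b).count("1")


# Toy 4-dim vectors standing in for real embeddings.
query = normalize([0.5, -0.2, 0.1, -0.7])
docs = {
    "near": normalize([0.4, -0.1, -0.3, -0.6]),
    "far": normalize([-0.5, 0.3, -0.2, 0.6]),
}
q_bits = to_binary(query)
ranked = sorted(docs, key=lambda name: hamming(q_bits, to_binary(docs[name])))
print(ranked)  # → ['near', 'far']
print(to_int8(query))
```

Binary codes throw away magnitude, so a common pattern is to use them for a fast first pass and re-score the top hits with higher-precision vectors. How much recall that costs is exactly the tradeoff UForm's documentation reports.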

Caveats

  • The “2–4× faster” and “5× faster than CLIP” claims are stated in the README without independent verification; check BENCHMARKS.md before sizing infrastructure around them
  • Video and long-document support are marked 🔜, i.e. not shipped yet
  • JavaScript and Swift docs for generative models are also 🔜
  • The original uform-gen model carries a ⚠️ warning; gen2 is the current path

Verdict

Grab this if you’re building image search, product recommendations, or lightweight vision+LLM features where GPU clusters are overkill. Skip it if you need SOTA reasoning on complex multi-image prompts or guaranteed support for video/long docs today.
