cactus-compute/needle

26M parameters, one job: call your functions

A distilled Gemini 3.1 that fits on watches and glasses, finetunable on a laptop.

Star velocity (7d): +25 ★/day · Trend: steady

What it does

Needle is a 26-million-parameter encoder-decoder transformer distilled from Gemini 3.1. It takes a natural-language query plus a JSON schema of available tools, and emits a structured function call. That’s it. No chat, no creative writing — just “turn off the lights” → {"name":"toggle_lights","arguments":{"state":"off"}}.
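The query-plus-schema contract described above can be sketched in a few lines. This is only an illustration: the tool-list layout (a JSON-Schema-like `parameters` object) and the `validate_call` helper are assumptions, not taken from the Needle repository. The sketch checks that an emitted call names a known tool and supplies only allowed arguments.

```python
import json

# Hypothetical tool list; the exact schema layout Needle expects is an assumption.
TOOLS = [
    {
        "name": "toggle_lights",
        "parameters": {
            "type": "object",
            "properties": {"state": {"type": "string", "enum": ["on", "off"]}},
            "required": ["state"],
        },
    }
]


def validate_call(raw_output: str, tools: list[dict]) -> dict:
    """Parse a model's raw JSON output and check it against the tool list."""
    call = json.loads(raw_output)
    by_name = {tool["name"]: tool for tool in tools}
    if call.get("name") not in by_name:
        raise ValueError(f"unknown tool: {call.get('name')!r}")
    params = by_name[call["name"]]["parameters"]
    args = call.get("arguments", {})
    # Every required argument must be present.
    missing = [key for key in params.get("required", []) if key not in args]
    if missing:
        raise ValueError(f"missing arguments: {missing}")
    # Every supplied argument must be declared and respect any enum constraint.
    for key, value in args.items():
        spec = params["properties"].get(key)
        if spec is None:
            raise ValueError(f"unexpected argument: {key!r}")
        if "enum" in spec and value not in spec["enum"]:
            raise ValueError(f"invalid value for {key!r}: {value!r}")
    return call


# The example output quoted above passes the check.
call = validate_call('{"name":"toggle_lights","arguments":{"state":"off"}}', TOOLS)
```

A guard like this is cheap to run on-device and catches the failure that matters most for a router: a call to a tool that does not exist.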

The interesting bit

The architecture itself is the experiment. The team calls it a “Simple Attention Network”: 12 encoder layers (no FFN, just self-attention) feed cross-attention into 8 decoder layers, with tied embeddings and a custom ZCRMSNorm. The bet is that most on-device AI doesn’t need a generalist model — it needs a reliable router, and 26M params is enough if you scope the task tightly.
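To see why dropping the FFN matters at this scale, a rough per-layer weight count helps. The sketch below assumes conventional multi-head attention (four square projection matrices) and a conventional feed-forward block with 4x expansion. The model width used is hypothetical, since the summary does not state Needle's actual dimensions, and biases and norm weights are ignored.

```python
def attention_params(d_model: int) -> int:
    """Weights in one attention block: Q, K, V and output projections (biases ignored)."""
    return 4 * d_model * d_model


def ffn_params(d_model: int, expansion: int = 4) -> int:
    """Weights in a conventional two-matrix feed-forward block (biases ignored)."""
    return 2 * expansion * d_model * d_model


D_MODEL = 256  # hypothetical width, chosen only for illustration

attention_only_layer = attention_params(D_MODEL)
standard_layer = attention_params(D_MODEL) + ffn_params(D_MODEL)

# Fraction of a standard layer's weights removed by dropping the FFN.
savings = 1 - attention_only_layer / standard_layer
print(f"attention-only: {attention_only_layer:,}  standard: {standard_layer:,}  saved: {savings:.0%}")
```

Under these assumptions the FFN accounts for two thirds of a standard layer's weights regardless of width, which is the budget an attention-only encoder gets to spend elsewhere.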

Key highlights

  • Runs at 6,000 tok/s prefill and 1,200 tok/s decode on the Cactus runtime (claimed, not independently verified)
  • Weights and training data fully open on HuggingFace
  • One-command finetuning via the needle playground web UI or the CLI; it auto-generates synthetic training data with Gemini
  • Beats FunctionGemma-270M, Qwen-0.6B, Granite-350M, and LFM2.5-350M on single-shot function calling (per their benchmarks)
  • Pretrained on 200B tokens in 27 hours on 16 TPU v6e chips; post-trained on 2B tokens in 45 minutes
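The speed and training figures above translate into some quick back-of-envelope numbers. The sketch below takes the claimed throughputs at face value; the request size (300 prompt tokens, 24 output tokens) is a hypothetical example, and fixed overheads such as model loading and tokenization are ignored.

```python
PREFILL_TOK_PER_S = 6_000   # claimed, Cactus runtime
DECODE_TOK_PER_S = 1_200    # claimed, Cactus runtime


def estimated_latency_ms(prompt_tokens: int, output_tokens: int) -> float:
    """Rough latency: prefill time plus decode time, ignoring fixed overheads."""
    seconds = prompt_tokens / PREFILL_TOK_PER_S + output_tokens / DECODE_TOK_PER_S
    return seconds * 1_000


def tokens_per_second(tokens: float, hours: float) -> float:
    """Average training throughput over a run."""
    return tokens / (hours * 3_600)


# Hypothetical request: 300 prompt tokens (query + tool schema), 24 output tokens.
latency = estimated_latency_ms(300, 24)

# Pretraining figures from the list above: 200B tokens, 27 hours, 16 chips.
pretrain_total = tokens_per_second(200e9, 27)
pretrain_per_chip = pretrain_total / 16

print(f"latency ≈ {latency:.0f} ms")
print(f"pretraining ≈ {pretrain_total:,.0f} tok/s total, {pretrain_per_chip:,.0f} tok/s per chip")
```

For that hypothetical request the estimate comes to roughly 70 ms, and the pretraining run averages about 2.06 million tokens per second, or roughly 129,000 per chip.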

Caveats

  • The README itself warns that “small models can be finicky” and overfit badly below ~120 examples per tool
  • Explicitly not a conversational model — larger models “excel in conversational settings” where this one will flounder
  • Speed claims are tied to the Cactus runtime, not standard PyTorch or llama.cpp
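Given the overfitting warning, it is worth counting examples per tool before finetuning. The sketch below is one way to do that, assuming each training record carries a `tool` field naming the target function. The record layout and the `underrepresented_tools` helper are hypothetical, not part of the Needle tooling.

```python
from collections import Counter

MIN_EXAMPLES_PER_TOOL = 120  # approximate threshold from the README warning


def underrepresented_tools(
    examples: list[dict], minimum: int = MIN_EXAMPLES_PER_TOOL
) -> dict[str, int]:
    """Return tools whose example count falls below the minimum."""
    counts = Counter(example["tool"] for example in examples)
    return {tool: n for tool, n in counts.items() if n < minimum}


# Hypothetical dataset: each record names the tool its target call uses.
dataset = (
    [{"query": "turn off the lights", "tool": "toggle_lights"}] * 150
    + [{"query": "set a timer", "tool": "set_timer"}] * 40
)
sparse = underrepresented_tools(dataset)
print(sparse)
```

Tools with no examples at all never appear in the counts, so compare the result against the full tool list as well.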

Verdict

Grab this if you’re building a smartwatch, glasses, or phone assistant that needs to route voice commands to hardcoded tools and can’t afford a 1B+ model. Skip it if you need chit-chat, open-ended reasoning, or want to run inference without the Cactus stack.
