26M parameters, one job: call your functions
A distilled Gemini 3.1 that fits on watches and glasses, finetunable on a laptop.

What it does
Needle is a 26-million-parameter encoder-decoder transformer distilled from Gemini 3.1. It takes a natural-language query plus a JSON schema of available tools, and emits a structured function call. That’s it. No chat, no creative writing — just “turn off the lights” → {"name":"toggle_lights","arguments":{"state":"off"}}.
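To make the input/output contract concrete, here is a minimal sketch of how an application might check an emitted call before executing it. The schema layout (JSON-Schema style, with a "parameters" key) and the helper name validate_call are assumptions for illustration; the exact format Needle expects is not specified here.

```python
import json

# Hypothetical tool list in a JSON-Schema-like style (an assumption, not Needle's documented format).
TOOLS = [
    {
        "name": "toggle_lights",
        "parameters": {
            "type": "object",
            "properties": {"state": {"type": "string", "enum": ["on", "off"]}},
            "required": ["state"],
        },
    }
]


def validate_call(raw: str, tools: list) -> dict:
    """Parse a model-emitted call and check it against the tool list.

    Raises ValueError if the output is not a well-formed call to a known tool.
    """
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"output is not valid JSON: {exc}") from exc
    if not isinstance(call, dict):
        raise ValueError("output must be a JSON object")

    by_name = {tool["name"]: tool for tool in tools}
    name = call.get("name")
    if name not in by_name:
        raise ValueError(f"unknown tool: {name!r}")

    params = by_name[name]["parameters"]
    args = call.get("arguments", {})
    if not isinstance(args, dict):
        raise ValueError("arguments must be a JSON object")

    # Every required argument must be present.
    missing = [key for key in params.get("required", []) if key not in args]
    if missing:
        raise ValueError(f"missing required arguments: {missing}")

    # No unknown arguments, and enum-constrained values must be allowed.
    for key, value in args.items():
        spec = params["properties"].get(key)
        if spec is None:
            raise ValueError(f"unexpected argument: {key!r}")
        if "enum" in spec and value not in spec["enum"]:
            raise ValueError(f"invalid value for {key!r}: {value!r}")
    return call


# The example output quoted above, treated as a raw string from the model.
output = '{"name":"toggle_lights","arguments":{"state":"off"}}'
call = validate_call(output, TOOLS)
```

A guard like this is cheap and useful with any small model: a malformed or out-of-schema call is rejected before it reaches device code.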
The interesting bit
The architecture itself is the experiment. The team calls it a “Simple Attention Network”: 12 encoder layers (self-attention only, no FFN) whose output is read through cross-attention by 8 decoder layers, with tied embeddings and a custom ZCRMSNorm. The bet is that most on-device AI doesn’t need a generalist model; it needs a reliable router, and 26M params is enough if you scope the task tightly.
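For a rough sense of how an attention-only stack adds up, the sketch below counts parameters for the layer layout described above. Only the layer counts (12 encoder, 8 decoder) and the tied embeddings come from the description. The model width, vocabulary size, bias-free projections, one gain vector per norm, and the absence of a decoder FFN are all assumptions, so the totals are illustrative and not Needle's actual configuration.

```python
def attention_params(d_model: int) -> int:
    # Q, K, V and output projections, each d_model x d_model, no biases (assumption).
    return 4 * d_model * d_model


def estimate_params(
    d_model: int,
    vocab_size: int,
    enc_layers: int = 12,
    dec_layers: int = 8,
    dec_ffn_mult: int = 0,  # hypothetical: 0 means no FFN in the decoder either
) -> int:
    """Rough parameter count for an attention-only encoder-decoder."""
    embed = vocab_size * d_model  # tied input/output embeddings, counted once
    norm = d_model                # one gain vector per norm (RMSNorm-style assumption)

    # Encoder block: self-attention plus one norm, no FFN.
    enc = enc_layers * (attention_params(d_model) + norm)

    # Decoder block: self-attention and cross-attention, one norm each,
    # plus an optional two-matrix FFN of hidden size dec_ffn_mult * d_model.
    ffn = 2 * d_model * (dec_ffn_mult * d_model)
    dec = dec_layers * (2 * attention_params(d_model) + 2 * norm + ffn)

    return embed + enc + dec


# Hypothetical widths and vocabulary size, chosen only to show the scaling.
for width in (384, 448, 512):
    total = estimate_params(d_model=width, vocab_size=8192)
    print(f"d_model={width}: {total / 1e6:.1f}M parameters")
```

Dropping the FFN removes what is usually the largest share of a transformer block's parameters, which is how 20 layers can plausibly fit in a budget of this size.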
Key highlights
- Runs at 6,000 tok/s prefill and 1,200 tok/s decode on the Cactus runtime (claimed, not independently verified)
- Weights and training data fully open on HuggingFace
- One-command finetuning via the needle playground web UI or the CLI; auto-generates synthetic training data with Gemini
- Beats FunctionGemma-270M, Qwen-0.6B, Granite-350M, and LFM2.5-350M on single-shot function calling (per their benchmarks)
- Pretrained on 200B tokens in 27 hours on 16 TPU v6e chips; post-trained on 2B tokens in 45 minutes
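The headline numbers above can be turned into more tangible rates with simple arithmetic. The sketch below uses only the figures quoted in the list; the request size (300 prompt tokens, 20 output tokens) is a hypothetical example, and the latency estimate ignores everything except the claimed prefill and decode speeds.

```python
def tokens_per_second(tokens: float, hours: float) -> float:
    """Average throughput over a training run."""
    return tokens / (hours * 3600)


# Pretraining: 200B tokens in 27 hours across 16 chips.
pretrain_rate = tokens_per_second(200e9, 27)   # aggregate, roughly 2.06M tok/s
per_chip_rate = pretrain_rate / 16             # roughly 129k tok/s per chip

# Post-training: 2B tokens in 45 minutes (hardware not restated in the source).
posttrain_rate = tokens_per_second(2e9, 0.75)  # roughly 741k tok/s


def request_latency_s(
    prompt_tokens: int,
    output_tokens: int,
    prefill_tps: float = 6000,  # claimed prefill speed on the Cactus runtime
    decode_tps: float = 1200,   # claimed decode speed on the Cactus runtime
) -> float:
    """Naive latency estimate: prefill time plus decode time, nothing else."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


# Hypothetical request: 300 tokens of query plus tool schema, 20 tokens of output.
latency = request_latency_s(300, 20)  # about 0.067 s
```

If the claimed speeds hold, a typical tool-routing request would complete in well under a tenth of a second, though real latency also depends on tokenization, model loading, and the device itself.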
Caveats
- The README itself warns that “small models can be finicky” and overfit badly below ~120 examples per tool
- Explicitly not a conversational model — larger models “excel in conversational settings” where this one will flounder
- Speed claims are tied to the Cactus runtime, not standard PyTorch or llama.cpp
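Given the overfitting warning above, it may be worth checking a finetuning set for thin coverage before training. This is a minimal sketch: the example record layout (a "query" string and a "call" dict with a "name") is a hypothetical format, and 120 is the approximate threshold quoted from the README, not a hard rule.

```python
from collections import Counter

MIN_EXAMPLES_PER_TOOL = 120  # approximate threshold quoted from the README


def underrepresented_tools(examples, tool_names, minimum=MIN_EXAMPLES_PER_TOOL):
    """Return {tool_name: example_count} for tools with fewer than `minimum` examples."""
    counts = Counter(example["call"]["name"] for example in examples)
    return {
        name: counts[name]  # Counter returns 0 for tools with no examples
        for name in tool_names
        if counts[name] < minimum
    }


# Hypothetical dataset: plenty of light-switch examples, too few timer examples.
dataset = (
    [{"query": "turn off the lights", "call": {"name": "toggle_lights"}}] * 150
    + [{"query": "set a timer", "call": {"name": "set_timer"}}] * 40
)
thin = underrepresented_tools(dataset, ["toggle_lights", "set_timer", "play_music"])
# thin == {"set_timer": 40, "play_music": 0}
```

Tools flagged here are candidates for more synthetic examples before finetuning.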
Verdict
Grab this if you’re building a smartwatch, glasses, or phone assistant that needs to route voice commands to hardcoded tools and can’t afford a 1B+ model. Skip it if you need chit-chat, open-ended reasoning, or want to run inference without the Cactus stack.