A faster Ollama for Apple Silicon, with receipts
Rapid-MLX is an OpenAI-compatible local LLM server optimized for M-series Macs, claiming 2-4× speedups over Ollama and llama.cpp.

What it does
Rapid-MLX wraps Apple’s MLX framework into a drop-in OpenAI API replacement. Install via Homebrew or pip, run rapid-mlx serve, and point Cursor, Claude Code, Aider, or any OpenAI-compatible client at localhost:8000/v1. It handles model downloads, quantization, tool calling, prompt caching, and even vision/audio models through optional extras.
The interesting bit
The project ships a “Model-Harness Index” (MHI) — a weighted score combining tool-calling accuracy, HumanEval coding tasks, and MMLU knowledge retention — to tell you which model actually works with which agent framework. This is the boring compatibility matrix made useful: Qwopus 27B scores 92 across all tested harnesses, while Gemma 4 26B hits 100% tool calling with Hermes but 0% on HumanEval.
Key highlights
- Claims 160 tok/s on a 16 GB MacBook Air (Qwen3.5-4B) and up to 141 tok/s for a 30B model on 32 GB machines
- 17 tool parsers with “100% tool calling” on several model+harness combinations per MHI tables
- 0.08s cached TTFT (time-to-first-token) with prompt cache support
- One-command setup for popular agents:
rapid-mlx agents opencode --setupwires OpenCode automatically - Optional vision (~322 MB extra) and audio extras via mlx-vlm and mlx-audio
Caveats
- macOS-only; Apple Silicon required (M1-M4 supported)
- Python 3.10+ required — macOS still ships 3.9, so expect version headaches if not using Homebrew
- The “2-4× faster than Ollama” claim is stated but no independent benchmark methodology is shown in the README
Verdict
Mac developers already running local LLMs who are hitting Ollama’s speed ceiling should evaluate this. Windows or Linux users, and anyone without at least 16 GB unified memory, need not apply.