← all repositories
raullenchai/Rapid-MLX

A faster Ollama for Apple Silicon, with receipts

Rapid-MLX is an OpenAI-compatible local LLM server optimized for M-series Macs, claiming 2-4× speedups over Ollama and llama.cpp.

Rapid-MLX
Velocity · 7d
+26
★ / day
Trend
steady
star history

What it does

Rapid-MLX wraps Apple’s MLX framework into a drop-in OpenAI API replacement. Install via Homebrew or pip, run rapid-mlx serve, and point Cursor, Claude Code, Aider, or any OpenAI-compatible client at localhost:8000/v1. It handles model downloads, quantization, tool calling, prompt caching, and even vision/audio models through optional extras.

The interesting bit

The project ships a “Model-Harness Index” (MHI) — a weighted score combining tool-calling accuracy, HumanEval coding tasks, and MMLU knowledge retention — to tell you which model actually works with which agent framework. This is the boring compatibility matrix made useful: Qwopus 27B scores 92 across all tested harnesses, while Gemma 4 26B hits 100% tool calling with Hermes but 0% on HumanEval.

Key highlights

  • Claims 160 tok/s on a 16 GB MacBook Air (Qwen3.5-4B) and up to 141 tok/s for a 30B model on 32 GB machines
  • 17 tool parsers with “100% tool calling” on several model+harness combinations per MHI tables
  • 0.08s cached TTFT (time-to-first-token) with prompt cache support
  • One-command setup for popular agents: rapid-mlx agents opencode --setup wires OpenCode automatically
  • Optional vision (~322 MB extra) and audio extras via mlx-vlm and mlx-audio

Caveats

  • macOS-only; Apple Silicon required (M1-M4 supported)
  • Python 3.10+ required — macOS still ships 3.9, so expect version headaches if not using Homebrew
  • The “2-4× faster than Ollama” claim is stated but no independent benchmark methodology is shown in the README

Verdict

Mac developers already running local LLMs who are hitting Ollama’s speed ceiling should evaluate this. Windows or Linux users, and anyone without at least 16 GB unified memory, need not apply.

heatdrop uses Google Analytics to see which pages get read — nothing else. Your call. How we handle data.