Is vllm-mlx open source?

Yes — waybarrios/vllm-mlx is open source, released under the Apache-2.0 license.

What language is vllm-mlx written in?

waybarrios/vllm-mlx is primarily written in Python.

How popular is vllm-mlx?

waybarrios/vllm-mlx has 1.5k stars on GitHub.

Where can I find vllm-mlx?

waybarrios/vllm-mlx is on GitHub at https://github.com/waybarrios/vllm-mlx.

← all repositories

waybarrios/vllm-mlx

Apple Silicon gets a vLLM-style server with OpenAI and Anthropic APIs

It exposes both OpenAI and Anthropic APIs from a single Apple Silicon process so you can point Claude Code at a local Llama or Qwen model with continuous batching.

★1.5k stars Python Inference · Serving Language Models Agents Coding Assistants

View on GitHub ↗

Not currently ranked — collecting fresh signals.

star history

What it does

vllm-mlx is an inference server built on Apple’s MLX framework that ports vLLM-style serving—continuous batching, paged KV cache, and prefix caching—to M-series Macs. It bundles language, vision, audio, and embedding backends behind one HTTP server that speaks both OpenAI and Anthropic API dialects. Think of it as giving your MacBook a datacenter-grade scheduler without the datacenter.

The interesting bit

Most local servers settle for OpenAI compatibility and call it a day. This one also implements Anthropic’s /v1/messages endpoint, so Claude Code and other Anthropic-native clients connect unmodified. It will even spill its KV cache to SSD or preload warm prompts at startup to shave time-to-first-token—optimizations you don’t usually see on a laptop.

Key highlights

Benchmarks show 417 tok/s decode for a small Qwen3 model and ~128 tok/s for a 30B MoE on an M4 Max.
Handles LLMs, vision models like Qwen-VL and Pixtral, Whisper STT, and native TTS from a single process.
Supports reasoning extraction for Qwen3 and DeepSeek-R1, structured JSON output, and MCP tool calling.
Includes a built-in benchmarker and Prometheus /metrics for prod-like observability.
Runs fully offline on unified memory with no model conversion step.

Caveats

Apple Silicon only—no CUDA, Linux, or Windows path exists.
The reranker supports only standard BERT activations and fails explicitly on custom architectures.
Audio support requires additional system dependencies and spacy language models beyond the base install.

Verdict

A solid choice if you want one local Mac backend that satisfies both OpenAI and Anthropic SDKs. Look elsewhere if your hardware has an NVIDIA logo on it.

Frequently asked

What is waybarrios/vllm-mlx?: It exposes both OpenAI and Anthropic APIs from a single Apple Silicon process so you can point Claude Code at a local Llama or Qwen model with continuous batching.
Is vllm-mlx open source?: Yes — waybarrios/vllm-mlx is open source, released under the Apache-2.0 license.
What language is vllm-mlx written in?: waybarrios/vllm-mlx is primarily written in Python.
How popular is vllm-mlx?: waybarrios/vllm-mlx has 1.5k stars on GitHub.
Where can I find vllm-mlx?: waybarrios/vllm-mlx is on GitHub at https://github.com/waybarrios/vllm-mlx.