jjang-ai/vmlx
Self-hosted inference server for LLMs, VLMs, and image generation models running on Apple Silicon hardware using the MLX framework.

vMLX is an inference server designed specifically for Apple Silicon (M1-M4 chips) running MLX-optimized language models. It provides OpenAI, Anthropic, and Ollama compatible HTTP APIs, enabling self-hosted LLM deployment without third-party API keys. The server implements advanced optimizations including L2 disk-based KV cache persistence, L1 paged memory management for fast time-to-first-token, hybrid SSM scheduling, and continuous batching to maximize throughput on Apple hardware.