← all repositories
dphnAI/sonar

The vLLM-based engine that eats 92% of your GPU by design

Aphrodite Engine wraps vLLM in an OpenAI-compatible API and adds enough quantization formats to make a compression researcher weep.

sonar
Not currently ranked — collecting fresh signals.
star history

What it does Aphrodite is a C++ inference engine for serving HuggingFace transformers at scale to multiple concurrent users. It exposes an OpenAI-compatible REST API and serves as the production backend for PygmalionAI’s chat infrastructure. Under the hood, it leans heavily on vLLM’s PagedAttention and continuous batching, then layers on its own CUDA kernel optimizations and serving logic.

The interesting bit The project’s real obsession appears to be quantization: the README lists nearly twenty distinct formats, from AWQ and GPTQ to ExLlamaV3, BitNet, and NVIDIA’s FP4 variants, plus quantized KV caches in FP8 and TurboQuant. It also packs in speculative decoding via EAGLE, DFlash, ngram, and MTP, alongside chat-centric samplers like DRY and XTC that you won’t find in a typical base vLLM setup.

Key highlights

  • Supports an almost comically broad range of quantization schemes (AQLM, GGUF, Marlin, MXFP4, VPTQ, and many more)
  • Runs speculative decoding via multiple algorithms including EAGLE and MTP
  • Offers multimodal support and multi-LoRA serving
  • Ships with an OpenAI-compatible API server on port 2242 by default
  • Aggressively allocates 92% of available VRAM out of the box, assuming you’re serving at scale

Caveats

  • Requires CUDA 12 and is officially limited to Linux and WSL2
  • The 92% default VRAM allocation will surprise anyone running it on a shared workstation or desktop
  • Python 3.14 support requires building from source

Verdict Worth a look if you’re running a chat or roleplay platform that needs exotic quantization and speculative decoding under an OpenAI-shaped API. Skip it if you just want a lightweight local LLM runner or lack CUDA 12.

Frequently asked

What is dphnAI/sonar?
Aphrodite Engine wraps vLLM in an OpenAI-compatible API and adds enough quantization formats to make a compression researcher weep.
Is sonar open source?
Yes — dphnAI/sonar is open source, released under the AGPL-3.0 license.
What language is sonar written in?
dphnAI/sonar is primarily written in C++.
How popular is sonar?
dphnAI/sonar has 1.8k stars on GitHub.
Where can I find sonar?
dphnAI/sonar is on GitHub at https://github.com/dphnAI/sonar.

heatdrop uses Google Analytics to see which pages get read — nothing else. Your call. How we handle data.