← all repositories
Ai00-X/ai00_server

RWKV inference without the CUDA tax

A Rust-based LLM server that runs on any Vulkan-capable GPU—including your laptop's integrated graphics.

ai00_server
Velocity · 7d
+0.6
★ / day
Trend
steady
star history

What it does

AI00 is an inference API server for the RWKV language model, built in Rust on top of the web-rwkv engine. It exposes OpenAI-compatible endpoints for chat, completions, and embeddings. You download a model, drop it in assets/models/, and run a single binary. A WebUI listens on port 65530.

The interesting bit

The project bets on Vulkan instead of CUDA, which means AMD cards and even integrated GPUs get acceleration. No PyTorch, no CUDA toolkit, no 5 GB of dependencies—just a compact binary. The README is unusually emphatic about this point, with multiple exclamation marks.

Key highlights

  • OpenAI API-compatible endpoints (/api/oai/v1/chat/completions, /api/oai/v1/embeddings, etc.)
  • Supports int8 and NF4 quantization, plus LoRA models
  • BNF sampling since v0.5: constrain output to valid JSON or other grammars by restricting next-token choices
  • Hot loading and switching of tuned initial states (LoRA hot-switching is still on the TODO list)
  • Models must be in Safetensors .st format; .pth files need conversion via included Python script or standalone converter

Caveats

  • The README’s claim that this is “your best choice” for a fast LLM API server is unsupported by benchmarks; no performance numbers are provided
  • Several TODO items remain unchecked, including hot loading of LoRA models
  • The project description mentions “RAG” and “AI agents,” but the README itself focuses on inference, chat, and embeddings—those broader features are unclear from the current documentation

Verdict

Worth a look if you’re running non-Nvidia hardware or just want to escape the CUDA/PyTorch dependency spiral. Skip it if you need mature RAG pipelines, agent frameworks, or proven throughput metrics.

heatdrop uses Google Analytics to see which pages get read — nothing else. Your call. How we handle data.