RWKV inference without the CUDA tax
A Rust-based LLM server that runs on any Vulkan-capable GPU—including your laptop's integrated graphics.

What it does
AI00 is an inference API server for the RWKV language model, built in Rust on top of the web-rwkv engine. It exposes OpenAI-compatible endpoints for chat, completions, and embeddings. You download a model, drop it in assets/models/, and run a single binary. A WebUI listens on port 65530.
The interesting bit
The project bets on Vulkan instead of CUDA, which means AMD cards and even integrated GPUs get acceleration. No PyTorch, no CUDA toolkit, no 5 GB of dependencies—just a compact binary. The README is unusually emphatic about this point, with multiple exclamation marks.
Key highlights
- OpenAI API-compatible endpoints (
/api/oai/v1/chat/completions,/api/oai/v1/embeddings, etc.) - Supports
int8andNF4quantization, plus LoRA models - BNF sampling since v0.5: constrain output to valid JSON or other grammars by restricting next-token choices
- Hot loading and switching of tuned initial states (LoRA hot-switching is still on the TODO list)
- Models must be in Safetensors
.stformat;.pthfiles need conversion via included Python script or standalone converter
Caveats
- The README’s claim that this is “your best choice” for a fast LLM API server is unsupported by benchmarks; no performance numbers are provided
- Several TODO items remain unchecked, including hot loading of LoRA models
- The project description mentions “RAG” and “AI agents,” but the README itself focuses on inference, chat, and embeddings—those broader features are unclear from the current documentation
Verdict
Worth a look if you’re running non-Nvidia hardware or just want to escape the CUDA/PyTorch dependency spiral. Skip it if you need mature RAG pipelines, agent frameworks, or proven throughput metrics.