brontoguana/krasis
A Rust/CUDA-based LLM runtime for efficiently running multi-hundred-billion-parameter MoE models on consumer NVIDIA GPUs.

Krasis is an inference engine that runs large language models on consumer-grade hardware by managing expert residency between VRAM and CPU RAM. It moves performance-critical paths from Python to Rust and CUDA, providing GPU-accelerated prefill and decode, HQQ attention caching, and compact KV cache formats such as k6v6 and k4v4. The project includes an OpenAI-compatible API, an interactive launcher, benchmark tooling, and GitHub-release-based installation.
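
To make the idea of expert residency concrete, here is a minimal sketch of a tracker that keeps a fixed number of MoE experts in VRAM and evicts the least recently used one back to CPU RAM. This is an illustration only: the description above does not state which eviction policy Krasis uses, so LRU is an assumption, and the names `ExpertCache`, `Residency`, and `touch` are hypothetical and not taken from the Krasis codebase. No real weight transfer happens here; the sketch only tracks which expert ids would be resident.

```rust
use std::collections::VecDeque;

/// Where an expert's weights currently live (hypothetical model, not Krasis's types).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Residency {
    Vram,
    CpuRam,
}

/// Hypothetical LRU tracker for a fixed number of VRAM expert slots.
pub struct ExpertCache {
    capacity: usize,
    // Expert ids resident in VRAM, least recently used at the front.
    lru: VecDeque<usize>,
    hits: u64,
    misses: u64,
}

impl ExpertCache {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "need at least one VRAM slot");
        Self {
            capacity,
            lru: VecDeque::with_capacity(capacity),
            hits: 0,
            misses: 0,
        }
    }

    /// Report whether `expert` is tracked as resident in VRAM or left in CPU RAM.
    pub fn residency(&self, expert: usize) -> Residency {
        if self.lru.contains(&expert) {
            Residency::Vram
        } else {
            Residency::CpuRam
        }
    }

    /// Record that the router selected `expert`.
    /// Returns the expert id evicted to CPU RAM, if one had to make room.
    pub fn touch(&mut self, expert: usize) -> Option<usize> {
        if let Some(pos) = self.lru.iter().position(|&e| e == expert) {
            // Hit: move the expert to the most-recently-used end.
            self.lru.remove(pos);
            self.lru.push_back(expert);
            self.hits += 1;
            return None;
        }

        // Miss: a real engine would copy the weights host-to-device here.
        self.misses += 1;
        let evicted = if self.lru.len() == self.capacity {
            self.lru.pop_front()
        } else {
            None
        };
        self.lru.push_back(expert);
        evicted
    }

    pub fn hits(&self) -> u64 {
        self.hits
    }

    pub fn misses(&self) -> u64 {
        self.misses
    }
}
```

In this sketch, calling `touch` once per routed expert per token yields a hit/miss count, which gives a rough sense of how often weights would have to cross the PCIe bus for a given number of VRAM slots.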