vectorch-ai/ScaleLLM
A high-performance C++ inference runtime for large language models with GPU acceleration and speculative decoding.

Velocity (7d): +0.5 ★/day · Trend: steady
ScaleLLM is a production-grade LLM inference system written in C++. It uses CUDA for GPU-accelerated serving of large language models and supports popular open-source models, including Llama3.1, Gemma2, and Phi. For production workloads, it adds optimizations such as speculative decoding to improve throughput.
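To make the speculative decoding idea concrete, here is a minimal, self-contained sketch of the greedy variant in C++. It is not ScaleLLM's API or implementation: the `NextToken` type and the functions `speculative_step` and `speculative_generate` are hypothetical names invented for illustration, and each "model" is assumed to be a plain function that maps a token prefix to its greedy next token. A small draft model proposes `k` tokens, the target model checks them, and the longest matching prefix is kept, so the output is the same as decoding with the target model alone.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical toy interface (an assumption, not a ScaleLLM type):
// a "model" maps a token prefix to its greedy next token.
using NextToken = std::function<int(const std::vector<int>&)>;

// One round of greedy speculative decoding.
// Appends between 1 and k + 1 tokens to `tokens` and returns how many.
std::size_t speculative_step(const NextToken& draft, const NextToken& target,
                             std::vector<int>& tokens, std::size_t k) {
    // 1. The cheap draft model proposes k tokens autoregressively.
    std::vector<int> proposal;
    std::vector<int> ctx = tokens;
    for (std::size_t i = 0; i < k; ++i) {
        const int t = draft(ctx);
        proposal.push_back(t);
        ctx.push_back(t);
    }

    // 2. The target model verifies the proposal position by position.
    //    A real system would score all k positions in one batched forward
    //    pass; this sketch calls the target once per position for clarity.
    std::size_t appended = 0;
    for (std::size_t i = 0; i < k; ++i) {
        const int expected = target(tokens);
        tokens.push_back(expected);  // the target's token is always kept
        ++appended;
        if (expected != proposal[i]) {
            return appended;  // first mismatch: discard the rest of the draft
        }
    }

    // 3. Every draft token was accepted, so the target adds one bonus token.
    tokens.push_back(target(tokens));
    return appended + 1;
}

// Generates exactly max_new tokens after the prompt.
std::vector<int> speculative_generate(const NextToken& draft,
                                      const NextToken& target,
                                      const std::vector<int>& prompt,
                                      std::size_t max_new, std::size_t k) {
    std::vector<int> tokens = prompt;
    while (tokens.size() - prompt.size() < max_new) {
        speculative_step(draft, target, tokens, k);
    }
    tokens.resize(prompt.size() + max_new);  // trim any overshoot from the last round
    return tokens;
}
```

In this sketch the gain comes from step 2: when the draft model agrees with the target often, several tokens are confirmed per verification round, and a batched target pass can check them at roughly the cost of generating one. Sampling-based variants replace the exact-match check with a probabilistic accept/reject rule, which this example does not cover.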