
vectorch-ai/ScaleLLM

A high-performance C++ inference runtime for large language models with GPU acceleration and speculative decoding.

500 stars · C++ · Inference · Serving
Velocity (7d): +0.5 ★/day · Trend: steady

ScaleLLM is an LLM inference system written in C++ and aimed at production serving. It uses CUDA for GPU acceleration and supports popular open-source models, including Llama 3.1, Gemma 2, and Phi. Its optimizations include speculative decoding, which is intended to improve throughput.
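For readers unfamiliar with speculative decoding, the following is a minimal sketch of the greedy variant of the general technique. It is not ScaleLLM's implementation or API: `Token`, `Model`, `toy_target`, `toy_draft`, `greedy_decode`, and `speculative_decode` are all hypothetical names, and the "models" are toy functions standing in for real networks. A cheap draft model proposes `k` tokens, the target model checks them, and the longest agreeing prefix is kept, so the output matches what the target would have produced on its own.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

using Token = int;
// A "model" here is just a function from the context to the next greedy token.
using Model = std::function<Token(const std::vector<Token>&)>;

// Toy target model (hypothetical): a deterministic function of the last token.
Token toy_target(const std::vector<Token>& ctx) {
    return ctx.empty() ? 0 : (ctx.back() * 3 + 1) % 10;
}

// Toy draft model (hypothetical): agrees with the target except after token 7.
Token toy_draft(const std::vector<Token>& ctx) {
    Token t = toy_target(ctx);
    return (!ctx.empty() && ctx.back() == 7) ? (t + 1) % 10 : t;
}

// Reference: plain greedy decoding with the target model only.
std::vector<Token> greedy_decode(const Model& target, std::vector<Token> ctx,
                                 std::size_t n_new) {
    for (std::size_t i = 0; i < n_new; ++i) ctx.push_back(target(ctx));
    return ctx;
}

// Greedy speculative decoding: draft proposes k tokens, target verifies them.
std::vector<Token> speculative_decode(const Model& target, const Model& draft,
                                      std::vector<Token> ctx, std::size_t n_new,
                                      std::size_t k) {
    const std::size_t goal = ctx.size() + n_new;
    while (ctx.size() < goal) {
        // 1. The draft model extends a copy of the context by k tokens.
        std::vector<Token> proposal = ctx;
        for (std::size_t i = 0; i < k; ++i) proposal.push_back(draft(proposal));

        // 2. The target checks each proposed position in order. A real system
        //    would score all k positions in one batched forward pass; this
        //    sketch calls the target once per position for clarity.
        for (std::size_t i = 0; i <= k && ctx.size() < goal; ++i) {
            const Token expected = target(ctx);
            // Position i == k has no draft token: it is the "bonus" token.
            const bool accepted = (i < k) && proposal[ctx.size()] == expected;
            ctx.push_back(expected);  // always the target's own choice
            if (!accepted) break;     // first mismatch ends this round
        }
    }
    return ctx;
}
```

Calling `speculative_decode(toy_target, toy_draft, prompt, n, k)` should return the same sequence as `greedy_decode(toy_target, prompt, n)` for any `k >= 1`. Any speedup in a real system comes from verifying several draft tokens per target pass, which this sequential sketch does not model.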
