ztxz16/fastllm

A backend-independent, C++-based high-performance large language model inference library supporting dense and MoE architectures with tensor parallelism and FP8/INT4 quantization.

4.8k stars C++ Inference · Serving
Velocity (7d): +4.2 ★/day · Trend: steady

fastllm replaces PyTorch with custom C++ operators to deliver high-throughput LLM inference. It supports dense models (Qwen, Llama, Phi) and MoE models (DeepSeek, Qwen-moe), with tensor parallelism across multiple GPUs and mixed CPU/GPU inference for models too large to fit in GPU memory. The project reports 20+ tokens per second for full-precision DeepSeek R1 671B and 30+ tokens per second for INT4-quantized variants on a single GPU, and it targets production deployment.
