ztxz16/fastllm
A backend-independent, high-performance C++ library for large language model inference, supporting dense and MoE architectures, tensor parallelism, and FP8/INT4 quantization.

fastllm replaces PyTorch with custom C++ operators to deliver high-throughput LLM inference. It supports dense models (Qwen, Llama, Phi) and MoE models (DeepSeek, Qwen-moe), with tensor parallelism across multiple GPUs and mixed CPU/GPU inference for models too large to fit in GPU memory. The project reports 20+ tokens per second for full-precision DeepSeek R1 671B and 30+ tokens per second for INT4-quantized variants on a single GPU, and it targets production deployment.
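
To see why a 671B-parameter model calls for quantization and mixed CPU/GPU inference, a back-of-the-envelope estimate of weight storage helps. The sketch below is not part of fastllm: `weight_gib` and `fits_on_device` are hypothetical helper names, and the estimate assumes every weight is stored at a uniform bit width, ignoring quantization scales, activations, and the KV cache.

```cpp
#include <cstdio>

// Hypothetical helper (not a fastllm API): approximate weight storage in GiB
// for a model with `params` parameters stored at `bits_per_weight` bits each.
// Ignores quantization scales, activations, and the KV cache.
double weight_gib(double params, double bits_per_weight) {
    const double bytes = params * bits_per_weight / 8.0;
    return bytes / (1024.0 * 1024.0 * 1024.0);
}

// Hypothetical helper: true if the estimated weights fit within
// `device_gib` GiB of device memory.
bool fits_on_device(double params, double bits_per_weight, double device_gib) {
    return weight_gib(params, bits_per_weight) <= device_gib;
}

// Print the estimate for one bit width.
void print_estimate(const char* label, double params, double bits_per_weight) {
    std::printf("%s: %.0f GiB\n", label, weight_gib(params, bits_per_weight));
}
```

Under these assumptions, `weight_gib(671e9, 8)` comes to roughly 625 GiB and `weight_gib(671e9, 4)` to roughly 312 GiB. Both are far beyond the memory of a single GPU, which is why most of the weights have to live in CPU memory in a single-GPU setup.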