intel/xFasterTransformer
An optimized inference solution for running large language models on Intel x86/Xeon platforms.

Velocity (7d): +0.4 ★/day · Trend: → steady
xFasterTransformer is a CPU counterpart to the GPU-oriented FasterTransformer, built for LLM inference on Intel Xeon processors. It supports distributed inference across multiple sockets and nodes so that larger models can be run, and it offers both C++ and Python APIs, ranging from high-level to low-level interfaces. Supported model architectures include Qwen, DeepSeek-R1, ChatGLM, and LLaMA, and the project integrates with vLLM for OpenAI-compatible serving.
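
To make "OpenAI-compatible serving" concrete, here is a minimal client-side sketch in Python. It only builds a request for the standard OpenAI-style `/v1/completions` route and sends nothing until `send` is called. The host, port, and model name are placeholders assumed for illustration, not values taken from the project, and the helper names `build_completion_request` and `send` are hypothetical.

```python
import json
import urllib.request

# Assumptions for illustration only: neither value comes from the project.
BASE_URL = "http://localhost:8000"  # hypothetical address of a running server
MODEL_NAME = "xft"                  # hypothetical served model name


def build_completion_request(prompt, max_tokens=64,
                             base_url=BASE_URL, model=MODEL_NAME):
    """Build (but do not send) a POST to the OpenAI-style completions route."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=30.0):
    """Send the request and decode the JSON reply; needs a live server."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    req = build_completion_request("Once upon a time")
    # Inspect the request without contacting any server.
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Because the route follows the OpenAI convention, an existing OpenAI client library pointed at the server's base URL should work as well; the standard-library version above just keeps the sketch dependency-free.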