meta-pytorch/gpt-fast
Minimalist PyTorch-based inference engine for transformer text generation models including LLaMA, Mixtral, Gemma, Grok-1, and DBRX.

gpt-fast is a lightweight text generation framework written in under 1000 lines of Python that leverages native PyTorch operations for high-performance inference. It supports key optimization techniques including int8 and int4 quantization, speculative decoding, and tensor parallelism across Nvidia and AMD GPUs. The project targets various transformer architectures from the LLaMA family and Mixture-of-Experts models, positioned as an educational reference implementation rather than a production framework.