huggingface/transformers-bloom-inference
Fast inference implementation for the BLOOM 176B language model with multi-GPU support and int8 quantization.

Velocity (7d): +0.4 ★/day · Trend: steady
This repository provides demos and packages for efficient inference with the BLOOM large language model. It supports two backends, Hugging Face Accelerate and DeepSpeed Inference, each deployable in fp16/bf16 or int8 on multi-GPU setups. For int8, the Accelerate path uses LLM.int8() and the DeepSpeed path uses ZeroQuant; both are post-training quantization methods that reduce the memory footprint while aiming to preserve generation quality.
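To make the memory argument concrete, here is a rough back-of-the-envelope sketch of the weights-only footprint at each precision. It is not code from the repository: the helper names are hypothetical, the parameter count is rounded to 176 billion, and the 80 GB per-GPU figure is an assumption chosen for illustration. Activations, the KV cache, and framework overhead are ignored, so the GPU counts are lower bounds.

```python
import math

# Approximate parameter count of BLOOM 176B (rounded; an assumption for this estimate).
BLOOM_PARAMS = 176_000_000_000

# Bytes needed to store one weight at each precision.
BYTES_PER_PARAM = {"fp16": 2, "bf16": 2, "int8": 1}


def weight_memory_gb(dtype: str, n_params: int = BLOOM_PARAMS) -> float:
    """Weights-only memory in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


def min_gpus(dtype: str, gpu_memory_gb: float = 80.0) -> int:
    """Lower bound on the GPU count needed just to hold the weights.

    gpu_memory_gb defaults to 80, a hypothetical per-device capacity.
    """
    return math.ceil(weight_memory_gb(dtype) / gpu_memory_gb)


if __name__ == "__main__":
    for dtype in ("fp16", "bf16", "int8"):
        print(f"{dtype}: {weight_memory_gb(dtype):.0f} GB weights, "
              f"at least {min_gpus(dtype)} GPUs")
```

Under these assumptions, int8 halves the weight storage relative to fp16/bf16, which is what lets the model fit on fewer devices. Real deployments need headroom beyond these figures.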