Is transformers-bloom-inference open source?

Yes — huggingface/transformers-bloom-inference is open source, released under the Apache-2.0 license.

What language is transformers-bloom-inference written in?

huggingface/transformers-bloom-inference is primarily written in Python.

How popular is transformers-bloom-inference?

huggingface/transformers-bloom-inference has 566 stars on GitHub.

Where can I find transformers-bloom-inference?

huggingface/transformers-bloom-inference is on GitHub at https://github.com/huggingface/transformers-bloom-inference.

huggingface/transformers-bloom-inference

This BLOOM 176B Serving Kit Has Surrendered to vLLM

Demo scripts and server wrappers for running BLOOM 176B on multi-A100 clusters using Accelerate and DeepSpeed, now archived.

★ 566 stars | Python | Inference · Serving | Language Models

What it does

This repository houses demo scripts, benchmark harnesses, and a bare-bones synchronous server for running inference on the 176-billion-parameter BLOOM model. It mainly glues together Hugging Face Accelerate and DeepSpeed Inference, and supports fp16 and bf16 half precision as well as int8 quantization. The authors explicitly note that the scripts are tightly coupled to specific hardware: they were tested only on eight A100 80GB GPUs for fp16/bf16 and on four for int8, and may not transfer cleanly to other models or GPU topologies.
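The GPU counts make more sense with some back-of-the-envelope arithmetic. The sketch below is not from the repository; it estimates the memory needed for the weights alone, assuming 2 bytes per parameter for fp16/bf16, 1 byte for int8, and decimal gigabytes. Activations, the KV cache, and framework overhead are ignored, so the real headroom is smaller.

```python
# Weights-only memory estimate for a 176B-parameter model.
# Assumptions (not from the repository): decimal GB, no activation or
# KV-cache memory, no framework overhead.
PARAMS = 176e9                 # parameter count implied by the model name
GPU_MEMORY_GB = 80             # one A100 80GB card

BYTES_PER_PARAM = {"fp16": 2, "bf16": 2, "int8": 1}
GPUS_TESTED = {"fp16": 8, "bf16": 8, "int8": 4}   # counts reported as tested


def weight_footprint_gb(dtype: str) -> float:
    """Memory taken by the weights alone, in decimal gigabytes."""
    return PARAMS * BYTES_PER_PARAM[dtype] / 1e9


def headroom_gb(dtype: str) -> float:
    """Aggregate GPU memory left over once the weights are loaded."""
    return GPUS_TESTED[dtype] * GPU_MEMORY_GB - weight_footprint_gb(dtype)


for dtype in ("fp16", "bf16", "int8"):
    print(
        f"{dtype}: weights ~{weight_footprint_gb(dtype):.0f} GB, "
        f"~{headroom_gb(dtype):.0f} GB left across {GPUS_TESTED[dtype]} GPUs"
    )
```

Under these assumptions the half-precision weights need roughly 352 GB and the int8 weights roughly 176 GB, which is why halving the bytes per parameter lets the tested setup drop from eight cards to four.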

The interesting bit

The repo is archived because the maintainers themselves concede that “a lot more efficient serving frameworks have been released recently like vLLM and TGI.” That honesty is refreshing. The repository is essentially a historical artifact of the pre-vLLM era, complete with a self-described “crappy” Flask UI and server logic borrowed from DeepSpeed MII. It also offers a rare side-by-side comparison of two quantization recipes, LLM.int8() via Accelerate and ZeroQuant via DeepSpeed, on the same massive target model.
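For readers new to int8 inference, here is a minimal sketch of the kind of symmetric absmax scaling that int8 schemes commonly start from. It is a simplification and reproduces neither LLM.int8() nor ZeroQuant, both of which add considerably more machinery; the function names and sample weights are made up for illustration.

```python
# Illustrative only: one scale per row, symmetric int8 range [-127, 127].
# This is NOT the LLM.int8() or ZeroQuant algorithm.

def quantize_absmax(row):
    """Map a row of floats to int8 values using a single absmax scale."""
    max_abs = max(abs(x) for x in row)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range.
    q = [max(-127, min(127, round(x / scale))) for x in row]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the int8 values."""
    return [v * scale for v in q]


weights = [0.02, -0.5, 0.31, 1.27]          # hypothetical weight row
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print("int8 values:", q)
print("scale:", scale)
print("max round-trip error:", max_error)
```

Each value is stored in one byte instead of two, at the cost of a rounding error of at most half a quantization step per weight.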

Key highlights

Archived and unmaintained; superseded by vLLM and TGI.
Targets BLOOM 176B specifically, tested on 8× A100 80GB (fp16/bf16) or 4× A100 80GB (int8).
Supports both Hugging Face Accelerate and DeepSpeed Inference backends.
Includes CLI, benchmark utilities, a synchronous generation server, and a minimal web UI.
Quantization paths differ by backend: LLM.int8() for Accelerate, ZeroQuant for DeepSpeed.

Caveats

The serving method is strictly synchronous: requests are processed one at a time, so each user waits until the requests ahead of them finish.
The bundled UI is rudimentary, and the authors openly apologize for its design.
Scripts are not guaranteed to work with other models or different GPU counts.
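To see what the synchronous caveat means in practice, consider this toy sketch. It does not use the repository's server code; `fake_generate` is a hypothetical stand-in that sleeps instead of running BLOOM, and a single lock plays the role of the one-at-a-time model.

```python
import threading
import time

# One lock guards the "model", so only one request generates at a time.
_generation_lock = threading.Lock()
STEP_SECONDS = 0.05   # pretend cost of one generation call


def fake_generate(prompt: str) -> str:
    """Hypothetical stand-in for a model call; sleeps instead of generating."""
    time.sleep(STEP_SECONDS)
    return prompt + " ..."


def handle_request(prompt: str, finished: list) -> None:
    # Every caller blocks here until the lock is free.
    with _generation_lock:
        finished.append((prompt, fake_generate(prompt)))


finished = []
start = time.perf_counter()
threads = [
    threading.Thread(target=handle_request, args=(f"prompt {i}", finished))
    for i in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start

print(f"{len(finished)} requests served in {elapsed:.2f}s")
```

Four concurrent callers take about four times as long as one, because total latency grows linearly with queue depth. Batching-oriented servers such as vLLM and TGI exist largely to avoid this behavior.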

Verdict

Grab this if you are studying the evolution of large-model serving or need a reference for legacy BLOOM 176B deployments on the A100 configurations it was tested on. Skip it if you are building a new production pipeline: vLLM and TGI are the maintainers’ own recommended escape hatches.
