Is FlexLLMGen open source?

Yes — FMInference/FlexLLMGen is open source, released under the Apache-2.0 license.

What language is FlexLLMGen written in?

FMInference/FlexLLMGen is primarily written in Python.

How popular is FlexLLMGen?

FMInference/FlexLLMGen has 9.4k stars on GitHub.

Where can I find FlexLLMGen?

FMInference/FlexLLMGen is on GitHub at https://github.com/FMInference/FlexLLMGen.

FMInference/FlexLLMGen

A 175B-parameter model on one GPU, if you can wait

It lets you batch-process millions of tokens on cheap, limited hardware by trading latency for throughput and using CPU and disk as extended memory tiers.

★9.4k stars Python Inference · Serving Language Models

What it does

FlexLLMGen is an inference engine that runs large language models, including OPT-175B, on a single commodity GPU by spilling weights, activations, and KV caches to CPU RAM and even local SSD. It is built for throughput-oriented batch jobs like benchmarking, data wrangling, and document classification, where the metric that matters is total tokens processed per dollar, not time-to-first-token. A linear programming optimizer searches for the best placement of tensors across the memory hierarchy, while the system batches requests aggressively to keep the GPU fed despite the offloading overhead.
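Why aggressive batching helps can be seen in a toy cost model. The sketch below is not taken from FlexLLMGen; the function name and both timing constants are made-up assumptions chosen only to show the shape of the trade-off: a fixed cost for streaming offloaded weights is paid once per step, so larger batches raise throughput while also raising latency.

```python
def batch_cost(batch_size, weight_load_s=10.0, compute_s_per_seq=0.05):
    """Return (latency_s, throughput_seq_per_s) for one decoding step.

    Toy model, not FlexLLMGen's cost model: each step streams the offloaded
    weights once (fixed cost) and then does per-sequence compute.
    Both default constants are illustrative assumptions.
    """
    latency = weight_load_s + batch_size * compute_s_per_seq
    throughput = batch_size / latency
    return latency, throughput


if __name__ == "__main__":
    for b in (1, 16, 256):
        latency, throughput = batch_cost(b)
        print(f"batch={b:4d}  latency={latency:6.2f}s  throughput={throughput:6.2f} seq/s")
```

In this model, throughput approaches 1 / compute_s_per_seq as the batch grows, which is why the fixed transfer cost matters less and less for large batch jobs and dominates for small interactive ones.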

The interesting bit

Instead of pretending offloading is free, FlexLLMGen leans into the latency-throughput trade-off: it uses massive effective batch sizes and 4-bit compression on both weights and the attention cache to turn painful memory transfers into sustained throughput. The authors claim this pushes the Pareto frontier far beyond what other offloading systems can reach before they run out of memory.
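To make the 4-bit compression idea concrete, here is a minimal sketch of min-max quantization applied to one small group of values. It assumes a group-wise scheme with a per-group minimum and scale; the function names are hypothetical, and FlexLLMGen's actual kernels, group size, and storage layout may differ.

```python
def quantize_group(values, bits=4):
    """Min-max quantize one group of floats to integer codes in [0, 2**bits - 1].

    Illustrative only: returns the codes plus the (lo, scale) metadata needed
    to reconstruct approximate values.
    """
    levels = (1 << bits) - 1                 # 15 steps for 4 bits
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0        # fall back to 1.0 for constant groups
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale


def dequantize_group(codes, lo, scale):
    """Map integer codes back to approximate floats."""
    return [lo + c * scale for c in codes]


if __name__ == "__main__":
    group = [-1.0, -0.5, 0.0, 0.25, 1.0]
    codes, lo, scale = quantize_group(group)
    restored = dequantize_group(codes, lo, scale)
    worst = max(abs(a - b) for a, b in zip(group, restored))
    print(codes, f"max abs error = {worst:.4f}")
```

Each value costs 4 bits instead of 16, at the price of a rounding error of at most half a quantization step per element, plus the small per-group metadata.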

Key highlights

Runs OPT-175B on an NVIDIA T4 with 16 GB VRAM, 208 GB DRAM, and 1.5 TB SSD in the reference benchmark.
Published benchmarks show it outperforming Hugging Face Accelerate, DeepSpeed ZeRO-Inference, and Petals on throughput for large models, particularly when offloading to CPU or disk.
Compresses weights and KV cache to 4 bits with what the authors describe as negligible accuracy loss.
Exposes a Hugging Face-style model.generate API for easier integration.
Supports pipeline parallelism across distributed GPUs when aggregated memory is still insufficient.
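A quick back-of-the-envelope calculation shows why the reference machine needs all three memory tiers. The sketch below assumes 175 billion parameters stored at 16 bits each before compression, uses decimal gigabytes, and ignores the KV cache, activations, and quantization metadata, so the figures are rough lower bounds rather than measured numbers.

```python
PARAMS = 175e9   # OPT-175B parameter count, rounded (assumption)
GB = 1e9         # decimal gigabytes

def weight_gb(bits_per_param):
    """Approximate weight footprint in GB at the given precision."""
    return PARAMS * bits_per_param / 8 / GB

# Capacities from the reference benchmark hardware listed above.
VRAM_GB, DRAM_GB, SSD_GB = 16, 208, 1500

fp16_gb = weight_gb(16)
int4_gb = weight_gb(4)

if __name__ == "__main__":
    print(f"16-bit weights: {fp16_gb:.1f} GB, fits in VRAM+DRAM: {fp16_gb <= VRAM_GB + DRAM_GB}")
    print(f" 4-bit weights: {int4_gb:.1f} GB, fits in DRAM alone: {int4_gb <= DRAM_GB}")
```

Under these assumptions the uncompressed weights alone exceed the combined GPU and CPU memory, so some of them must live on SSD, while the 4-bit version would fit in DRAM with room left for the cache.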

Caveats

The authors explicitly warn that it is significantly slower than a full-GPU setup for small batches and is not designed for interactive or chat use.
Offloading strategies must be tuned manually via the --percent flag; an automatic policy optimizer is listed as future work.
Only a subset of HELM benchmark scenarios have been tested.
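The manual tuning mentioned above amounts to choosing how each tensor type is split across GPU, CPU, and disk. The sketch below assumes the --percent flag takes six integers (GPU and CPU percentages for weights, then the attention cache, then activations) with any remainder going to disk; this layout should be checked against the repository's README. The helper and type names are hypothetical and are not part of FlexLLMGen.

```python
from typing import NamedTuple


class TierSplit(NamedTuple):
    gpu: int
    cpu: int
    disk: int


def parse_percent(values):
    """Hypothetical helper: expand six percentages into per-tensor tier splits.

    Assumed order: weights GPU, weights CPU, KV cache GPU, KV cache CPU,
    activations GPU, activations CPU. The remainder is assigned to disk.
    """
    if len(values) != 6:
        raise ValueError("expected exactly six percentages")
    split = {}
    for i, name in enumerate(("weights", "kv_cache", "activations")):
        gpu, cpu = values[2 * i], values[2 * i + 1]
        if gpu < 0 or cpu < 0 or gpu + cpu > 100:
            raise ValueError(f"invalid split for {name}: {gpu}, {cpu}")
        split[name] = TierSplit(gpu, cpu, 100 - gpu - cpu)
    return split


if __name__ == "__main__":
    # Illustrative values: weights on CPU, cache and activations on GPU.
    for name, tiers in parse_percent([0, 100, 100, 0, 100, 0]).items():
        print(name, tiers)
```

Validating the split up front catches policies that add up to more than 100 percent before a long batch job is launched.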

Verdict

Worth a look if you are running overnight batch inference on a budget cloud instance or a lone workstation GPU. Look elsewhere if you need low-latency responses or already have enough VRAM to hold the entire model.
