FMInference/FlexLLMGen
High-throughput LLM inference engine for running large language models on a single commodity GPU with memory-efficient offloading.

Velocity (7d): +7.7 ★/day · Trend: steady
FlexLLMGen is a generation engine designed to run large language models with limited GPU memory while maximizing throughput. It achieves high throughput through IO-efficient offloading, compression techniques, and large effective batch sizes. The system targets throughput-oriented workloads such as benchmarking, information extraction, and data wrangling, where aggregate tokens per second matters more than the latency of any single request.
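To make the throughput-versus-latency trade-off concrete, here is a minimal toy cost model, not FlexLLMGen's actual scheduler or API. It assumes each decoding step pays a fixed cost to bring offloaded weights onto the GPU plus a per-sequence compute cost. The function names, parameters, and numbers (`weight_load_s`, `compute_s_per_seq`, 1.0 s, 0.01 s) are hypothetical and chosen only for illustration.

```python
# Toy cost model (an illustrative assumption, not FlexLLMGen code):
# one decoding step = fixed weight-transfer time + per-sequence compute time.

def step_time_s(batch_size: int, weight_load_s: float, compute_s_per_seq: float) -> float:
    """Seconds for one decoding step that emits one token per sequence."""
    return weight_load_s + batch_size * compute_s_per_seq


def throughput_tokens_per_s(batch_size: int, weight_load_s: float, compute_s_per_seq: float) -> float:
    """Tokens generated per second across the whole batch."""
    return batch_size / step_time_s(batch_size, weight_load_s, compute_s_per_seq)


# Hypothetical costs: 1.0 s to stream weights per step, 0.01 s of compute per sequence.
for batch_size in (1, 8, 64):
    latency = step_time_s(batch_size, 1.0, 0.01)
    throughput = throughput_tokens_per_s(batch_size, 1.0, 0.01)
    print(f"batch={batch_size:3d}  step latency={latency:.2f} s  throughput={throughput:.2f} tok/s")
```

Under these assumptions, a larger batch spreads the fixed weight-transfer cost over more sequences, so throughput rises while each step takes longer. That is why this style of engine suits batch workloads better than interactive ones.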