FMInference/FlexLLMGen

High-throughput LLM inference engine for running large language models on a single commodity GPU with memory-efficient offloading.

Velocity (7d): +7.7 ★ / day · Trend: steady

FlexLLMGen is a generation engine designed to run large language models with limited GPU memory while maximizing throughput. It achieves high throughput through IO-efficient offloading, compression techniques, and large effective batch sizes. The system targets throughput-oriented workloads such as benchmarking, information extraction, and data wrangling where latency is less critical than processing many tokens per second.
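The trade-off between throughput and latency described above can be made concrete with a little arithmetic. The sketch below is illustrative only: `BlockSchedule` and its field names are hypothetical and not part of FlexLLMGen's API, and the timings are invented inputs chosen to make the arithmetic easy, not measurements. It assumes the effective batch size is the per-pass GPU batch size multiplied by the number of GPU batches processed per block.

```python
from dataclasses import dataclass


@dataclass
class BlockSchedule:
    """Hypothetical description of one throughput-oriented generation run."""

    gpu_batch_size: int    # sequences processed together in one GPU pass
    num_gpu_batches: int   # GPU batches grouped into one block (assumption)
    gen_len: int           # tokens generated per sequence
    total_seconds: float   # wall-clock time to finish the whole block

    @property
    def effective_batch_size(self) -> int:
        # Total number of sequences in flight for the block.
        return self.gpu_batch_size * self.num_gpu_batches

    @property
    def throughput(self) -> float:
        # Generated tokens per second, summed over every sequence in the block.
        return self.effective_batch_size * self.gen_len / self.total_seconds


# Invented example inputs: the larger block takes twice as long to finish
# (worse latency) but produces eight times as many tokens.
small = BlockSchedule(gpu_batch_size=4, num_gpu_batches=1, gen_len=32, total_seconds=64.0)
large = BlockSchedule(gpu_batch_size=4, num_gpu_batches=8, gen_len=32, total_seconds=128.0)

print(small.effective_batch_size, small.throughput)
print(large.effective_batch_size, large.throughput)
```

With these made-up inputs, every request in the larger block waits longer for its result, yet the block as a whole generates more tokens per second. That is the kind of workload the description targets: many tokens processed per second matters more than how quickly any single request returns.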
