xlite-dev/Awesome-LLM-Inference
A categorized collection of research papers on LLM and VLM inference optimization with associated open-source implementations.

Velocity (7d): +5.2 ★/day · Trend: steady
This repository aggregates academic and engineering papers on large language model inference optimization. It covers techniques including FlashAttention, PagedAttention, quantization (INT8/INT4), and parallelism strategies, as well as inference runtimes such as vLLM and TensorRT-LLM. The list is organized by topic and links each referenced paper to its code repository where one is available.