
flashinfer-ai/flashinfer

The Kernel Switchboard That SGLang and vLLM Both Use

It exists so LLM serving frameworks can stop rewriting the same CUDA kernels for every GPU generation and batching mode.

★ 6k stars · Python · Inference · Serving


Velocity (7d): +5.6 ★ / day

Trend: steady

What it does

FlashInfer is a kernel library and kernel generator for LLM inference that provides unified APIs for attention, GEMM, and mixture-of-experts operations. It acts as a dispatch layer across multiple backend implementations (FlashAttention-2/3, cuDNN, CUTLASS, and TensorRT-LLM), automatically selecting whichever fits the current hardware and workload. The library covers the full inference loop, from paged KV-cache management and multi-head latent attention (MLA) to sorting-free sampling, speculative decoding, and multi-node communication.
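To make "paged KV-cache management" concrete, here is a minimal pure-Python sketch of the kind of page table a paged attention kernel consumes. The field names (indptr, indices, last_page_len) mirror the CSR-style layout that FlashInfer's documentation describes for paged KV caches, but the `PageTable` class and `build_page_table` helper are hypothetical and exist only for illustration; a real allocator would recycle pages rather than hand them out consecutively.

```python
from dataclasses import dataclass


@dataclass
class PageTable:
    indptr: list[int]         # indptr[i]:indptr[i+1] slices `indices` for request i
    indices: list[int]        # physical page ids, concatenated across requests
    last_page_len: list[int]  # number of valid tokens in each request's final page


def build_page_table(seq_lens: list[int], page_size: int) -> PageTable:
    """Assign consecutive physical pages to each request (hypothetical allocator)."""
    indptr, indices, last_page_len = [0], [], []
    next_free_page = 0
    for n in seq_lens:
        if n <= 0:
            raise ValueError("each request needs at least one token")
        num_pages = (n + page_size - 1) // page_size  # ceiling division
        indices.extend(range(next_free_page, next_free_page + num_pages))
        next_free_page += num_pages
        indptr.append(len(indices))
        last_page_len.append(n - (num_pages - 1) * page_size)
    return PageTable(indptr, indices, last_page_len)


table = build_page_table(seq_lens=[5, 16, 1], page_size=4)
print(table.indptr)         # [0, 2, 6, 7]
print(table.indices)        # [0, 1, 2, 3, 4, 5, 6]
print(table.last_page_len)  # [1, 4, 1]
```

Because every request is described by a slice of one flat index array, requests of very different lengths can share a batch without padding the cache to the longest sequence.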

The interesting bit

Instead of committing to a single kernel strategy, FlashInfer treats high-performance kernels as a portfolio problem: it brings together implementations from competing ecosystems and routes to the best one at runtime. It also exposes a JIT path for custom attention variants, meaning you can inject your own fused operations without forking the whole stack.
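The portfolio idea can be sketched as a small registry in which each backend declares what it can run and the dispatcher picks the highest-priority match. Everything here is an assumption made for illustration: the backend names, the priority scheme, and the support rules are invented, and this is not how FlashInfer's own selection logic is written.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Backend:
    name: str
    priority: int                         # higher wins when several backends qualify
    supports: Callable[[int, str], bool]  # (sm, dtype) -> can this backend run it?


# Illustrative rules only: these predicates are made up for the example.
# `sm` is the compute capability as an integer, e.g. 75 for SM7.5.
BACKENDS = [
    Backend("generic", 0, lambda sm, dtype: dtype in {"fp16", "bf16"}),
    Backend("hopper_tuned", 10, lambda sm, dtype: sm == 90 and dtype in {"fp16", "bf16", "fp8"}),
    Backend("blackwell_tuned", 20, lambda sm, dtype: sm >= 100),
]


def pick_backend(sm: int, dtype: str) -> str:
    """Return the name of the highest-priority backend that supports the request."""
    candidates = [b for b in BACKENDS if b.supports(sm, dtype)]
    if not candidates:
        raise NotImplementedError(f"no backend for sm={sm}, dtype={dtype}")
    return max(candidates, key=lambda b: b.priority).name


print(pick_backend(75, "fp16"))   # generic
print(pick_backend(90, "fp8"))    # hopper_tuned
print(pick_backend(120, "fp4"))   # blackwell_tuned
```

The caller only ever names the operation and its inputs. Adding a new backend means appending one registry entry, which is the property that lets a serving framework pick up new kernels without changing its own code.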

Key highlights

Backend-agnostic dispatching across FlashAttention, cuDNN, CUTLASS, and TensorRT-LLM
Broad GPU coverage from Turing (SM7.5) through Blackwell (SM12.x), with FP8 and FP4 low-precision paths
Native support for DeepSeek-style multi-head latent attention (MLA), cascade KV-cache sharing, and fused prefill-decode (POD) batching
Adopted by SGLang, vLLM, TensorRT-LLM, TGI, MLC-LLM, and others
CUDAGraph and torch.compile compatible for low-latency production serving
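Cascade KV-cache sharing in the list above rests on a simple identity: attention over a shared prefix and attention over a request's own suffix can be computed separately and then merged exactly, as long as each partial result carries the log-sum-exp of its scores. The following pure-Python sketch shows that general merge for a single query. It omits the usual 1/sqrt(d) scaling, uses hypothetical helper names, and is not a description of FlashInfer's kernels.

```python
import math


def attend(q, keys, values):
    """Single-query softmax attention; returns (output, log-sum-exp of the scores)."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]
    return out, m + math.log(total)


def merge(out_a, lse_a, out_b, lse_b):
    """Combine two partial attention results as if computed over the joined KV."""
    m = max(lse_a, lse_b)
    wa, wb = math.exp(lse_a - m), math.exp(lse_b - m)
    out = [(wa * a + wb * b) / (wa + wb) for a, b in zip(out_a, out_b)]
    return out, m + math.log(wa + wb)


q = [0.5, -1.0]
prefix_k, prefix_v = [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]]  # shared prefix
suffix_k, suffix_v = [[1.0, 1.0]], [[5.0, 6.0]]                          # per-request suffix

full, _ = attend(q, prefix_k + suffix_k, prefix_v + suffix_v)
merged, _ = merge(*attend(q, prefix_k, prefix_v), *attend(q, suffix_k, suffix_v))
print(all(abs(f - g) < 1e-9 for f, g in zip(full, merged)))  # True
```

Since the merge is exact, the prefix part can be stored once and read by every request that shares it, instead of being duplicated per request.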

Caveats

Feature availability is fragmented across GPU generations; the README explicitly warns that not all capabilities run on every compute architecture
The default package compiles or downloads kernels on first use, so expect a warm-up tax before the first inference call

Verdict

FlashInfer belongs in your stack if you are building or maintaining an LLM inference engine and want to outsource kernel optimization without locking into a single backend ecosystem. End users running models through vLLM or SGLang are likely already benefiting from it indirectly.

Frequently asked

What is flashinfer-ai/flashinfer? A kernel library and generator for LLM inference. It exists so LLM serving frameworks can stop rewriting the same CUDA kernels for every GPU generation and batching mode.
Is flashinfer open source? Yes. flashinfer-ai/flashinfer is open source, released under the Apache-2.0 license.
What language is flashinfer written in? The repository is listed as primarily Python; the GPU kernels themselves are written in CUDA/C++.
How popular is flashinfer? flashinfer-ai/flashinfer has 6k stars on GitHub and is currently holding steady at about +5.6 stars per day over the last 7 days.
Where can I find flashinfer? flashinfer-ai/flashinfer is on GitHub at https://github.com/flashinfer-ai/flashinfer.