Is x-stable-diffusion open source?

Yes — stochasticai/x-stable-diffusion is open source, released under the Apache-2.0 license.

What language is x-stable-diffusion written in?

stochasticai/x-stable-diffusion is primarily written in Jupyter Notebook.

How popular is x-stable-diffusion?

stochasticai/x-stable-diffusion has 557 stars on GitHub.

Where can I find x-stable-diffusion?

stochasticai/x-stable-diffusion is on GitHub at https://github.com/stochasticai/x-stable-diffusion.

stochasticai/x-stable-diffusion

A bake-off of every Stable Diffusion speed trick

It exists so you don't have to compile AITemplate, TensorRT, and FlashAttention yourself just to find the fastest Stable Diffusion pipeline on your GPU.

★ 557 stars | Jupyter Notebook | Inference · Serving | Image · Video · Audio

What it does

x-stable-diffusion gathers six Stable Diffusion inference paths (AITemplate, TensorRT, nvFuser, FlashAttention, ONNX, and a vanilla PyTorch baseline) into a single benchmarking suite aimed at NVIDIA GPUs. It publishes latency, VRAM, and batch-size tables for A100 and T4 GPUs alongside generated sample images, so it works as a collection of reference implementations with numbers attached. A CLI wrapper called stochasticx is provided for local deployment, though the repo’s real payload is the comparative data.

The interesting bit

The README treats model compilation like a horse race, but keeps the commentary honest: AITemplate wins on A100 latency and batch scalability, while TensorRT hits out-of-memory errors when batching and AITemplate may not even run on a T4. Reporting the failures next to the wins makes the tables easier to trust.

Key highlights

On an A100, AITemplate delivers the lowest single-image latency at 1.38 s and scales cleanly to batch size 24, using just 4.83 GB of VRAM for one image.
The advertised 0.88 s “real-time” figure requires cutting num_inference_steps to 30 and keeping max_seq_length at 64, a different configuration from the 50-step benchmark tables.
TensorRT reaches 1.68 s on a single A100 image but fails to convert the UNet under batching due to memory issues, making it effectively single-image only here.
FlashAttention roughly halves latency versus baseline PyTorch (2.80 s vs. 5.77 s) while dropping VRAM from 10.3 GB to 7.5 GB.
Each optimizer includes a dedicated Colab notebook and manual deployment README, so the repo doubles as a cookbook.
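
To put the quoted latencies on one scale, the short Python sketch below computes each backend's speedup over the PyTorch baseline. It uses only the figures quoted in the highlights above and assumes the FlashAttention and PyTorch numbers are single-image A100 measurements like the others, which this summary does not state explicitly. The names A100_RESULTS and speedup_vs_baseline are illustrative and are not part of the repository.

```python
# Single-image latencies (seconds) and VRAM (GB) quoted in the highlights.
# Assumption: all rows are A100, 50-step measurements. nvFuser and ONNX are
# omitted because this summary quotes no numbers for them.
A100_RESULTS = {
    "PyTorch (baseline)": {"latency_s": 5.77, "vram_gb": 10.3},
    "FlashAttention": {"latency_s": 2.80, "vram_gb": 7.5},
    "TensorRT": {"latency_s": 1.68, "vram_gb": None},  # VRAM not quoted here
    "AITemplate": {"latency_s": 1.38, "vram_gb": 4.83},
}


def speedup_vs_baseline(results, baseline="PyTorch (baseline)"):
    """Return {backend: baseline latency / backend latency}."""
    base = results[baseline]["latency_s"]
    return {name: base / row["latency_s"] for name, row in results.items()}


if __name__ == "__main__":
    speedups = speedup_vs_baseline(A100_RESULTS)
    # Print the slowest backend first, the fastest last.
    for name, factor in sorted(speedups.items(), key=lambda kv: kv[1]):
        print(f"{name:<20} {factor:.2f}x")
```

On these figures AITemplate comes out a little over 4x faster than the baseline and FlashAttention about 2x, which matches the ordering described above.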

Caveats

AITemplate may not support T4 GPUs yet, leaving TensorRT as the fastest tested option on that card.
TensorRT’s batching limitations and ONNX’s repeated OOMs suggest these paths are currently best for single-image A100 workloads.
The 0.88 s claim and the main benchmark tables rely on different step counts, so comparing them requires reading the fine print.
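
As a rough illustration of why the step count matters, the Python sketch below rescales the 50-step AITemplate figure to 30 steps. It assumes latency grows linearly with num_inference_steps, which ignores fixed costs such as text encoding and image decoding as well as the max_seq_length setting, and it assumes the 0.88 s headline and the 1.38 s table entry describe the same backend and GPU, which this summary does not confirm. The helper scale_latency is hypothetical.

```python
def scale_latency(latency_s, steps_measured, steps_target):
    """Rescale a latency to another step count, assuming linear scaling.

    This is a simplification: fixed per-image overhead is treated as zero.
    """
    if steps_measured <= 0 or steps_target <= 0:
        raise ValueError("step counts must be positive")
    return latency_s * steps_target / steps_measured


# 50-step table figure (1.38 s) rescaled to the 30 steps behind the headline.
table_at_30_steps = scale_latency(1.38, steps_measured=50, steps_target=30)

# Per-step cost implied by each figure.
headline_per_step = 0.88 / 30
table_per_step = 1.38 / 50

if __name__ == "__main__":
    print(f"table figure rescaled to 30 steps: {table_at_30_steps:.3f} s")
    print(f"headline per step: {headline_per_step:.4f} s")
    print(f"table per step:    {table_per_step:.4f} s")
```

Under these assumptions the two figures land in the same range once the step count is equalised, so most of the gap between 0.88 s and 1.38 s is explained by running fewer steps.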

Verdict

Useful if you are shopping for an A100 inference engine and want hard latency numbers before committing to a compiler. Less useful if you need robust T4 support or deep analysis of image-quality trade-offs, since the README stays firmly in the speed-and-memory lane.
