The 10-gig horse: Stable Diffusion's original research release
The CompVis reference implementation that proved latent diffusion could run on consumer GPUs.

What it does
Stable Diffusion v1 generates 512×512 images from text prompts using a latent diffusion model: an 860M-parameter UNet and a frozen CLIP ViT-L/14 text encoder, trained on LAION-5B subsets. The repo provides reference sampling scripts (txt2img.py, img2img.py) and links to four progressively refined checkpoints (v1-1 through v1-4).
The interesting bit
The “latent” part is the trick: diffusion runs in a compressed latent space, downsampled 8× along each spatial dimension, rather than on raw pixels. That is a large part of why a model of this quality runs on a GPU with about 10 GB of VRAM instead of a server farm. The authors explicitly call the weights “research artifacts” and ship them under a use-restricted OpenRAIL license, and the reference scripts include a safety checker and invisible watermarking. That is unusual candor about misuse risks in a release this popular.
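To put a number on the compression, here is a back-of-the-envelope sketch. It assumes the v1 setup of a downsampling factor of 8 and 4 latent channels; the helper names are made up for illustration and do not come from the repo. It only counts the tensor being denoised, not weights or activations, so read it as an illustration rather than a memory estimate.

```python
# Rough size comparison between pixel space and SD v1 latent space.
# Assumptions: downsampling factor 8 per spatial dimension, 4 latent channels.
# latent_shape and element_count are hypothetical helpers, not repo functions.

def latent_shape(height, width, factor=8, latent_channels=4):
    """Return the (channels, height, width) shape of the latent tensor."""
    if height % factor or width % factor:
        raise ValueError("height and width must be divisible by the factor")
    return (latent_channels, height // factor, width // factor)


def element_count(shape):
    """Multiply the dimensions of a shape tuple together."""
    total = 1
    for dim in shape:
        total *= dim
    return total


pixel_shape = (3, 512, 512)           # RGB image at the v1 target resolution
z_shape = latent_shape(512, 512)      # (4, 64, 64)
ratio = element_count(pixel_shape) / element_count(z_shape)

print("latent shape:", z_shape)
print("values per image, pixels vs latents:", ratio)  # 48.0
```

Every denoising step therefore touches roughly 48 times fewer values than a pixel-space model working at the same resolution.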
Key highlights
- Four published checkpoints with documented training curricula (256×256 pretraining followed by 512×512 finetuning, aesthetic-score filtering, and 10% text-conditioning dropout to improve classifier-free guidance sampling)
- Reference scripts include PLMS sampler, safety checker, and invisible watermarking
- Hugging Face diffusers integration provided as the preferred community path
- img2img.py supports SDEdit-style translation and upscaling via noise strength
- Builds on OpenAI’s ADM codebase and lucidrains’ diffusion implementations
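The “noise strength” knob in img2img is easier to reason about with numbers. The sketch below assumes, based on my reading of the reference script, that strength is restricted to [0, 1] and is multiplied by the sampler step count to decide how far the input image is pushed into noise before being denoised again. The function name is hypothetical and the defaults (50 steps, strength 0.75) are assumptions, so check img2img.py before relying on the exact arithmetic.

```python
# SDEdit-style strength: how many sampler steps of noise get applied to the
# input image before denoising starts. This is a sketch of the idea, not the
# repo's code; encode_steps is a hypothetical helper.

def encode_steps(strength, sampler_steps=50):
    """Map a strength in [0, 1] to the number of noising/denoising steps."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0.0, 1.0]")
    # Truncate toward zero: strength 0.75 with 50 steps gives 37.
    return int(strength * sampler_steps)


for s in (0.25, 0.5, 0.75, 1.0):
    steps = encode_steps(s)
    print(f"strength={s:.2f} -> noise and denoise over {steps} of 50 steps")
```

Low strength keeps the output close to the input image, while a strength of 1.0 discards almost all of it and behaves much like plain text-to-image.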
Caveats
- The README warns against commercial deployment without additional safety mechanisms
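- EMA-only vs full checkpoints have a footgun: the inference config sets use_ema=False because it is designed for EMA-only checkpoints, so loading a “full” checkpoint with the same config uses the non-EMA weights
- Environment setup is conda-centric with pinned dependency versions (transformers==4.19.2)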
Verdict
Worth studying if you want to understand how latent diffusion actually works under the hood, or need the original checkpoints for reproducibility research. Most practitioners are better served by the Hugging Face diffusers pipeline; this repo is a paper reference, not a product.
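For the diffusers route, a minimal sketch might look like the following. It assumes the diffusers and torch packages are installed, that you have accepted the model license on the Hugging Face Hub, and that the v1-4 weights live under the model id shown. The generate wrapper and its default values are my own illustration, not part of either project.

```python
# Minimal text-to-image sketch using the Hugging Face diffusers pipeline.
# Assumptions: diffusers and torch are installed, and the model id below is
# the v1-4 checkpoint on the Hub. generate() is a hypothetical wrapper.

def generate(prompt, model_id="CompVis/stable-diffusion-v1-4",
             steps=50, guidance_scale=7.5):
    """Load the pipeline and return one PIL image for the prompt."""
    # Imported lazily so this file can be read and imported without the
    # heavy dependencies present.
    import torch
    from diffusers import StableDiffusionPipeline

    device = "cuda" if torch.cuda.is_available() else "cpu"
    # Half precision roughly halves weight memory on GPU; CPU stays in float32.
    dtype = torch.float16 if device == "cuda" else torch.float32

    pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=dtype)
    pipe = pipe.to(device)

    result = pipe(prompt, num_inference_steps=steps,
                  guidance_scale=guidance_scale)
    return result.images[0]


# Example usage (downloads several GB of weights on first run):
# image = generate("a photograph of an astronaut riding a horse")
# image.save("astronaut.png")
```

Reloading the pipeline on every call is wasteful; in real use you would build it once and reuse it across prompts.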