A 22B-parameter video studio that fits in your GPU (barely)
Lightricks open-sources the full inference stack and LoRA trainer for their DiT-based audio-video model, complete with camera-control LoRAs and HDR output pipelines.

What it does LTX-2 is a 22-billion-parameter diffusion transformer that generates synchronized audio and video from text, images, or existing clips. This repository provides the official Python inference code and LoRA training tools, plus ten pre-built pipelines ranging from quick single-stage prototyping to two-stage production-quality generation with spatial upsampling.
The interesting bit The model ships with a small arsenal of task-specific LoRAs—camera dolly left/right, jib up/down, pose control, motion tracking, even lip-dubbing with speaker identity matching. There’s also an HDR pipeline that outputs linear float frames via LogC3 inverse decode, meant for EXR export and professional tonemapping. Lightricks essentially bundled a post house’s worth of tooling into one checkpoint.
Key highlights
- Ten pipelines: text/image-to-video, audio-to-video, keyframe interpolation, video retakes, HDR output, lip dubbing
- FP8 quantization support (
fp8-castfor bf16 checkpoints,fp8-scaled-mmfor TensorRT-LLM on Hopper) - FlashAttention 4 for datacenter Blackwell (B200), xFormers for other CUDA GPUs
- Distilled variant runs inference in 8+4 steps; gradient estimation can cut steps from 40 to 20-30
- Automatic prompt enhancement built into all pipelines
- ComfyUI integration available via separate repo
Caveats
- Requires multiple large safetensors downloads (22B base model, spatial upscaler, distilled LoRA, Gemma 3 text encoder)
- Temporal upscaler is “supported by the model” but not yet required by any pipeline—future work
- FlashAttention 4 install is pinned to a specific beta (
4.0.0b9) against torch 2.9.1+cu128; newer betas break on consumer Blackwell
Verdict Worth a look if you’re doing local video generation research or need fine-grained camera/pose control without building pipelines from scratch. Skip it if you’re hoping for a lightweight, single-file demo—this is a production-weight stack with production-weight hardware requirements.