Is DiT-Extrapolation open source?

Yes — thu-ml/DiT-Extrapolation is open source, released under the Apache-2.0 license.

What language is DiT-Extrapolation written in?

thu-ml/DiT-Extrapolation is primarily written in Python.

How popular is DiT-Extrapolation?

thu-ml/DiT-Extrapolation has 820 stars on GitHub.

Where can I find DiT-Extrapolation?

thu-ml/DiT-Extrapolation is on GitHub at https://github.com/thu-ml/DiT-Extrapolation.

thu-ml/DiT-Extrapolation

A Single-Line RoPE Tweak for Longer AI Videos

It tweaks the positional encoding of off-the-shelf video diffusion transformers so they can generate longer clips without retraining or heavy fine-tuning.

★ 820 stars · Python · Image · Video · Audio · Inference · Serving

What it does

RIFLEx, UltraViCo, and UltraImage are a family of plug-and-play methods that stretch the temporal or spatial limits of pre-trained diffusion transformers. The core trick is a surgical edit to the 1D Rotary Position Embedding (RoPE): the repository shows how to identify an intrinsic frequency in a pre-trained model and clamp it so the sinusoid stays within one period during extrapolation. For HunyuanVideo, this turns 5-second clips into 11-second ones; for CogVideoX-5B, 6 seconds become 12. The codebase spans multiple branches covering video (HunyuanVideo, CogVideoX, Wan2.1) and image (Flux, Qwen-Image) models.

The interesting bit

The authors bill RIFLEx as a “free lunch,” and the README highlights the change as a single-line modification inside the standard RoPE function. That line, which pins one frequency component to 0.9 * 2 * torch.pi / L_test, keeps the attention mechanism from drifting out of distribution when the frame count grows: the lowered frequency completes only 90% of one period across the L_test frames generated at inference, so its positional signal does not wrap around within the clip. It is a rare case where a position-embedding band-aid generalizes to production-grade models without architectural surgery.
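To make the one-line change concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the repository's code: the function name rope_freqs_riflex, the 0-based k_index argument, the base of 10000, and the toy sizes (dim=16, L_test=66) are all hypothetical. Only the expression 0.9 * 2 * pi / L_test is taken from the README, where it is written with torch.pi.

```python
import math

def rope_freqs_riflex(dim, L_test=None, k_index=None, base=10000.0):
    """Standard 1D RoPE frequencies, with one component optionally lowered.

    dim     -- rotary dimension (even); yields dim // 2 frequencies
    L_test  -- number of positions (frames) generated at inference
    k_index -- 0-based index of the intrinsic frequency (hypothetical name)
    """
    # Standard RoPE: theta_j = base^(-2j/dim) for j = 0 .. dim/2 - 1
    freqs = [base ** (-2.0 * j / dim) for j in range(dim // 2)]
    if k_index is not None and L_test is not None:
        # The one-line tweak: L_test positions now span 90% of one period
        freqs[k_index] = 0.9 * 2 * math.pi / L_test
    return freqs

# Toy values chosen for illustration only
freqs = rope_freqs_riflex(dim=16, L_test=66, k_index=3)
phase_at_end = freqs[3] * 66  # rotation accumulated over all L_test positions
print(phase_at_end < 2 * math.pi)  # the component never completes a full period
```

All other frequencies are left untouched, which is why the change can be dropped into an existing RoPE function without altering the rest of the model.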

Key highlights

Supports training-free inference and optional fine-tuned checkpoints (HunyuanVideo-RIFLEx, CogVideoX-RIFLEx).
Branches cover Wan2.1, HunyuanVideo, Flux, and Qwen-Image, so the same idea applies to both video and high-resolution image generation.
Multi-GPU inference branch exists for reproducing the project-page demos at full fidelity.
The modification applies to any diffusion transformer that uses RoPE; the repository provides a utility to find the intrinsic frequency index k.
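As a rough illustration of how an intrinsic frequency index might be located, the sketch below picks the RoPE component whose period lies closest to the length at which repetition is observed. The selection rule, the function name find_intrinsic_index, its arguments, and the toy values are assumptions made for this example; consult the repository's own utility for the actual procedure.

```python
import math

def find_intrinsic_index(dim, n_repeat, base=10000.0):
    """Return (k, period) for the component whose period is closest to n_repeat.

    Hypothetical helper: the name, the arguments, and the closest-period
    rule are assumptions for illustration. k is 1-based.
    """
    periods = []
    for j in range(1, dim // 2 + 1):
        theta_j = base ** (-2.0 * (j - 1) / dim)   # standard RoPE frequency
        periods.append(2 * math.pi / theta_j)      # positions per full rotation
    diffs = [abs(p - n_repeat) for p in periods]
    k = diffs.index(min(diffs)) + 1                # convert to 1-based index
    return k, periods[k - 1]

# Toy values: rotary dimension 16, repetition first seen around position 200
k, period = find_intrinsic_index(dim=16, n_repeat=200)
print(k, round(period, 1))
```

Because k is 1-based in this sketch, the matching entry in a Python list or tensor of frequencies sits at index k - 1.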

Caveats

The README notes that a few videos may show repetition in tail frames, so the conservative 0.9 multiplier deliberately leaves some headroom below a full period.
Single-GPU Diffusers inference uses BitsAndBytesConfig quantization, which the authors warn can affect output quality compared with the multi-GPU reference.

Verdict

Anyone running RoPE-based video or image diffusion models who needs longer or higher-resolution output without training from scratch should look here. If your model does not use RoPE, this repository is irrelevant.
