ByteDance's video editor thinks before it renders
A research framework that uses a multimodal LLM to plan video edits semantically, then hands off to a diffusion transformer to actually draw the frames.

What it does
Bernini is a unified video generation and editing system from ByteDance. It splits the work into two stages: an MLLM-based “semantic planner” figures out what should happen in the video, and a DiT-based “renderer” (Bernini-R) actually generates the pixels. The open-sourced piece is the renderer, which handles text-to-image, image editing, text-to-video, video editing, and reference-guided video tasks through a shared pipeline.
The interesting bit
The planner-renderer split is the architectural bet. The planner works in latent semantic space — reasoning about motion, composition, and edits before any expensive diffusion steps — while the renderer inherits from Wan2.2-T2V-A14B and adds trained high-noise/low-noise transformer weights. For video editing specifically, the authors claim first-tier results against closed-source commercial models on a self-built human evaluation arena.
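To make the two-stage idea concrete, here is a minimal sketch of a plan-then-render flow. Everything in it is hypothetical: `EditPlan`, `plan_edit`, `pick_expert`, and `render` are illustrative names, not Bernini's API, and the `boundary` value is invented. It assumes the high-noise and low-noise weights are selected by the current noise level, as in Wan2.2's two-expert design.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EditPlan:
    """Hypothetical container for the planner's output (not Bernini's real format)."""
    task: str
    instructions: List[str]


def plan_edit(task: str, prompt: str) -> EditPlan:
    """Stand-in for the MLLM planner: split a prompt into ordered edit steps."""
    steps = [s.strip() for s in prompt.split(",") if s.strip()]
    return EditPlan(task=task, instructions=steps)


def pick_expert(noise_level: float, boundary: float = 0.5) -> str:
    """Choose which set of transformer weights handles a given noise level.

    `boundary` is an illustrative value, not one taken from the project.
    """
    return "high_noise" if noise_level >= boundary else "low_noise"


def render(plan: EditPlan, num_steps: int = 4) -> List[str]:
    """Stand-in for the DiT renderer: walk noise levels from 1.0 down toward 0.0."""
    log = []
    for i in range(num_steps):
        noise_level = 1.0 - i / num_steps
        expert = pick_expert(noise_level)
        log.append(f"step {i}: {expert} expert, {len(plan.instructions)} instructions")
    return log


plan = plan_edit("v2v", "replace the car with a bicycle, keep the background")
for line in render(plan):
    print(line)
```

The point of the split is visible even in this toy: planning happens once, up front, and every denoising step afterwards only consumes the finished plan.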
Key highlights
- Supports seven task modes through one renderer: t2i, i2i, t2v, v2v, mv2v, rv2v, r2v
- Two weight loading modes: a self-contained diffusers-format bundle (recommended), or separate Wan2.2 base + Bernini-R checkpoints
- Multi-GPU inference via Ulysses sequence parallel (8-way in examples); single-GPU fallback for image tasks
- Optional GPT-based prompt enhancer via OpenAI-compatible API
- Gradio demo included; runs at 480p/16fps by default, with examples up to 720p/24fps
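For readers unfamiliar with Ulysses sequence parallelism, the core move is an all-to-all exchange: each GPU starts with a slice of the sequence for every attention head and ends with the full sequence for a subset of heads, so attention can run locally. The sketch below simulates only that re-partitioning in plain Python. The function names are made up for illustration, no real communication library is used, and it assumes the sequence length and head count both divide evenly by the number of ranks.

```python
from typing import List, Tuple

Token = Tuple[int, int]  # (sequence position, attention head)


def shard_by_sequence(seq_len: int, num_heads: int, world: int) -> List[List[Token]]:
    """Initial layout: each rank holds a contiguous slice of positions, all heads."""
    chunk = seq_len // world  # assumes seq_len % world == 0
    return [
        [(pos, head)
         for pos in range(rank * chunk, (rank + 1) * chunk)
         for head in range(num_heads)]
        for rank in range(world)
    ]


def all_to_all_seq_to_head(
    shards: List[List[Token]], num_heads: int, world: int
) -> List[List[Token]]:
    """Simulated all-to-all: afterwards each rank holds every position for its own heads."""
    heads_per_rank = num_heads // world  # assumes num_heads % world == 0
    out: List[List[Token]] = [[] for _ in range(world)]
    for source_shard in shards:
        for pos, head in source_shard:
            out[head // heads_per_rank].append((pos, head))
    return [sorted(shard) for shard in out]


# 8-way example: 16 positions, 8 heads, 8 simulated ranks.
before = shard_by_sequence(seq_len=16, num_heads=8, world=8)
after = all_to_all_seq_to_head(before, num_heads=8, world=8)
print("rank 3 before:", sorted({pos for pos, _ in before[3]}), "positions, all heads")
print("rank 3 after: ", len({pos for pos, _ in after[3]}), "positions, heads",
      sorted({head for _, head in after[3]}))
```

In a real run a second all-to-all restores the sequence-sharded layout after attention. The per-rank memory for activations stays roughly constant across the exchange, which is what makes long video token sequences fit.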
Caveats
- The semantic planner itself is not open-sourced — only the renderer weights and inference code are available
- Hardware expectations are steep: Hopper GPUs recommended for FlashAttention-3; CUDA 12.4 and Python 3.11.2 are essentially pinned
- The “first tier” video editing claim comes from the authors’ own arena platform, not an independent benchmark
Verdict
Worth a look if you’re doing research in structured video editing or building on Wan2.2 and want a pretrained renderer with broad task coverage. Skip it if you need the full planner-renderer system end-to-end, or if your hardware tops out at an A100 and you want the fastest inference path, since the recommended FlashAttention-3 setup targets Hopper GPUs.