bytedance/Bernini

ByteDance's video editor thinks before it renders

A research framework that uses a multimodal LLM to plan video edits semantically, then hands off to a diffusion transformer to actually draw the frames.

Velocity (7d): +56 ★/day · Trend: steady

What it does

Bernini is a unified video generation and editing system from ByteDance. It splits the work into two stages: an MLLM-based “semantic planner” figures out what should happen in the video, and a DiT-based “renderer” (Bernini-R) actually generates the pixels. The open-sourced piece is the renderer, which handles text-to-image, image editing, text-to-video, video editing, and reference-guided video tasks through a shared pipeline.

The interesting bit

The planner-renderer split is the architectural bet. The planner works in a latent semantic space, reasoning about motion, composition, and edits before any expensive diffusion steps run. The renderer builds on Wan2.2-T2V-A14B and adds trained high-noise and low-noise transformer weights. For video editing specifically, the authors claim first-tier results against closed-source commercial models in a human-evaluation arena they built themselves.
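To make the split concrete, here is a minimal sketch of a two-stage pipeline in which a cheap planning step produces a structured plan and the expensive step only ever sees that plan. Everything in it is hypothetical: `EditPlan`, `plan_edit`, `render`, and `edit_video` are illustrative stand-ins, not Bernini's API, and the stubs return strings where the real system would use an MLLM and a diffusion transformer.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EditPlan:
    """Hypothetical stand-in for the planner's semantic output."""
    instruction: str
    steps: List[str]


def plan_edit(instruction: str) -> EditPlan:
    """Stage 1 (stub): cheap semantic reasoning, no pixels involved."""
    steps = [part.strip() for part in instruction.split(" and ") if part.strip()]
    return EditPlan(instruction=instruction, steps=steps)


def render(plan: EditPlan, num_frames: int) -> List[str]:
    """Stage 2 (stub): the expensive step, conditioned only on the plan."""
    summary = "; ".join(plan.steps)
    return [f"frame {i}: {summary}" for i in range(num_frames)]


def edit_video(
    instruction: str,
    num_frames: int,
    planner: Callable[[str], EditPlan] = plan_edit,
    renderer: Callable[[EditPlan, int], List[str]] = render,
) -> List[str]:
    """Run the planner first, then hand its plan to the renderer."""
    return renderer(planner(instruction), num_frames)


frames = edit_video("remove the car and add rain", num_frames=3)
```

Because the two stages meet only at the plan, either one can be swapped independently. That is the property that lets a renderer ship on its own, as long as something else supplies its conditioning.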

Key highlights

  • Supports seven task types through one renderer: t2i, i2i, t2v, v2v, mv2v, rv2v, r2v
  • Two weight loading modes: a self-contained diffusers-format bundle (recommended), or separate Wan2.2 base + Bernini-R checkpoints
  • Multi-GPU inference via Ulysses sequence parallel (8-way in examples); single-GPU fallback for image tasks
  • Optional GPT-based prompt enhancer via OpenAI-compatible API
  • Gradio demo included; runs at 480p/16fps by default, with examples up to 720p/24fps
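The 8-way Ulysses sequence parallelism in the highlights can be pictured with a toy, single-process simulation. This is a sketch of the general Ulysses layout idea, not Bernini's code. The function names are hypothetical, activations are replaced by `(token, head)` labels, and the two regrouping functions stand in for the all-to-all collectives a real implementation would run across GPUs.

```python
# Toy simulation of the Ulysses layout change. No GPUs, no torch.distributed:
# each "activation" is just a (token_index, head_index) label.

def shard_by_sequence(seq_len, num_heads, world_size):
    """Initial layout: rank r owns a contiguous slice of tokens, all heads."""
    assert seq_len % world_size == 0
    chunk = seq_len // world_size
    return [
        [(t, h) for t in range(r * chunk, (r + 1) * chunk) for h in range(num_heads)]
        for r in range(world_size)
    ]


def all_to_all_seq_to_head(shards, num_heads):
    """Before attention: regroup so rank r owns every token for its heads only."""
    world_size = len(shards)
    assert num_heads % world_size == 0
    heads_per_rank = num_heads // world_size
    out = [[] for _ in range(world_size)]
    for src in shards:
        for (t, h) in src:
            out[h // heads_per_rank].append((t, h))  # send to the head's owner
    return [sorted(s) for s in out]


def all_to_all_head_to_seq(shards, seq_len):
    """After attention: regroup back to the sequence-sharded layout."""
    world_size = len(shards)
    chunk = seq_len // world_size
    out = [[] for _ in range(world_size)]
    for src in shards:
        for (t, h) in src:
            out[t // chunk].append((t, h))  # send to the token's owner
    return [sorted(s) for s in out]


seq_sharded = shard_by_sequence(seq_len=16, num_heads=8, world_size=8)
head_sharded = all_to_all_seq_to_head(seq_sharded, num_heads=8)
restored = all_to_all_head_to_seq(head_sharded, seq_len=16)
```

One consequence visible in the sketch: the head count must be divisible by the parallel degree, because each rank takes a whole slice of heads while attention runs over the full sequence.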

Caveats

  • The semantic planner itself is not open-sourced — only the renderer weights and inference code are available
  • Hardware expectations are steep: Hopper GPUs recommended for FlashAttention-3; CUDA 12.4 and Python 3.11.2 are essentially pinned
  • The “first tier” video editing claim comes from the authors’ own arena platform, not an independent benchmark
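The hardware caveat can be turned into a quick preflight check. The sketch below is an assumption-laden illustration rather than anything shipped with the repo: `preflight` is a hypothetical helper, and it relies on the fact that Hopper GPUs report CUDA compute capability 9.x while an A100 reports 8.0.

```python
def preflight(python_version, cuda_version, compute_capability):
    """Return human-readable warnings; an empty list means the setup matches.

    With PyTorch installed, compute_capability would typically come from
    torch.cuda.get_device_capability() and cuda_version from torch.version.cuda.
    """
    warnings = []
    if tuple(python_version[:3]) != (3, 11, 2):
        warnings.append("Python differs from 3.11.2")
    if tuple(cuda_version[:2]) != (12, 4):
        warnings.append("CUDA differs from 12.4")
    if compute_capability[0] != 9:
        warnings.append("GPU is not Hopper; FlashAttention-3 is likely unavailable")
    return warnings


matching = preflight((3, 11, 2), (12, 4), (9, 0))   # Hopper-class setup
a100_box = preflight((3, 10, 0), (11, 8), (8, 0))   # older A100 machine
```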

Verdict

Worth a look if you’re doing research in structured video editing or building on Wan2.2 and want a pretrained renderer with broad task coverage. Skip it if you need the full planner-renderer system end-to-end, or if your hardware tops out at an A100 and you want the fastest inference path.
