ByteDance's 3B-parameter do-it-all visual model
A single small model that generates, edits, and understands images and video—no separate pipelines required.

What it does
Lance is a 3B-active-parameter unified multimodal model from ByteDance that handles image and video understanding, generation, and editing in one framework. You can prompt it for text-to-image, text-to-video, image-to-video, plus editing tasks for both modalities, or ask it to describe what it sees.
The interesting bit
The “native unified” claim is the hook: instead of gluing together a diffusion model, an LLM, and an editor, Lance is trained from scratch on all these tasks simultaneously using a staged multi-task recipe. The authors explicitly call it a “research artifact”—they trained on up to 128 A100s, capped at 768×768 images and 480p/12 FPS video, and want the community to stress-test whether this synergy actually works at small scale.
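To put those training caps in concrete terms, here is a minimal sketch of the arithmetic. The 768×768 image cap, 480p height, and 12 FPS come from the text above; the 16:9 aspect ratio and the helper names are assumptions made for illustration, not anything the README specifies.

```python
# Training caps quoted in the write-up.
MAX_IMAGE_SIDE = 768   # images capped at 768x768
VIDEO_HEIGHT = 480     # "480p"
VIDEO_FPS = 12

# Assumption: 16:9 video. The source only says "480p".
ASSUMED_ASPECT = (16, 9)


def frames_for(seconds: float, fps: int = VIDEO_FPS) -> int:
    """Number of frames in a clip of the given duration."""
    return round(seconds * fps)


def video_width(height: int = VIDEO_HEIGHT, aspect=ASSUMED_ASPECT) -> int:
    """Width for a given height, rounded to the nearest even pixel count."""
    aw, ah = aspect
    return 2 * round(height * aw / (2 * ah))


def pixels_per_clip(seconds: float) -> int:
    """Total pixels across all frames of a clip at the assumed resolution."""
    return frames_for(seconds) * video_width() * VIDEO_HEIGHT


if __name__ == "__main__":
    print("image pixels:", MAX_IMAGE_SIDE ** 2)
    print("5 s clip frames:", frames_for(5))
    print("5 s clip pixels:", pixels_per_clip(5))
```

Under these assumptions a five-second clip is only 60 frames, which gives a sense of how modest the video budget is.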
Key highlights
- 3B active parameters, competitive on image generation, editing, and video generation benchmarks (per the README’s claim; no specific numbers shown)
- Supports 7 task types: t2i, t2v, i2v, image_edit, video_edit, x2t_image, x2t_video
- Now runs in vLLM-Omni for faster inference; Gradio demo and HuggingFace Space available
- Requires 40GB+ VRAM for inference (A100 territory)
- Fine-tuning code not yet released
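As a quick way to read the seven task names and the VRAM requirement, the sketch below maps each name to an output modality and checks a GPU against the 40GB figure. The task names and the 40GB threshold come from the highlights above; the input/output modalities are inferred from the names, and `TASKS`, `describe_task`, and `fits_on_gpu` are hypothetical helpers, not part of Lance's actual interface.

```python
# Task names as listed in the highlights. The modalities are inferred
# from the names and are an assumption, not taken from the README.
TASKS = {
    "t2i": ("text", "image"),
    "t2v": ("text", "video"),
    "i2v": ("image", "video"),
    "image_edit": ("image", "image"),
    "video_edit": ("video", "video"),
    "x2t_image": ("image", "text"),
    "x2t_video": ("video", "text"),
}

MIN_VRAM_GB = 40  # "Requires 40GB+ VRAM for inference"


def describe_task(name: str) -> str:
    """Return a short 'input -> output' description for a task name."""
    if name not in TASKS:
        raise ValueError(f"unknown task: {name!r}")
    source, target = TASKS[name]
    return f"{name}: {source} -> {target}"


def fits_on_gpu(vram_gb: float) -> bool:
    """True if a GPU with this much VRAM meets the stated minimum."""
    return vram_gb >= MIN_VRAM_GB


if __name__ == "__main__":
    for task in TASKS:
        print(describe_task(task))
    print("24 GB card OK?", fits_on_gpu(24))
    print("80 GB card OK?", fits_on_gpu(80))
```

The generation tasks produce images or video, while the two `x2t_*` tasks cover understanding by returning text.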
Caveats
- Output quality “may vary across prompts, resolutions, duration, motion complexity, and editing scenarios”—the authors’ own warning
- Flash Attention compilation can be finicky; README points to third-party wheels “for reference only”
- Trained up to 480p video; don’t expect cinema-grade generation
Verdict
Worth a spin if you’re researching unified multimodal architectures or need a single model that covers multiple visual tasks without model-swapping. Skip it if you need production polish or higher resolutions, or if your GPU has less than 40GB of VRAM.