EzioBy/Ditto
Ditto is a framework for generating high-quality synthetic video editing data to train an instruction-based video editing model called Editto.

The project introduces a scalable data generation pipeline that pairs a leading image editor with an in-context video generator to address the scarcity of training data for video editing. To balance cost and quality, the pipeline uses a distilled model architecture augmented with a temporal enhancer. An intelligent agent writes diverse editing instructions and filters the generated outputs, keeping quality high at scale. The resulting Ditto-1M dataset contains one million high-fidelity video editing examples and is used to train the Editto model, which reaches state-of-the-art performance in instruction-based video editing.
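
The flow described above (an agent proposes an instruction, an image editor and a video generator produce the edit, and the agent filters the result) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual code: every name here (`EditSample`, `build_dataset`, the four callables, `min_score`) is hypothetical, and the choice to edit the first frame as a reference for the video generator is an assumption made only to keep the sketch concrete.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Sequence

# Placeholder types: real frames would be image arrays, not strings.
Frame = str
Video = Sequence[Frame]


@dataclass
class EditSample:
    """One (source, instruction, edited) triple kept by the filter."""
    source: Video
    instruction: str
    edited: Video
    score: float


def build_dataset(
    videos: Iterable[Video],
    propose_instruction: Callable[[Video], str],          # hypothetical agent step
    edit_frame: Callable[[Frame, str], Frame],            # hypothetical image editor
    generate_video: Callable[[Video, Frame, str], Video], # hypothetical video generator
    score_edit: Callable[[Video, Video, str], float],     # hypothetical agent filter
    min_score: float = 0.5,                               # assumed threshold
) -> List[EditSample]:
    """Run each source video through the pipeline and keep edits that pass the filter."""
    kept: List[EditSample] = []
    for video in videos:
        if not video:
            continue  # skip empty clips
        instruction = propose_instruction(video)
        # Assumption: the image editor produces an edited reference frame.
        reference = edit_frame(video[0], instruction)
        # The video generator conditions on the source video and the reference.
        edited = generate_video(video, reference, instruction)
        score = score_edit(video, edited, instruction)
        if score >= min_score:
            kept.append(EditSample(video, instruction, edited, score))
    return kept
```

Passing the four stages as callables keeps the sketch independent of any particular model: each stage can be swapped for a real editor, generator, or scoring model without changing the loop.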