FFP-300K: Scaling First-Frame Propagation for Generalizable Video Editing

Xijie Huang1*, Chengming Xu2*, Donghao Luo2, Xiaobin Hu2, Peng Tang2, Xu Peng2, Jiangning Zhang2, Chengjie Wang2, Yanwei Fu1†
1 Fudan University 2 Tencent Youtu Lab
* Equal Contribution
† Corresponding Author

Abstract

We introduce FFP-300K, a large-scale dataset of 300K high-fidelity video pairs at 720p resolution and 81 frames, constructed via a scalable two-track pipeline that supports both FFP-based and instruction-based video editing. Building on this dataset, we propose a guidance-free FFP framework with Adaptive Spatio-Temporal RoPE (AST-RoPE) and an identity propagation self-distillation objective, which balances first-frame appearance preservation with source video motion consistency. Comprehensive experiments on the EditVerseBench benchmark demonstrate that our method significantly outperforms existing academic and commercial models, achieving gains of about 0.2 in PickScore and 0.3 in VLM score over these competitors.
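The AST-RoPE component adapts rotary position embeddings to video tokens; the paper's adaptive variant is not detailed here, but the underlying idea of factorized spatio-temporal RoPE can be sketched as follows. This is a minimal illustration, assuming a NumPy setting where each token carries (t, h, w) grid positions and the head dimension is split across the three axes; the split sizes and function names are illustrative, not the authors' implementation.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    """Standard RoPE: each position gets dim/2 rotation frequencies."""
    freqs = base ** (-np.arange(0, dim, 2) / dim)   # (dim/2,)
    return np.outer(positions, freqs)               # (n_tokens, dim/2)

def apply_rope(x, angles):
    """Rotate consecutive channel pairs of x by the given angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def st_rope(q, t_idx, h_idx, w_idx, dims=(16, 24, 24)):
    """Factorized spatio-temporal RoPE: split the head dim across the
    temporal (t), height (h), and width (w) axes and rotate each slice
    with that axis's token positions.  `dims` must sum to q.shape[-1]."""
    dt, dh, dw = dims
    q = q.copy()
    q[..., :dt]      = apply_rope(q[..., :dt],      rope_angles(t_idx, dt))
    q[..., dt:dt+dh] = apply_rope(q[..., dt:dt+dh], rope_angles(h_idx, dh))
    q[..., dt+dh:]   = apply_rope(q[..., dt+dh:],   rope_angles(w_idx, dw))
    return q
```

Because each rotation is norm-preserving and depends only on relative offsets, a first-frame token and a later-frame token at the same spatial location differ only in the temporal slice, which is the kind of disentanglement a first-frame propagation model can exploit.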

FFP-300K Data Construction Demo

Comparison in EditVerseBench

Landscape-oriented Video Editing

Portrait-oriented Video Editing