Reality is a dance between rigid constraints and deformable structures. For video models, that means generating motion that preserves fidelity as well as structure. Despite progress in diffusion models, producing realistic, structure-preserving motion remains challenging, especially for articulated and deformable objects such as humans and animals. So far, scaling training data alone has failed to eliminate physically implausible transitions. Existing approaches rely on conditioning with noisy motion representations, such as optical flow or skeletons extracted by an external, imperfect model. To address these challenges, we introduce an algorithm that distills structure-preserving motion priors from an autoregressive video tracking model (SAM2) into a bidirectional video diffusion model (CogVideoX). With our method, we train SAM2VideoX, which contains two innovations: (1) a bidirectional feature fusion module that extracts global structure-preserving motion priors from a recurrent model like SAM2; and (2) a Local Gram Flow loss that aligns how local features move together. Experiments on VBench and in human studies show that SAM2VideoX delivers consistent gains over prior baselines (+2.60% on VBench, 21-22% lower FVD, and 71.4% human preference). Specifically, on VBench we achieve 95.51%, surpassing REPA (92.91%) by 2.60%, and we reduce FVD to 360.57, a 21.20% and 22.46% improvement over REPA and LoRA finetuning, respectively.
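The Local Gram Flow loss is only described at a high level here, so the following is a minimal, hypothetical PyTorch sketch of one way such a loss could be instantiated: compute Gram matrices of features inside local spatial windows, take their frame-to-frame change as a proxy for how local features move together, and match that statistic between student and teacher. The window size, the normalization, and the temporal-difference definition of the "flow" are assumptions, not the authors' released implementation.

# Hypothetical sketch of a "Local Gram Flow"-style loss (window size, normalization,
# and the temporal-difference form of the "flow" are assumptions, not the paper's code).
import torch
import torch.nn.functional as F

def local_gram(feats: torch.Tensor, window: int = 4) -> torch.Tensor:
    """feats: (B, T, C, H, W) -> local Gram matrices (B, T, n_windows, C, C)."""
    B, T, C, H, W = feats.shape          # assumes H and W are divisible by `window`
    x = feats.reshape(B * T, C, H, W)
    # Split each frame into non-overlapping window x window patches.
    patches = F.unfold(x, kernel_size=window, stride=window)        # (B*T, C*w*w, n_windows)
    n_windows = patches.shape[-1]
    patches = patches.reshape(B * T, C, window * window, n_windows)
    patches = patches.permute(0, 3, 1, 2)                           # (B*T, n_windows, C, w*w)
    # Channel-by-channel correlations within each local window.
    gram = patches @ patches.transpose(-1, -2) / (window * window)  # (B*T, n_windows, C, C)
    return gram.reshape(B, T, n_windows, C, C)

def local_gram_flow_loss(student: torch.Tensor, teacher: torch.Tensor) -> torch.Tensor:
    """Align the frame-to-frame change of local Gram matrices (the 'flow')."""
    gram_s, gram_t = local_gram(student), local_gram(teacher)
    flow_s = gram_s[:, 1:] - gram_s[:, :-1]   # how local feature correlations change over time
    flow_t = gram_t[:, 1:] - gram_t[:, :-1]
    return F.mse_loss(flow_s, flow_t.detach())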
Visualization: Raw Input Video vs. SAM2 Feature PCA.
"SAM2's internal representations are dense, continuous, and temporally consistent. They capture object motion and part-level dynamics, providing rich structural priors that standard diffusion models lack."

Method overview. The framework consists of two parallel branches. (Top) The Motion Prior Extraction branch extracts forward and backward memory features from SAM2 given a clean video and fuses them into a bidirectional teacher representation. (Bottom) The Video Generation Backbone takes noisy latents as input, and its intermediate DiT features Fdiff are projected into the SAM2 feature space as F̂diff. The proposed Local Gram Flow loss (Lfeat) then aligns the spatio-temporal structure of the projected student features with the teacher priors.
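To make the two-branch setup concrete, here is a hedged sketch of a single training step under the description above. The SAM2 feature extractor, the DiT backbone, and the projection head are stubbed out as callables, and the 1x1-convolution fusion module, the loss weighting, and all tensor shapes are assumptions about the design rather than the released implementation.

# Hedged sketch of one training step for the two-branch setup. `sam2_features`, `dit`,
# and `project` are stand-in callables; shapes and the fusion design are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BidirectionalFusion(nn.Module):
    """Fuse SAM2 memory features from a forward pass and a time-reversed pass."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Conv3d(2 * dim, dim, kernel_size=1)

    def forward(self, fwd: torch.Tensor, bwd: torch.Tensor) -> torch.Tensor:
        # fwd, bwd: (B, T, C, H, W); bwd was computed on the time-reversed clip.
        bwd = torch.flip(bwd, dims=[1])                 # re-align to forward time order
        x = torch.cat([fwd, bwd], dim=2)                # concatenate along channels
        x = x.permute(0, 2, 1, 3, 4)                    # (B, 2C, T, H, W) for Conv3d
        return self.proj(x).permute(0, 2, 1, 3, 4)      # back to (B, T, C, H, W)

def training_step(sam2_features, dit, project, fusion, gram_flow_loss,
                  clean_video, noisy_latents, timesteps, target, lambda_feat=0.5):
    # Teacher branch: SAM2 memory features from the clean clip, forward and time-reversed.
    with torch.no_grad():
        f_fwd = sam2_features(clean_video)                           # (B, T, C, H, W)
        f_bwd = sam2_features(torch.flip(clean_video, dims=[1]))
    teacher = fusion(f_fwd, f_bwd)

    # Student branch: denoising prediction plus a tap on intermediate DiT features.
    pred, f_diff = dit(noisy_latents, timesteps, return_features=True)
    f_hat = project(f_diff)                 # project DiT features into the SAM2 feature space

    loss_denoise = F.mse_loss(pred, target)              # standard diffusion training objective
    loss_feat = gram_flow_loss(f_hat, teacher)           # structural alignment against teacher
    return loss_denoise + lambda_feat * loss_feat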
@article{fei2025sam2videox,
  title   = {Structure From Tracking: Distilling Structure-Preserving Motion for Video Generation},
  author  = {Fei, Yang and Stoica, George and Liu, Jingyuan and Chen, Qifeng and Krishna, Ranjay and Wang, Xiaojuan and Liu, Benlin},
  journal = {arXiv preprint arXiv:2512.11792},
  year    = {2025},
}