International Conference on Computer Vision, ICCV 2025.
1 Nanyang Technological University 2 StepFun 3 Westlake University
MotionAgent is a novel framework that enables fine-grained motion control for text-guided image-to-video generation. At its core is a motion field agent that parses motion information in text prompts and converts it into explicit object trajectories and camera extrinsics. These motion representations are analytically integrated into a unified optical flow, which conditions a diffusion-based image-to-video model to generate videos with precise and flexible motion control. An optional rethinking step further refines motion alignment by iteratively correcting the agent’s previous actions.
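For intuition, here is a minimal numpy sketch of how an object trajectory and camera extrinsics could be composited into a single flow field. Everything in it (function names, the per-pixel depth input, the mask-based compositing rule) is an illustrative assumption, not the paper's actual implementation:

import numpy as np

def camera_flow(depth, K, R, t):
    # Backproject every pixel with its depth, apply the relative camera
    # motion (R, t), and reproject to get the camera-induced flow (H, W, 2).
    H, W = depth.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys, np.ones_like(xs)], -1).reshape(-1, 3).T  # 3 x HW homogeneous pixels
    pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)              # 3D points in the camera frame
    proj = K @ (R @ pts + t[:, None])                                # reproject after camera motion
    uv = (proj[:2] / proj[2:]).T.reshape(H, W, 2)
    return uv - np.stack([xs, ys], -1)

def unified_flow(cam_flow, traj_flow, obj_mask):
    # Object-trajectory flow overrides camera-induced flow inside the object mask.
    flow = cam_flow.copy()
    flow[obj_mask] = traj_flow[obj_mask]
    return flow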

Follow the steps below to set up MotionAgent and run the demo smoothly 💫
Clone the official GitHub repository and enter the project directory:
git clone https://github.com/leoisufa/MotionAgent.git
cd MotionAgent

# Create and activate conda environment
conda create -n motionagent python=3.10 -y
conda activate motionagent
# Install PyTorch with CUDA 12.4 support
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu124
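Before installing the remaining dependencies, you can optionally verify that the CUDA build of PyTorch is working:

# Optional sanity check: should print the version and True
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"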
# Install project dependencies
pip install -r requirements.txt

MotionAgent relies on external segmentation and grounding models. Follow the steps below to install Grounded-Segment-Anything:
# Navigate to models directory
cd models
# Clone the Grounded-Segment-Anything repository
git clone https://github.com/IDEA-Research/Grounded-Segment-Anything.git
# Enter the cloned directory
cd Grounded-Segment-Anything
# Install Segment Anything
python -m pip install -e segment_anything
# Install Grounding DINO
pip install --no-build-isolation -e GroundingDINO
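To sanity-check the install, here is a minimal sketch that grounds a text phrase to a box and then segments it with SAM. The relative paths assume you run it from the Grounded-Segment-Anything directory, and the example image and prompt are placeholders; how MotionAgent wires these models together internally may differ:

import torch
from groundingdino.util import box_ops
from groundingdino.util.inference import load_model, load_image, predict
from segment_anything import sam_model_registry, SamPredictor

# Ground a text phrase to a bounding box with GroundingDINO
dino = load_model(
    "GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py",  # config shipped in the cloned repo
    "../../ckpts/groundingdino_swint_ogc.pth",
)
image_source, image = load_image("example.png")  # placeholder input image
boxes, logits, phrases = predict(
    model=dino, image=image, caption="a red car",
    box_threshold=0.35, text_threshold=0.25,
)

# Convert the normalized cxcywh box to absolute xyxy and prompt SAM with it
H, W = image_source.shape[:2]
box_xyxy = (box_ops.box_cxcywh_to_xyxy(boxes) * torch.tensor([W, H, W, H]))[0].numpy()
sam = sam_model_registry["vit_h"](checkpoint="../../ckpts/sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(image_source)
masks, _, _ = predictor.predict(box=box_xyxy, multimask_output=False)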
MotionAgent relies on an external monocular depth estimation model. Follow the steps below to install Metric3D:

# Navigate to models directory
cd models
# Clone the Metric3D repository
git clone https://github.com/YvanYin/Metric3D.git
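As a quick check of the depth model, Metric3D also publishes torch.hub entry points; a minimal sketch using one of them is below. Note that MotionAgent itself loads the local checkpoint downloaded in the next section, and input preprocessing is omitted here for brevity:

import torch

# Load a small Metric3D ViT variant via the repo's torch.hub entry point
model = torch.hub.load("yvanyin/metric3d", "metric3d_vit_small", pretrain=True)
model.eval()
rgb = torch.rand(1, 3, 616, 1064)  # dummy image at the ViT models' expected input size
with torch.no_grad():
    pred_depth, confidence, output_dict = model.inference({"input": rgb})
print(pred_depth.shape)  # per-pixel metric depth map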
To run MotionAgent, please download all pretrained and auxiliary models listed below and organize them under the ckpts/ directory as shown in the example structure.

Download from 👉 Hugging Face (MotionAgent) and place the files in ckpts.
Download from 👉 Hugging Face (MOFA-Video-Hybrid/stable-video-diffusion-img2vid-xt-1-1) and save the model to ckpts.
Download the grounding model checkpoint using the command below:
wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth

Then place it directly under ckpts.
Download the segmentation model using:
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth

Then place it under ckpts.
Download from 👉 Hugging Face (Metric3D) and place the files in ckpts.
Download from 👉 Hugging Face (MOFA-Video-Hybrid/cmp) and save the model to models/cmp/experiments/semiauto_annot/resnet50_vip+mpii_liteflow/checkpoints.
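The Hugging Face downloads above can also be scripted with huggingface_hub. The repo IDs below are placeholders; substitute the actual repos linked above:

from huggingface_hub import snapshot_download

# Placeholders: substitute the actual repo IDs from the links above
snapshot_download(repo_id="<motionagent-repo-id>", local_dir="ckpts")
snapshot_download(
    repo_id="<mofa-video-hybrid-repo-id>",
    allow_patterns=["stable-video-diffusion-img2vid-xt-1-1/*"],
    local_dir="ckpts",
)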
After all downloads and installations, your ckpts folder should look like this:
ckpts/
├── controlnet/
├── stable-video-diffusion-img2vid-xt-1-1/
├── groundingdino_swint_ogc.pth
├── metric_depth_vit_small_800k.pth
└── sam_vit_h_4b8939.pth

Once everything is in place, run the demo:

python run_agent.py

If you find MotionAgent useful for your research and applications, please cite using this BibTeX:
@article{liao2025motionagent,
title={MotionAgent: Fine-grained Controllable Video Generation via Motion Field Agent},
author={Liao, Xinyao and Zeng, Xianfang and Wang, Liao and Yu, Gang and Lin, Guosheng and Zhang, Chi},
journal={arXiv preprint arXiv:2502.03207},
year={2025}
}

We thank the following projects for their excellent open-source work:
