International Conference on Computer Vision, ICCV 2025.
1 Nanyang Technological University 2 StepFun 3 Westlake University
MotionAgent is a novel framework that enables fine-grained motion control for text-guided image-to-video generation. At its core is a motion field agent that parses motion information in text prompts and converts it into explicit object trajectories and camera extrinsics. These motion representations are analytically integrated into a unified optical flow, which conditions a diffusion-based image-to-video model to generate videos with precise and flexible motion control. An optional rethinking step further refines motion alignment by iteratively correcting the agent’s previous actions.
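For intuition, here is a minimal numpy sketch of how an object trajectory and camera extrinsics could be composited into a single flow field. Everything in it (function names, the per-pixel depth input, the mask-based compositing rule) is an illustrative assumption, not the paper's actual implementation:

import numpy as np

def camera_flow(depth, K, R, t):
    # Backproject every pixel with its depth, apply the relative camera
    # motion (R, t), and reproject to get the camera-induced flow (H, W, 2).
    H, W = depth.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys, np.ones_like(xs)], -1).reshape(-1, 3).T  # 3 x HW homogeneous pixels
    pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)              # 3D points in the camera frame
    proj = K @ (R @ pts + t[:, None])                                # reproject after camera motion
    uv = (proj[:2] / proj[2:]).T.reshape(H, W, 2)
    return uv - np.stack([xs, ys], -1)

def unified_flow(cam_flow, traj_flow, obj_mask):
    # Object-trajectory flow overrides camera-induced flow inside the object mask.
    flow = cam_flow.copy()
    flow[obj_mask] = traj_flow[obj_mask]
    return flow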

Follow the steps below to set up MotionAgent and run the demo smoothly 💫
Clone the official GitHub repository and enter the project directory:
git clone https://github.com/leoisufa/MotionAgent.git
cd MotionAgent

# Create and activate conda environment
conda create -n motionagent python=3.10 -y
conda activate motionagent
# Install PyTorch with CUDA 12.4 support
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu124
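Before installing the remaining dependencies, you can optionally verify that the CUDA build of PyTorch is working:

# Optional sanity check: should print the version and True
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"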
# Install project dependencies
pip install -r requirements.txt

MotionAgent relies on external segmentation and grounding models. Follow the steps below to install Grounded-Segment-Anything:
# Navigate to models directory
cd models
# Clone the Grounded-Segment-Anything repository
git clone https://github.com/IDEA-Research/Grounded-Segment-Anything.git
# Enter the cloned directory
cd Grounded-Segment-Anything
# Install Segment Anything
python -m pip install -e segment_anything
# Install Grounding DINO
pip install --no-build-isolation -e GroundingDINO
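To sanity-check the install, here is a minimal sketch that grounds a text phrase to a box and then segments it with SAM. The relative paths assume you run it from the Grounded-Segment-Anything directory, and the example image and prompt are placeholders; how MotionAgent wires these models together internally may differ:

import torch
from groundingdino.util import box_ops
from groundingdino.util.inference import load_model, load_image, predict
from segment_anything import sam_model_registry, SamPredictor

# Ground a text phrase to a bounding box with GroundingDINO
dino = load_model(
    "GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py",  # config shipped in the cloned repo
    "../../ckpts/groundingdino_swint_ogc.pth",
)
image_source, image = load_image("example.png")  # placeholder input image
boxes, logits, phrases = predict(
    model=dino, image=image, caption="a red car",
    box_threshold=0.35, text_threshold=0.25,
)

# Convert the normalized cxcywh box to absolute xyxy and prompt SAM with it
H, W = image_source.shape[:2]
box_xyxy = (box_ops.box_cxcywh_to_xyxy(boxes) * torch.tensor([W, H, W, H]))[0].numpy()
sam = sam_model_registry["vit_h"](checkpoint="../../ckpts/sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(image_source)
masks, _, _ = predictor.predict(box=box_xyxy, multimask_output=False)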
MotionAgent relies on an external monocular depth estimation model. Follow the steps below to install Metric3D:

# Navigate to models directory
cd models
# Clone the Metric3D repository
git clone https://github.com/YvanYin/Metric3D.git
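As a quick check of the depth model, Metric3D also publishes torch.hub entry points; a minimal sketch using one of them is below. Note that MotionAgent itself loads the local checkpoint downloaded in the next section, and input preprocessing is omitted here for brevity:

import torch

# Load a small Metric3D ViT variant via the repo's torch.hub entry point
model = torch.hub.load("yvanyin/metric3d", "metric3d_vit_small", pretrain=True)
model.eval()
rgb = torch.rand(1, 3, 616, 1064)  # dummy image at the ViT models' expected input size
with torch.no_grad():
    pred_depth, confidence, output_dict = model.inference({"input": rgb})
print(pred_depth.shape)  # per-pixel metric depth map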
To run MotionAgent, please download all pretrained and auxiliary models listed below and organize them under the ckpts/ directory as shown in the example structure.

Download from 👉 Hugging Face (MotionAgent) and place the files in ckpts.
Download from 👉 Hugging Face (MOFA-Video-Hybrid/stable-video-diffusion-img2vid-xt-1-1) and save the model to ckpts.
Download the grounding model checkpoint using the command below:
wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth

Then place it directly under ckpts.
Download the segmentation model using:
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth

Then place it under ckpts.
Download from 👉 Hugging Face (Metric3D) and place the files in ckpts.
Download from 👉 Hugging Face (MOFA-Video-Hybrid/cmp) and save the model to models/cmp/experiments/semiauto_annot/resnet50_vip+mpii_liteflow/checkpoints.
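The Hugging Face downloads above can also be scripted with huggingface_hub. The repo IDs below are placeholders; substitute the actual repos linked above:

from huggingface_hub import snapshot_download

# Placeholders: substitute the actual repo IDs from the links above
snapshot_download(repo_id="<motionagent-repo-id>", local_dir="ckpts")
snapshot_download(
    repo_id="<mofa-video-hybrid-repo-id>",
    allow_patterns=["stable-video-diffusion-img2vid-xt-1-1/*"],
    local_dir="ckpts",
)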
After all downloads and installations, your ckpts folder should look like this:
ckpts/
├── controlnet/
├── stable-video-diffusion-img2vid-xt-1-1/
├── groundingdino_swint_ogc.pth
├── metric_depth_vit_small_800k.pth
└── sam_vit_h_4b8939.pth

Once everything is in place, run the demo:

python run_agent.py

If you find MotionAgent useful for your research and applications, please cite using this BibTeX:
@article{liao2025motionagent,
title={MotionAgent: Fine-grained Controllable Video Generation via Motion Field Agent},
author={Liao, Xinyao and Zeng, Xianfang and Wang, Liao and Yu, Gang and Lin, Guosheng and Zhang, Chi},
journal={arXiv preprint arXiv:2502.03207},
year={2025}
}

We thank the following projects for their excellent open-source work:
