Welcome to MotionSight, a cutting-edge framework for fine-grained motion understanding. This guide provides instructions for environment setup, model preparation, and evaluation.
- [2025.09.26] 📢 MotionChat Release on ModelScope!
- [2025.06.08] 📢 New Dataset Release on Hugging Face!
- [2025.06.03] 🚀 Initial Release of MotionSight
- Prerequisites
- Environment Setup
- Model Preparation
- Evaluation
- MotionVid Examples
- Troubleshooting & FAQ
- Citation
- Operating System: Linux (Ubuntu 20.04/22.04 recommended)
- Python: 3.8 or higher
- CUDA: 11.3+ (for GPU acceleration)
- Hardware: a GPU with at least 24GB VRAM is recommended for Qwen2.5-VL-7B
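If you want to confirm these prerequisites before installing anything, the short sketch below is an illustrative check (not part of the MotionSight tooling): it verifies the Python version and queries `nvidia-smi` for per-GPU VRAM, using only the standard library.

```python
# check_prereqs.py — illustrative sanity check for the prerequisites above.
import subprocess
import sys

# Python 3.8+ is required.
assert sys.version_info >= (3, 8), f"Python 3.8+ required, found {sys.version.split()[0]}"

# Query the NVIDIA driver for GPU name and total VRAM (requires nvidia-smi on PATH).
result = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader,nounits"],
    capture_output=True, text=True, check=True,
)
for line in result.stdout.strip().splitlines():
    name, mem_mib = [x.strip() for x in line.split(",")]
    status = "(OK)" if int(mem_mib) >= 24 * 1024 else "(below the 24GB recommendation)"
    print(f"{name}: {int(mem_mib) / 1024:.1f} GiB VRAM {status}")
```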
- **Clone the Repository**

  ```bash
  git clone https://github.com/NJU-PCALab/MotionSight
  cd MotionSight
  ```
- **Install Python Dependencies**

  It is highly recommended to use a virtual environment, e.g. conda, uv, venv. Here is an example of using python venv:

  ```bash
  python3 -m venv .venv
  source .venv/bin/activate
  pip install --upgrade pip
  pip install -r requirements.txt
  ```
- **Install Additional Dependencies**

  Some dependencies (e.g., `flash-attn`) may require specific versions. Please refer to `requirements.txt` and ensure compatibility with your CUDA version.
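  As a quick compatibility check after installation, the sketch below (an illustrative example, not part of the official tooling) imports the GPU-facing packages and prints the CUDA version PyTorch was built against, which should be compatible with your local CUDA toolkit:

  ```python
  # verify_env.py — illustrative post-install check.
  import torch

  print("PyTorch:", torch.__version__)
  print("Built with CUDA:", torch.version.cuda)        # should be compatible with your local CUDA (11.3+)
  print("CUDA available:", torch.cuda.is_available())  # False usually indicates a driver/toolkit mismatch

  try:
      import flash_attn  # fast-attention kernels; version must match your CUDA/PyTorch build
      print("flash-attn:", flash_attn.__version__)
  except ImportError as e:
      print("flash-attn not importable:", e)
  ```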
- **Download and Integrate GroundedSAM2**
  - Clone the GroundedSAM2 repository:

    ```bash
    git clone https://github.com/IDEA-Research/Grounded-SAM-2
    ```
  - Download all required checkpoints as specified in the GroundedSAM2 documentation.
  - Place the entire `GroundedSAM2` folder (with checkpoints) into the root of the MotionSight project directory, like `MotionSight/Grounded-SAM-2`.
  - Make sure `track_utils.py` is in the `GroundedSAM2/` directory:

    ```bash
    mv track_utils.py GroundedSAM2/
    ```
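  Before moving on, you can sanity-check the layout with a short script like the one below (an illustrative sketch; adjust the folder name if your clone lives under a different path):

  ```python
  # check_layout.py — illustrative check that GroundedSAM2 is where MotionSight expects it.
  from pathlib import Path

  root = Path(".")                   # run from the MotionSight project root
  sam2_dir = root / "GroundedSAM2"   # or "Grounded-SAM-2", depending on how you named the clone

  assert sam2_dir.is_dir(), f"missing {sam2_dir} — clone GroundedSAM2 into the project root"
  assert (sam2_dir / "track_utils.py").is_file(), "track_utils.py has not been moved into the GroundedSAM2 folder"

  # List checkpoint-like files so you can eyeball that the required weights were downloaded.
  print([p.name for p in sam2_dir.rglob("*.pt*")])
  ```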
- **Prepare Multimodal Large Language Model (MLLM) Checkpoints**
  - Download the MLLM checkpoints (e.g., Qwen2.5-VL-7B-Instruct) and place them in the appropriate directory.
  - You can selectively start the LLM server using lmdeploy, for example:

    ```bash
    lmdeploy serve api_server '/path/to/Qwen2.5-VL-7B-Instruct' --server-port 23333 --tp 1
    ```
  - Launch the tracking server (adjust `--p` and `--step` as needed for your setup):

    ```bash
    cd GroundedSAM2
    python track_utils.py --p 1 --step 10000
    cd ..
    ```
  - Ensure the server is running and accessible at the specified port.
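  As a quick connectivity check, the sketch below (illustrative, standard library only) assumes the OpenAI-compatible API that lmdeploy's `api_server` exposes and lists the models it is serving on the port used above:

  ```python
  # check_server.py — illustrative check that the lmdeploy api_server is reachable.
  import json
  from urllib.request import urlopen

  # The port must match the --server-port passed to `lmdeploy serve api_server` (23333 in the example above).
  with urlopen("http://127.0.0.1:23333/v1/models", timeout=5) as resp:
      models = json.load(resp)

  print("Models served:", [m["id"] for m in models.get("data", [])])
  ```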
- **Train using our MotionVid-QA dataset:**
  - We used Qwen2-VL-Finetune for fine-tuning. Our public dataset includes the fine-tuning config we used for Qwen2.5-VL. You can follow the instructions at Qwen2-VL-Finetune to configure it accordingly.
  - Download our fine-tuned model at MotionChat.
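  If you prefer to fetch the weights programmatically, a minimal sketch using the ModelScope SDK is shown below; the model ID is a placeholder, so substitute the actual MotionChat ID from its ModelScope page:

  ```python
  # download_motionchat.py — illustrative download via the ModelScope SDK (pip install modelscope).
  from modelscope import snapshot_download

  # NOTE: placeholder ID — replace with the real MotionChat model ID from ModelScope.
  local_dir = snapshot_download("your-org/MotionChat", cache_dir="./checkpoints")
  print("Checkpoint downloaded to:", local_dir)
  ```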
- To evaluate the results of MotionSight on the MotionBench or FAVOR-Bench benchmark:
  ```bash
  python -m eval.motionsight.eval_motionbench
  python -m eval.motionsight.eval_favorbench
  ```
- To evaluate our fine-tuned MotionChat:
  ```bash
  python -m eval.motionchat.motionchat --stage 2 --checkpoint "/path/to/checkpoint" --favor_pos "/path/to/FAVOR/"
  ```
- Ensure all evaluation datasets and configuration files are properly set up.
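  For example, a small pre-flight check along these lines (all paths below are placeholders for wherever you keep the benchmarks and checkpoint) can catch missing files before a long evaluation run:

  ```python
  # preflight.py — illustrative check of evaluation inputs; the paths are placeholders.
  from pathlib import Path

  required = {
      "MotionChat checkpoint": Path("/path/to/checkpoint"),
      "FAVOR-Bench root": Path("/path/to/FAVOR/"),
      "MotionBench root": Path("/path/to/MotionBench/"),
  }

  missing = [name for name, path in required.items() if not path.exists()]
  if missing:
      raise SystemExit(f"Missing evaluation inputs: {', '.join(missing)}")
  print("All evaluation paths found.")
  ```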
MotionVid is our specialized module for processing and analyzing fine-grained motion in video content. It provides tools for detailed motion tracking, temporal understanding, and multi-object interaction analysis.
Our framework includes several sample videos that demonstrate MotionVid's capabilities:
| Video | Description | Focus Area |
|---|---|---|
| 📹 pexels_landscape_landscape_7895832_002.mp4 | Train moving through desert landscape | Object tracking across complex terrain |
| 📹 pixabay_Beach_Sunrise_37084_001.mp4 | Vehicles driving in desert with dust trails | Camera movement and environmental effects |
| 📹 v_JNr0oI927ng_t0.13-5.64.mp4 | Person on diving board | Subtle human motion analysis |
| 📹 -eq3I7gRqTI_000100_000110.mp4 | Person mowing lawn with passing vehicle | Multi-object interaction |
| 📹 DKZPW.mp4 | Person interacting with pet and objects | Complex sequence analysis |
The module includes a `show.json` file that pairs videos with question-answer examples.
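Since the exact schema of `show.json` is easiest to learn by inspecting the file itself, here is a small, non-authoritative sketch that loads it and prints the structure of the first entry (the file location is an assumption; adjust it to your checkout):

```python
# inspect_show.py — illustrative peek at the MotionVid example annotations.
import json
from pathlib import Path

# Assumed location; adjust if show.json lives elsewhere in your checkout.
with open(Path("MotionVid") / "show.json", "r", encoding="utf-8") as f:
    examples = json.load(f)

print(f"{len(examples)} example entries")
first = examples[0] if isinstance(examples, list) else examples
print("Keys in the first entry:", list(first.keys()) if isinstance(first, dict) else type(first))
```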
To see these videos in action with MotionSight analysis:
```bash
# Run the MotionBench evaluation pipeline
python -m eval.motionsight.eval_motionbench

# When implemented, you'll be able to process individual videos
# python process_video.py --input MotionVid/samples/DKZPW.mp4 --output results/
```

More detailed examples of video processing will be provided in upcoming documentation.
- Q: I encounter CUDA or dependency errors.
- A: Double-check your CUDA version and ensure all dependencies are installed with compatible versions.
- Q: The LLM server is not responding.
- A: Verify that the server is running and the port matches the one specified in your scripts.
If you find MotionSight helpful in your research, please consider citing our paper:
```bibtex
@misc{du2025motionsightboostingfinegrainedmotion,
  title={MotionSight: Boosting Fine-Grained Motion Understanding in Multimodal LLMs},
  author={Yipeng Du and Tiehan Fan and Kepan Nan and Rui Xie and Penghao Zhou and Xiang Li and Jian Yang and Zhenheng Yang and Ying Tai},
  year={2025},
  eprint={2506.01674},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.01674},
}
```