Welcome to MotionSight, a cutting-edge framework for fine-grained motion understanding. This guide provides instructions for environment setup, model preparation, and evaluation.
- [2025.09.26] 📢 MotionChat Release on ModelScope!
- [2025.06.08] 📢 New Dataset Release on Hugging Face!
- [2025.06.03] 🚀 Initial Release of MotionSight
- Prerequisites
- Environment Setup
- Model Preparation
- Evaluation
- MotionVid Examples
- Troubleshooting & FAQ
- Citation
- Operating System: Linux (Ubuntu 20.04/22.04 recommended)
- Python: 3.8 or higher
- CUDA: 11.3+ (for GPU acceleration)
- Hardware: a GPU with at least 24GB VRAM is recommended for Qwen2.5-VL-7B
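If you want to confirm these prerequisites before installing anything, the short sketch below is an illustrative check (not part of the MotionSight tooling): it verifies the Python version and queries `nvidia-smi` for per-GPU VRAM, using only the standard library.

```python
# check_prereqs.py — illustrative sanity check for the prerequisites above.
import subprocess
import sys

# Python 3.8+ is required.
assert sys.version_info >= (3, 8), f"Python 3.8+ required, found {sys.version.split()[0]}"

# Query the NVIDIA driver for GPU name and total VRAM (requires nvidia-smi on PATH).
result = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader,nounits"],
    capture_output=True, text=True, check=True,
)
for line in result.stdout.strip().splitlines():
    name, mem_mib = [x.strip() for x in line.split(",")]
    status = "(OK)" if int(mem_mib) >= 24 * 1024 else "(below the 24GB recommendation)"
    print(f"{name}: {int(mem_mib) / 1024:.1f} GiB VRAM {status}")
```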
- **Clone the Repository**

  ```bash
  git clone https://github.com/NJU-PCALab/MotionSight
  cd MotionSight
  ```
- **Install Python Dependencies**

  It is highly recommended to use a virtual environment, e.g. conda, uv, venv. Here is an example of using python venv:

  ```bash
  python3 -m venv .venv
  source .venv/bin/activate
  pip install --upgrade pip
  pip install -r requirements.txt
  ```
- **Install Additional Dependencies**

  Some dependencies (e.g., `flash-attn`) may require specific versions. Please refer to `requirements.txt` and ensure compatibility with your CUDA version.
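  As a quick compatibility check after installation, the sketch below (an illustrative example, not part of the official tooling) imports the GPU-facing packages and prints the CUDA version PyTorch was built against, which should be compatible with your local CUDA toolkit:

  ```python
  # verify_env.py — illustrative post-install check.
  import torch

  print("PyTorch:", torch.__version__)
  print("Built with CUDA:", torch.version.cuda)        # should be compatible with your local CUDA (11.3+)
  print("CUDA available:", torch.cuda.is_available())  # False usually indicates a driver/toolkit mismatch

  try:
      import flash_attn  # fast-attention kernels; version must match your CUDA/PyTorch build
      print("flash-attn:", flash_attn.__version__)
  except ImportError as e:
      print("flash-attn not importable:", e)
  ```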
- **Download and Integrate GroundedSAM2**
  - Clone the GroundedSAM2 repository:

    ```bash
    git clone https://github.com/IDEA-Research/Grounded-SAM-2
    ```
  - Download all required checkpoints as specified in the GroundedSAM2 documentation.
  - Place the entire `GroundedSAM2` folder (with checkpoints) into the root of the MotionSight project directory, like `MotionSight/Grounded-SAM-2`.
  - Make sure `track_utils.py` is in the `GroundedSAM2/` directory:

    ```bash
    mv track_utils.py GroundedSAM2/
    ```
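  Before moving on, you can sanity-check the layout with a short script like the one below (an illustrative sketch; adjust the folder name if your clone lives under a different path):

  ```python
  # check_layout.py — illustrative check that GroundedSAM2 is where MotionSight expects it.
  from pathlib import Path

  root = Path(".")                   # run from the MotionSight project root
  sam2_dir = root / "GroundedSAM2"   # or "Grounded-SAM-2", depending on how you named the clone

  assert sam2_dir.is_dir(), f"missing {sam2_dir} — clone GroundedSAM2 into the project root"
  assert (sam2_dir / "track_utils.py").is_file(), "track_utils.py has not been moved into the GroundedSAM2 folder"

  # List checkpoint-like files so you can eyeball that the required weights were downloaded.
  print([p.name for p in sam2_dir.rglob("*.pt*")])
  ```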
- **Prepare Multimodal Large Language Model (MLLM) Checkpoints**
  - Download the MLLM checkpoints (e.g., Qwen2.5-VL-7B-Instruct) and place them in the appropriate directory.
  - You can selectively start the LLM server using lmdeploy, for example:

    ```bash
    lmdeploy serve api_server '/path/to/Qwen2.5-VL-7B-Instruct' --server-port 23333 --tp 1
    ```
  - Launch the tracking server (adjust `--p` and `--step` as needed for your setup):

    ```bash
    cd GroundedSAM2
    python track_utils.py --p 1 --step 10000
    cd ..
    ```
  - Ensure the server is running and accessible at the specified port.
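  As a quick connectivity check, the sketch below (illustrative, standard library only) assumes the OpenAI-compatible API that lmdeploy's `api_server` exposes and lists the models it is serving on the port used above:

  ```python
  # check_server.py — illustrative check that the lmdeploy api_server is reachable.
  import json
  from urllib.request import urlopen

  # The port must match the --server-port passed to `lmdeploy serve api_server` (23333 in the example above).
  with urlopen("http://127.0.0.1:23333/v1/models", timeout=5) as resp:
      models = json.load(resp)

  print("Models served:", [m["id"] for m in models.get("data", [])])
  ```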
- **Train using our MotionVid-QA dataset:**
  - We used Qwen2-VL-Finetune for fine-tuning. Our public dataset includes the fine-tuning config we used for Qwen2.5-VL. You can follow the instructions at Qwen2-VL-Finetune to configure it accordingly.
  - Download our fine-tuned model at MotionChat.
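  If you prefer to fetch the weights programmatically, a minimal sketch using the ModelScope SDK is shown below; the model ID is a placeholder, so substitute the actual MotionChat ID from its ModelScope page:

  ```python
  # download_motionchat.py — illustrative download via the ModelScope SDK (pip install modelscope).
  from modelscope import snapshot_download

  # NOTE: placeholder ID — replace with the real MotionChat model ID from ModelScope.
  local_dir = snapshot_download("your-org/MotionChat", cache_dir="./checkpoints")
  print("Checkpoint downloaded to:", local_dir)
  ```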
- To evaluate the results of MotionSight on the MotionBench or FAVOR-Bench benchmark:
  ```bash
  python -m eval.motionsight.eval_motionbench
  python -m eval.motionsight.eval_favorbench
  ```
- To evaluate our fine-tuned MotionChat:
  ```bash
  python -m eval.motionchat.motionchat --stage 2 --checkpoint "/path/to/checkpoint" --favor_pos "/path/to/FAVOR/"
  ```
- Ensure all evaluation datasets and configuration files are properly set up.
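  For example, a small pre-flight check along these lines (all paths below are placeholders for wherever you keep the benchmarks and checkpoint) can catch missing files before a long evaluation run:

  ```python
  # preflight.py — illustrative check of evaluation inputs; the paths are placeholders.
  from pathlib import Path

  required = {
      "MotionChat checkpoint": Path("/path/to/checkpoint"),
      "FAVOR-Bench root": Path("/path/to/FAVOR/"),
      "MotionBench root": Path("/path/to/MotionBench/"),
  }

  missing = [name for name, path in required.items() if not path.exists()]
  if missing:
      raise SystemExit(f"Missing evaluation inputs: {', '.join(missing)}")
  print("All evaluation paths found.")
  ```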
MotionVid is our specialized module for processing and analyzing fine-grained motion in video content. It provides tools for detailed motion tracking, temporal understanding, and multi-object interaction analysis.
Our framework includes several sample videos that demonstrate MotionVid's capabilities:
| Video | Description | Focus Area |
|---|---|---|
| 📹 pexels_landscape_landscape_7895832_002.mp4 | Train moving through desert landscape | Object tracking across complex terrain |
| 📹 pixabay_Beach_Sunrise_37084_001.mp4 | Vehicles driving in desert with dust trails | Camera movement and environmental effects |
| 📹 v_JNr0oI927ng_t0.13-5.64.mp4 | Person on diving board | Subtle human motion analysis |
| 📹 -eq3I7gRqTI_000100_000110.mp4 | Person mowing lawn with passing vehicle | Multi-object interaction |
| 📹 DKZPW.mp4 | Person interacting with pet and objects | Complex sequence analysis |
The module includes a `show.json` file that pairs videos with question-answer examples.
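Since the exact schema of `show.json` is easiest to learn by inspecting the file itself, here is a small, non-authoritative sketch that loads it and prints the structure of the first entry (the file location is an assumption; adjust it to your checkout):

```python
# inspect_show.py — illustrative peek at the MotionVid example annotations.
import json
from pathlib import Path

# Assumed location; adjust if show.json lives elsewhere in your checkout.
with open(Path("MotionVid") / "show.json", "r", encoding="utf-8") as f:
    examples = json.load(f)

print(f"{len(examples)} example entries")
first = examples[0] if isinstance(examples, list) else examples
print("Keys in the first entry:", list(first.keys()) if isinstance(first, dict) else type(first))
```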
To see these videos in action with MotionSight analysis:
```bash
# Run the MotionBench evaluation pipeline
python -m eval.motionsight.eval_motionbench

# When implemented, you'll be able to process individual videos
# python process_video.py --input MotionVid/samples/DKZPW.mp4 --output results/
```

More detailed examples of video processing will be provided in upcoming documentation.
- Q: I encounter CUDA or dependency errors.
- A: Double-check your CUDA version and ensure all dependencies are installed with compatible versions.
- Q: The LLM server is not responding.
- A: Verify that the server is running and the port matches the one specified in your scripts.
If you find MotionSight helpful in your research, please consider citing our paper:
```bibtex
@misc{du2025motionsightboostingfinegrainedmotion,
  title={MotionSight: Boosting Fine-Grained Motion Understanding in Multimodal LLMs},
  author={Yipeng Du and Tiehan Fan and Kepan Nan and Rui Xie and Penghao Zhou and Xiang Li and Jian Yang and Zhenheng Yang and Ying Tai},
  year={2025},
  eprint={2506.01674},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.01674},
}
```