🔍 MotionSight: Boosting Fine-Grained Motion Understanding in Multimodal LLMs

Yipeng Du1*    Tiehan Fan1*    Kepan Nan1,2    Rui Xie1,2    Penghao Zhou2    Xiang Li3    Jian Yang1    Zhenheng Yang2    Ying Tai1†   
1 Nanjing University    2 ByteDance    3 Nankai University
* Equal contribution.    † Corresponding author.

Paper · Dataset · Website

Welcome to MotionSight, a cutting-edge framework for fine-grained motion understanding. This guide provides instructions for environment setup, model preparation, and evaluation.


📣 News

  • [2025.09.26] 📢 MotionChat Release on ModelScope!
  • [2025.06.08] 📢 New Dataset Release on Hugging Face!
  • [2025.06.03] 🚀 Initial Release of MotionSight

📋 Table of Contents

  1. Prerequisites
  2. Environment Setup
  3. Model Preparation
  4. Evaluation
  5. MotionVid Examples
  6. Troubleshooting & FAQ
  7. Citation

🛠️ Prerequisites

  • Operating System: Linux (Ubuntu 20.04/22.04 recommended)
  • Python: 3.8 or higher
  • CUDA: 11.3+ (for GPU acceleration)
  • Hardware: a GPU with at least 24GB VRAM is recommended for Qwen2.5-VL-7B
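
A quick pre-flight check of these requirements might look like the sketch below; it uses only the Python standard library plus nvidia-smi, which ships with the NVIDIA driver.

# Pre-flight check (a sketch): verifies the Python version and reports
# each GPU's name and total memory via nvidia-smi.
import subprocess
import sys

assert sys.version_info >= (3, 8), "Python 3.8+ is required"

out = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print(out.stdout.strip())  # e.g. "NVIDIA RTX 4090, 24564 MiB"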

🔧 Environment Setup

  1. Clone the Repository

    git clone https://github.com/NJU-PCALab/MotionSight
    cd MotionSight
  2. Install Python Dependencies

    It is highly recommended to use a virtual environment (e.g., conda, uv, or venv). Here is an example using Python's built-in venv:

    python3 -m venv .venv
    source .venv/bin/activate
    pip install --upgrade pip
    pip install -r requirements.txt
  3. Install Additional Dependencies

    Some dependencies (e.g., flash-attn) may require specific versions. Refer to requirements.txt and ensure compatibility with your CUDA version; a quick import check is sketched after this list.

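Once the requirements are installed, a short import check can confirm that torch's CUDA build and flash-attn load cleanly (a sketch; the authoritative version pins live in requirements.txt):

# Verify torch sees CUDA and flash-attn imports without ABI errors.
import torch

print("torch:", torch.__version__, "| CUDA build:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())

try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError as exc:
    print("flash-attn missing or incompatible:", exc)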

📦 Model Preparation

  1. Download and Integrate GroundedSAM2

    • Clone the GroundedSAM2 repository into the root of the MotionSight project directory (the target folder name matters, since later commands assume GroundedSAM2/):
      git clone https://github.com/IDEA-Research/Grounded-SAM-2 GroundedSAM2
    • Download all required checkpoints as specified in the GroundedSAM2 documentation and place them inside MotionSight/GroundedSAM2.
    • Move track_utils.py into the GroundedSAM2/ directory:
      mv track_utils.py GroundedSAM2/
  2. Prepare Multimodal Large Language Model (MLLM) Checkpoints

    • Download the MLLM checkpoints (e.g., Qwen2.5-VL-7B-Instruct) and place them in the appropriate directory.
    • Optionally start the LLM server using lmdeploy, for example:
      lmdeploy serve api_server '/path/to/Qwen2.5-VL-7B-Instruct' --server-port 23333 --tp 1
    • Launch the tracking server (adjust --p and --step as needed for your setup):
      cd GroundedSAM2
      python track_utils.py --p 1 --step 10000
      cd ..
    • Ensure the servers are running and accessible at the specified ports; a minimal client-side check is sketched after this list.
  3. Train Using Our MotionVid-QA Dataset

    • We used Qwen2-VL-Finetune for fine-tuning. Our public dataset includes the fine-tuning config we used for Qwen2.5-VL; follow the instructions at Qwen2-VL-Finetune to configure it accordingly.
    • Alternatively, download our fine-tuned model, MotionChat.
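
lmdeploy's api_server exposes an OpenAI-compatible API, so you can sanity-check the LLM server from Python before running anything heavier. The sketch below assumes the default endpoint from the command above (localhost, port 23333) and the openai client package; adjust both to your deployment.

# A minimal sketch, assuming the lmdeploy server from the step above is
# listening on localhost:23333. lmdeploy accepts any API key unless one
# was configured explicitly.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:23333/v1", api_key="none")

# Ask the server which model it is serving.
model_name = client.models.list().data[0].id
print(f"Serving model: {model_name}")

# Send a trivial text-only request to confirm end-to-end responses work.
response = client.chat.completions.create(
    model=model_name,
    messages=[{"role": "user", "content": "Reply with OK."}],
    max_tokens=8,
)
print(response.choices[0].message.content)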

📊 Evaluation

  • To evaluate the results of MotionSight on the MotionBench or FAVOR-Bench benchmark:
    python -m eval.motionsight.eval_motionbench
    python -m eval.motionsight.eval_favorbench
  • To evaluate our fine-tuned MotionChat:
    python -m eval.motionchat.motionchat --stage 2 --checkpoint "/path/to/checkpoint" --favor_pos "/path/to/FAVOR/"
  • Ensure all evaluation datasets and configuration files are properly set up.

🎬 MotionVid Examples

MotionVid is our specialized module for processing and analyzing fine-grained motion in video content. It provides tools for detailed motion tracking, temporal understanding, and multi-object interaction analysis.

📊 Sample Videos and Analysis

Our framework includes several sample videos that demonstrate MotionVid's capabilities:

| Video | Description | Focus Area |
|---|---|---|
| 📹 pexels_landscape_landscape_7895832_002.mp4 | Train moving through desert landscape | Object tracking across complex terrain |
| 📹 pixabay_Beach_Sunrise_37084_001.mp4 | Vehicles driving in desert with dust trails | Camera movement and environmental effects |
| 📹 v_JNr0oI927ng_t0.13-5.64.mp4 | Person on diving board | Subtle human motion analysis |
| 📹 -eq3I7gRqTI_000100_000110.mp4 | Person mowing lawn with passing vehicle | Multi-object interaction |
| 📹 DKZPW.mp4 | Person interacting with pet and objects | Complex sequence analysis |

The module includes a show.json file that pairs videos with question-answer examples.
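
To browse these pairs programmatically, something like the following should work. This is a sketch: the path and the field names ("video", "question", "answer") are assumptions, so check show.json itself for the actual schema.

# Load the sample video/QA pairs from show.json (hypothetical schema).
import json

with open("MotionVid/show.json") as f:  # adjust the path to your checkout
    examples = json.load(f)

for ex in examples:
    print("Video:   ", ex["video"])     # assumed field name
    print("Question:", ex["question"])  # assumed field name
    print("Answer:  ", ex["answer"])    # assumed field name
    print()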

🎦 Working with Sample Videos

To see these videos in action with MotionSight analysis:

# Run the MotionBench evaluation pipeline
python -m eval.motionsight.eval_motionbench

# When implemented, you'll be able to process individual videos
# python process_video.py --input MotionVid/samples/DKZPW.mp4 --output results/

More detailed examples of video processing will be provided in upcoming documentation.

❓ Troubleshooting & FAQ

  • Q: I encounter CUDA or dependency errors.
    • A: Double-check your CUDA version and ensure all dependencies are installed with compatible versions (see the checks in Environment Setup).
  • Q: The LLM server is not responding.
    • A: Verify that the server is running and that the port matches the one specified in your scripts; a minimal connectivity check is sketched below.
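
If you are unsure whether the LLM server is reachable at all, a bare connectivity ping (assuming the default port 23333 and the standard /v1/models route of an OpenAI-compatible server) can rule out networking issues:

# Connectivity check only; no model inference involved.
import requests

try:
    r = requests.get("http://localhost:23333/v1/models", timeout=5)
    print("Server is up:", r.status_code)
except requests.exceptions.ConnectionError:
    print("Server unreachable; confirm lmdeploy is running on port 23333.")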

📝 Citation

If MotionSight has been helpful in your research, please consider citing our paper.

@misc{du2025motionsightboostingfinegrainedmotion,
      title={MotionSight: Boosting Fine-Grained Motion Understanding in Multimodal LLMs},
      author={Yipeng Du and Tiehan Fan and Kepan Nan and Rui Xie and Penghao Zhou and Xiang Li and Jian Yang and Zhenheng Yang and Ying Tai},
      year={2025},
      eprint={2506.01674},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.01674},
}
