Accurate temporal prediction is the bridge between comprehensive scene understanding and embodied artificial intelligence. However, predicting multiple fine-grained states of a scene at multiple temporal scales is difficult for vision-language models. We formalize the Multi-Scale Temporal Prediction (MSTP) task in surgical and general scenes by decomposing multi-scale into two orthogonal dimensions. We further propose a method, Incremental Generation and Multi-agent Collaboration, to tackle this important and challenging task.
Tested on 2 × NVIDIA H200 Tensor Core GPUs.

```bash
# Clone the repository and enter LLaMA-Factory
git clone https://github.com/jinlab-imvr/MSTP.git
cd MSTP/LLaMA-Factory
# Create and activate the environment
conda create -n mstp python=3.10 -y
conda activate mstp
# Install core dependencies
pip install wheel
pip install -e ".[torch,metrics]" --no-build-isolation
# (Optional) Choose transformers version by model family
# For Qwen2.5-VL series pretrained models:
pip install transformers==4.51
# For InternVL3 and gemma-3 series pretrained models:
pip install transformers==4.52
# Additional requirements
pip install -r requirements.txt
```
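Optionally, run a quick sanity check to confirm that the GPUs are visible and the intended transformers pin is active (a minimal sketch, assuming the `mstp` environment is activated):

```bash
# Verify GPU visibility and installed package versions
python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())"
python -c "import transformers; print(transformers.__version__)"
llamafactory-cli version   # prints the installed LLaMA-Factory version
```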
The dataset provided in the paper can be downloaded for verification.

- We use 8 video frames from the GraSP dataset for training and 4 video frames for testing.
- Please run `make_augment_all.py` to perform data augmentation (a usage sketch follows this list).
- If you want to obtain the processed labels for the MSTP task used in the paper:
  - Fill out this form to obtain the download link.
  - After downloading, extract the compressed file to `LLaMA-Factory/data/`.
If you want to customize the dataset, please refer to the data instructions.
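The steps above can be chained as follows; this is a minimal sketch that assumes the downloaded labels arrive as a single archive (the file name is a placeholder) and that `make_augment_all.py` sits at the repository root, so adjust both paths to match your download and checkout.

```bash
cd MSTP

# Extract the processed MSTP labels into LLaMA-Factory/data/
# (placeholder archive name -- use the file obtained via the form)
unzip mstp_labels.zip -d LLaMA-Factory/data/

# Run the data augmentation script mentioned above
python make_augment_all.py
```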
Download the pretrained visual generation model weights (formerly SD weights) we provide, and place them in the pretrained directory:
Download the LoRA weights of the pretrained decision-making models (formerly VL models) we trained, and place them in the LoRA directory:
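As a rough sketch of the expected layout once both downloads finish (every source path and file name below is a placeholder, not an actual release artifact):

```bash
cd MSTP

# Place the visual generation weights and the decision-making LoRA weights
# into the directories mentioned above (placeholder source paths)
mkdir -p pretrained LoRA
cp -r /path/to/downloaded/sd_weights/. pretrained/
cp -r /path/to/downloaded/lora_weights/. LoRA/
```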
```bash
cd MSTP/LLaMA-Factory
# Use Qwen2.5-VL-7B-Instruct as the decision-making model
python ../TP_IG.py --cir 5 --time 1 --start 0 --end 200 \
--data_dir dir_to_dataset --sd_model large --mode test \
--model_name Qwen2.5-VL-7B-Instruct
# Use gemma-3-4b-it as the decision-making model
python ../TP_IG.py --cir 5 --time 1 --start 0 --end 200 \
--data_dir dir_to_dataset --sd_model large --mode test \
--model_name gemma-3-4b-it
# Use InternVL3-8B-hf as the decision-making model
python ../TP_IG.py --cir 5 --time 1 --start 0 --end 200 \
--data_dir dir_to_dataset --sd_model large --mode test \
--model_name InternVL3-8B-hf
```
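The three commands above differ only in `--model_name`, so a loop can sweep them with identical settings; this is just a convenience sketch reusing the flags shown above (replace `dir_to_dataset` with your dataset path). Note that the environment setup pins different transformers versions for Qwen2.5-VL versus InternVL3/gemma-3, so you may need separate environments rather than a single run.

```bash
# Run TP_IG.py once per decision-making model with identical settings
for model in Qwen2.5-VL-7B-Instruct gemma-3-4b-it InternVL3-8B-hf; do
    python ../TP_IG.py --cir 5 --time 1 --start 0 --end 200 \
        --data_dir dir_to_dataset --sd_model large --mode test \
        --model_name "$model"
done
```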
To fine-tune the SD3.5-based visual generation model, please refer to the official Stable Diffusion 3.5 fine-tuning guide.
This project uses LoRA to train the decision-making models (formerly VL models).
```bash
cd MSTP/LLaMA-Factory

# Fine-tune the decision-making model with LoRA
DISABLE_VERSION_CHECK=1 llamafactory-cli train \
    examples/train_lora/Qwen2.5-VL-7B-Instruct/qwen2.5vl_lora_sft_chain1_1s.yaml

# Merge the trained LoRA weights into the base model
DISABLE_VERSION_CHECK=1 llamafactory-cli export \
    examples/merge_lora/Qwen2.5-VL-7B-Instruct/qwen2.5vl_lora_sft_chain1_1s.yaml
```

Generate decision-making model results in batches:
```bash
# Batch prediction reuses the train entrypoint, driven by a predict yaml
DISABLE_VERSION_CHECK=1 llamafactory-cli train \
    examples/predict/Qwen2.5-VL-7B-Instruct/qwen2.5vl_lora_sft_chain1_1s.yaml
```
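Every stage above is driven by a yaml file under `examples/`. For orientation, the snippet below writes a hypothetical LoRA SFT config in LLaMA-Factory's format; the field names are standard LLaMA-Factory options, but the dataset name, template, and hyperparameter values are illustrative assumptions rather than the configs shipped with this repo.

```bash
# Hypothetical LoRA SFT config (values are illustrative; see examples/train_lora/ for the real ones)
cat > examples/train_lora/custom_lora_sft.yaml <<'EOF'
model_name_or_path: Qwen/Qwen2.5-VL-7B-Instruct
stage: sft
do_train: true
finetuning_type: lora
lora_target: all
dataset: mstp_chain1_1s            # must be registered in data/dataset_info.json
template: qwen2_vl
cutoff_len: 4096
output_dir: saves/qwen2.5vl-7b/lora/mstp_chain1_1s
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
bf16: true
EOF

DISABLE_VERSION_CHECK=1 llamafactory-cli train examples/train_lora/custom_lora_sft.yaml
```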
If you find this code useful for your research, please cite:

```bibtex
@misc{zeng2025multiscaletemporalpredictionincremental,
  title         = {Multi-scale Temporal Prediction via Incremental Generation and Multi-agent Collaboration},
  author        = {Zhitao Zeng and Guojian Yuan and Junyuan Mao and Yuxuan Wang and Xiaoshuang Jia and Yueming Jin},
  year          = {2025},
  eprint        = {2509.17429},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2509.17429},
}
```