Accurate temporal prediction is the bridge between comprehensive scene understanding and embodied artificial intelligence. However, predicting multiple fine-grained states of a scene at multiple temporal scales is difficult for vision-language models. We formalize the Multi-Scale Temporal Prediction (MSTP) task in surgical and general scenes by decomposing multi-scale into two orthogonal dimensions. We further propose a method, Incremental Generation and Multi-agent Collaboration, to tackle this important and challenging task.
Tested on 2 × NVIDIA H200 Tensor Core GPUs.

```bash
# Clone the repository and enter LLaMA-Factory
git clone https://github.com/jinlab-imvr/MSTP.git
cd MSTP/LLaMA-Factory
# Create and activate the environment
conda create -n mstp python=3.10 -y
conda activate mstp
# Install core dependencies
pip install wheel
pip install -e ".[torch,metrics]" --no-build-isolation
# (Optional) Choose transformers version by model family
# For Qwen2.5-VL series pretrained models:
pip install transformers==4.51
# For InternVL3 and gemma-3 series pretrained models:
pip install transformers==4.52
# Additional requirements
pip install -r requirements.txt
```
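Optionally, run a quick sanity check to confirm that the GPUs are visible and the intended transformers pin is active (a minimal sketch, assuming the `mstp` environment is activated):

```bash
# Verify GPU visibility and installed package versions
python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())"
python -c "import transformers; print(transformers.__version__)"
llamafactory-cli version   # prints the installed LLaMA-Factory version
```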
The dataset provided in the paper can be downloaded for verification.

- We use 8 video frames from the GraSP dataset for training and 4 video frames for testing.
- Please run `make_augment_all.py` to perform data augmentation (a usage sketch follows this list).
- If you want to obtain the processed labels for the MSTP task used in the paper:
  - Fill out this form to obtain the download link.
  - After downloading, extract the compressed file to `LLaMA-Factory/data/`.
If you want to customize the dataset, please refer to the data instructions.
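The steps above can be chained as follows; this is a minimal sketch that assumes the downloaded labels arrive as a single archive (the file name is a placeholder) and that `make_augment_all.py` sits at the repository root, so adjust both paths to match your download and checkout.

```bash
cd MSTP

# Extract the processed MSTP labels into LLaMA-Factory/data/
# (placeholder archive name -- use the file obtained via the form)
unzip mstp_labels.zip -d LLaMA-Factory/data/

# Run the data augmentation script mentioned above
python make_augment_all.py
```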
Download the pretrained visual generation model weights (formerly SD weights) we provide, and place them in the pretrained directory:
Download the LoRA weights of the pretrained decision-making models (formerly VL models) we trained, and place them in the LoRA directory:
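As a rough sketch of the expected layout once both downloads finish (every source path and file name below is a placeholder, not an actual release artifact):

```bash
cd MSTP

# Place the visual generation weights and the decision-making LoRA weights
# into the directories mentioned above (placeholder source paths)
mkdir -p pretrained LoRA
cp -r /path/to/downloaded/sd_weights/. pretrained/
cp -r /path/to/downloaded/lora_weights/. LoRA/
```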
```bash
cd MSTP/LLaMA-Factory
# Use Qwen2.5-VL-7B-Instruct as the decision-making model
python ../TP_IG.py --cir 5 --time 1 --start 0 --end 200 \
--data_dir dir_to_dataset --sd_model large --mode test \
--model_name Qwen2.5-VL-7B-Instruct
# Use gemma-3-4b-it as the decision-making model
python ../TP_IG.py --cir 5 --time 1 --start 0 --end 200 \
--data_dir dir_to_dataset --sd_model large --mode test \
--model_name gemma-3-4b-it
# Use InternVL3-8B-hf as the decision-making model
python ../TP_IG.py --cir 5 --time 1 --start 0 --end 200 \
--data_dir dir_to_dataset --sd_model large --mode test \
--model_name InternVL3-8B-hf
```
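The three commands above differ only in `--model_name`, so a loop can sweep them with identical settings; this is just a convenience sketch reusing the flags shown above (replace `dir_to_dataset` with your dataset path). Note that the environment setup pins different transformers versions for Qwen2.5-VL versus InternVL3/gemma-3, so you may need separate environments rather than a single run.

```bash
# Run TP_IG.py once per decision-making model with identical settings
for model in Qwen2.5-VL-7B-Instruct gemma-3-4b-it InternVL3-8B-hf; do
    python ../TP_IG.py --cir 5 --time 1 --start 0 --end 200 \
        --data_dir dir_to_dataset --sd_model large --mode test \
        --model_name "$model"
done
```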
To fine-tune the SD3.5-based visual generation model, please refer to the official Stable Diffusion 3.5 fine-tuning guide.
This project uses LoRA to train the decision-making models (formerly VL models).
```bash
cd MSTP/LLaMA-Factory

# Fine-tune the decision-making model with LoRA
DISABLE_VERSION_CHECK=1 llamafactory-cli train \
    examples/train_lora/Qwen2.5-VL-7B-Instruct/qwen2.5vl_lora_sft_chain1_1s.yaml

# Merge the trained LoRA weights into the base model
DISABLE_VERSION_CHECK=1 llamafactory-cli export \
    examples/merge_lora/Qwen2.5-VL-7B-Instruct/qwen2.5vl_lora_sft_chain1_1s.yaml
```

Generate decision-making model results in batches:
```bash
# Batch prediction reuses the train entrypoint, driven by a predict yaml
DISABLE_VERSION_CHECK=1 llamafactory-cli train \
    examples/predict/Qwen2.5-VL-7B-Instruct/qwen2.5vl_lora_sft_chain1_1s.yaml
```
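Every stage above is driven by a yaml file under `examples/`. For orientation, the snippet below writes a hypothetical LoRA SFT config in LLaMA-Factory's format; the field names are standard LLaMA-Factory options, but the dataset name, template, and hyperparameter values are illustrative assumptions rather than the configs shipped with this repo.

```bash
# Hypothetical LoRA SFT config (values are illustrative; see examples/train_lora/ for the real ones)
cat > examples/train_lora/custom_lora_sft.yaml <<'EOF'
model_name_or_path: Qwen/Qwen2.5-VL-7B-Instruct
stage: sft
do_train: true
finetuning_type: lora
lora_target: all
dataset: mstp_chain1_1s            # must be registered in data/dataset_info.json
template: qwen2_vl
cutoff_len: 4096
output_dir: saves/qwen2.5vl-7b/lora/mstp_chain1_1s
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
bf16: true
EOF

DISABLE_VERSION_CHECK=1 llamafactory-cli train examples/train_lora/custom_lora_sft.yaml
```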
If you find this code useful for your research, please cite:

```bibtex
@misc{zeng2025multiscaletemporalpredictionincremental,
  title         = {Multi-scale Temporal Prediction via Incremental Generation and Multi-agent Collaboration},
  author        = {Zhitao Zeng and Guojian Yuan and Junyuan Mao and Yuxuan Wang and Xiaoshuang Jia and Yueming Jin},
  year          = {2025},
  eprint        = {2509.17429},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2509.17429},
}
```