Haojian Huang1*, Haodong Chen2*, Shengqiong Wu3, Meng Luo3, Jinlan Fu3, Xinya Du4, Hanwang Zhang5, Hao Fei3†
*Equal Contribution, †Corresponding Author
1HKU, 2HKUST, 3NUS, 4UTD, 5NTU
Large Video Models (LVMs) built upon Large Language Models (LLMs) have shown promise in video understanding but often suffer from misalignment with human intuition and video hallucination issues. To address these challenges, we introduce VistaDPO, a novel framework for Video Hierarchical Spatial-Temporal Direct Preference Optimization. VistaDPO enhances text-video preference alignment across three hierarchical levels: i) Instance Level, aligning overall video content with responses; ii) Temporal Level, aligning video temporal semantics with event descriptions; and iii) Perceptive Level, aligning spatial objects with language tokens. Given the lack of datasets for fine-grained video-language preference alignment, we construct VistaDPO-7k, a dataset of 7.2K QA pairs annotated with chosen and rejected responses, along with spatial-temporal grounding information such as timestamps, keyframes, and bounding boxes. Extensive experiments on video hallucination, video QA, and captioning benchmarks demonstrate that VistaDPO significantly improves the performance of existing LVMs, effectively mitigating video-language misalignment and hallucination.
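As an intuition for how preference alignment across the three levels can be combined, below is a minimal, illustrative sketch of a standard DPO objective applied per level and summed with scalar weights. The per-level log-likelihoods, the weights, and the simple summation are assumptions made for illustration only; the actual VistaDPO losses are defined in the paper.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss from policy and frozen-reference log-likelihoods of chosen/rejected responses."""
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

def hierarchical_dpo_loss(levels, weights=(1.0, 1.0, 1.0), beta=0.1):
    """Sum instance-, temporal-, and perceptive-level DPO terms with (hypothetical) scalar weights."""
    return sum(w * dpo_loss(*lvl, beta=beta) for w, lvl in zip(weights, levels))

# Toy usage: three levels, each with (logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected)
# for a batch of 4 preference pairs, filled with random numbers.
levels = [tuple(torch.randn(4) for _ in range(4)) for _ in range(3)]
print(hierarchical_dpo_loss(levels))
```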
- [2025.05.01]: Our VistaDPO has been accepted to ICML'25!
- [2025.04.18]: Released the VistaDPO paper.
- [2025.04.03]: Initialized this GitHub repository and released the training & inference code of VistaDPO on Video-LLaVA.
- Release Paper.
- Release VistaDPO-7K.
- Release VistaDPO model weights.
- Release code of VistaDPO on PLLaVA.
We use our proposed VistaDPO-7k for training, which can be found on HuggingFace. In this repo, we provide a subset of samples in `data` for reference.
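To make the annotation format concrete, below is a hypothetical example of what a single VistaDPO-7k record could look like. The field names (`question`, `chosen`, `rejected`, `timestamps`, `keyframes`, `bboxes`) and the values are illustrative assumptions, not the dataset's actual schema; please refer to the released files on HuggingFace for the exact format.

```python
# Hypothetical VistaDPO-7k record; field names and values are illustrative only.
sample = {
    "video": "videos/_GTwKEPmB-U_5183.mp4",               # source video clip
    "question": "What is the evident theme in the video?",
    "chosen": "A person is skateboarding in a park.",      # preferred (chosen) response
    "rejected": "A person is riding a bicycle indoors.",   # dispreferred (rejected) response
    "timestamps": [3.2, 7.8],                              # start/end of the grounded event (seconds)
    "keyframes": [96, 234],                                # indices of annotated keyframes
    "bboxes": {"234": [110, 45, 320, 290]},                # frame index -> [x1, y1, x2, y2] of the grounded object
}
```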
The evaluation datasets used in our work are listed below:
- Video Hallucination: VideoHallucer, EventHallusion.
- Video QA: MSVD, MSR-VTT, TGIF, ActivityNet, MVBench.
- Video Captioning: VideoChatGPT Bench.
- Clone this repository and navigate to the source folder

```bash
cd VistaDPO
```

- Build Environment

```bash
echo "Creating conda environment"
conda create -n VistaDPO python=3.10
conda activate VistaDPO

echo "Installing dependencies"
pip install -r requirements.txt
```

After installation, you can run inference with VistaDPO on Video-LLaVA as follows:

```python
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path
from inference.inference_utils import ModelInference, decode2frame
import os
import torch
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
video_path = "./data/videos/_GTwKEPmB-U_5183.mp4"
# CACHE_DIR="/data/VistaDPO/cache"
model_path = "./checkpoints/VistaDPO"
model_name = get_model_name_from_path(model_path)
tokenizer, model, processor, context_len = load_pretrained_model(model_path, model_base=None, device=device, model_name=model_name)
inference_model = ModelInference(model=model, tokenizer=tokenizer, processor=processor, context_len=context_len)
# our pipeline: decode the video into frames and run inference on the frame directory
frame_dir, _ = os.path.splitext(video_path)
decode2frame(video_path, frame_dir, verbose=True)
question="What is the evident theme in the video?"
response = inference_model.generate(
question=question,
modal_path=frame_dir,
temperature=0,
)
print(response)
# alternatively, decode the video on the fly with the decord backend
response = inference_model.generate(
question=question,
modal_path=video_path,
temperature=0,
video_decode_backend="decord",
)
print(response)
```

For VistaDPO training, refer to the setup above and run the training script:

```bash
bash dpo_scripts/train_dpo.sh
```

Please consider citing our paper if our code and benchmark are useful:

```bibtex
@article{huang2025vistadpo,
title={VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models},
author={Huang, Haojian and Chen, Haodong and Wu, Shengqiong and Luo, Meng and Fu, Jinlan and Du, Xinya and Zhang, Hanwang and Fei, Hao},
journal={arXiv preprint arXiv:2504.13122},
year={2025}
}
```

Our VistaDPO is developed based on the codebases of VideoLLaVA and PLLaVA, and we would like to thank the developers of both.
For any questions, feel free to email [email protected] or [email protected].
