
[ICML 2025] VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models


HaroldChen19/VistaDPO


Haojian Huang1*, Haodong Chen2*, Shengqiong Wu3, Meng Luo3, Jinlan Fu3, Xinya Du4, Hanwang Zhang5, Hao Fei3†

*Equal Contribution, †Corresponding Author
1HKU, 2HKUST, 3NUS, 4UTD, 5NTU

If you like our project, please give us a star ⭐ on GitHub for the latest updates.

   

Abstract

Large Video Models (LVMs) built upon Large Language Models (LLMs) have shown promise in video understanding but often suffer from misalignment with human intuition and video hallucination issues. To address these challenges, we introduce VistaDPO, a novel framework for Video Hierarchical Spatial-Temporal Direct Preference Optimization. VistaDPO enhances text-video preference alignment across three hierarchical levels: i) Instance Level, aligning overall video content with responses; ii) Temporal Level, aligning video temporal semantics with event descriptions; and iii) Perceptive Level, aligning spatial objects with language tokens. Given the lack of datasets for fine-grained video-language preference alignment, we construct VistaDPO-7k, a dataset of 7.2K QA pairs annotated with chosen and rejected responses, along with spatial-temporal grounding information such as timestamps, keyframes, and bounding boxes. Extensive experiments on benchmarks such as Video Hallucination, Video QA, and Captioning performance tasks demonstrate that VistaDPO significantly improves the performance of existing LVMs, effectively mitigating video-language misalignment and hallucination.
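For orientation, the generic pairwise DPO objective that the method builds on can be sketched as follows. This is the standard DPO loss only, not VistaDPO's hierarchical objective; the function names, the scalar log-probability inputs, and the default `beta=0.1` are illustrative assumptions rather than values taken from this repository.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed default, for illustration only
) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    the policy being trained or under the frozen reference model.
    """
    # Implicit rewards: how much the policy has moved away from the reference.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    # The loss shrinks as the chosen response gains margin over the rejected one.
    return -log_sigmoid(beta * (chosen_reward - rejected_reward))


if __name__ == "__main__":
    # Policy identical to the reference: zero margin, loss equals log(2).
    print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
    # Policy favors the chosen response: loss drops below log(2).
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

When the policy matches the reference the margin is zero and the loss is log 2; raising the chosen response's likelihood relative to the rejected one lowers it.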

🔥 Update

  • [2025.05.01]: VistaDPO is accepted to ICML'25!
  • [2025.04.18]: Released VistaDPO Paper.
  • [2025.04.03]: Initialized this GitHub repository and released the training & inference code of VistaDPO on Video-LLaVA.

🧰 TODO

  • Release Paper.
  • Release VistaDPO-7K.
  • Release VistaDPO model weights.
  • Release code of VistaDPO on PLLaVA.

📝 Data

Training data

We train on our proposed VistaDPO-7k, which is available on HuggingFace. This repo also provides a subset of the data in the data folder for reference.
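The abstract describes each VistaDPO-7k entry as a QA pair with chosen and rejected responses plus spatial-temporal grounding (timestamps, keyframes, bounding boxes). A minimal sketch of how such a record might be represented and sanity-checked is shown below. All field names, the `(x1, y1, x2, y2)` box layout, and the `(start, end)` seconds convention are hypothetical; consult the released dataset for the actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PreferenceSample:
    """Hypothetical in-memory layout for one preference-annotated QA pair."""

    video: str                      # path or id of the source video
    question: str
    chosen: str                     # preferred response
    rejected: str                   # dispreferred (e.g. hallucinated) response
    # Assumed conventions: (start_sec, end_sec), frame indices, (x1, y1, x2, y2)
    timestamps: List[Tuple[float, float]] = field(default_factory=list)
    keyframes: List[int] = field(default_factory=list)
    bboxes: List[Tuple[float, float, float, float]] = field(default_factory=list)

    def validate(self) -> None:
        """Raise ValueError if the record is internally inconsistent."""
        if self.chosen == self.rejected:
            raise ValueError("chosen and rejected responses must differ")
        for start, end in self.timestamps:
            if start < 0 or end < start:
                raise ValueError(f"invalid time span: ({start}, {end})")
        if any(k < 0 for k in self.keyframes):
            raise ValueError("keyframe indices must be non-negative")
        for x1, y1, x2, y2 in self.bboxes:
            if x2 <= x1 or y2 <= y1:
                raise ValueError(f"degenerate box: ({x1}, {y1}, {x2}, {y2})")


if __name__ == "__main__":
    sample = PreferenceSample(
        video="example.mp4",  # placeholder, not a file from the dataset
        question="What is the person doing?",
        chosen="The person is slicing bread.",
        rejected="The person is washing dishes.",
        timestamps=[(0.0, 2.5)],
        keyframes=[3],
        bboxes=[(0.1, 0.1, 0.5, 0.5)],
    )
    sample.validate()
    print("sample is consistent")
```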

Evaluation data

The evaluation datasets used in our work are listed below:

  • Video Hallucination: VideoHallucer, EventHallusion.
  • Video QA: MSVD, MSR-VTT, TGIF, ActivityNet, MVBench.
  • Video Captioning: VideoChatGPT Bench.

🚀 Install

  1. Clone this repository and navigate to the source folder
git clone https://github.com/HaroldChen19/VistaDPO.git
cd VistaDPO
  2. Build the environment
echo "Creating conda environment"
conda create -n VistaDPO python=3.10
conda activate VistaDPO

echo "Installing dependencies"
pip install -r requirements.txt
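After installation, a quick sanity check of the environment can save a failed run later. The helper below is a sketch that is not part of this repository: `environment_report` is a hypothetical name, and the Python 3.10 minimum simply mirrors the conda command above.

```python
import importlib.util
import sys


def environment_report(min_python=(3, 10)):
    """Collect basic facts about the active Python environment."""
    report = {
        "python_version": "%d.%d.%d" % sys.version_info[:3],
        "python_ok": sys.version_info[:2] >= tuple(min_python),
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        try:
            import torch

            report["cuda_available"] = bool(torch.cuda.is_available())
        except Exception:
            # A broken torch install should not crash the check itself.
            report["torch_installed"] = False
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

If `cuda_available` is reported as `False`, the inference snippet in this README falls back to CPU through its `torch.device(...)` line.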

📍 Inference

from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path
from inference.inference_utils import ModelInference, decode2frame
import os
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

video_path = "./data/videos/_GTwKEPmB-U_5183.mp4"

# CACHE_DIR="/data/VistaDPO/cache"

# Path to the VistaDPO checkpoint
model_path = "./checkpoints/VistaDPO"
model_name = get_model_name_from_path(model_path)
tokenizer, model, processor, context_len = load_pretrained_model(
    model_path, model_base=None, device=device, model_name=model_name
)
inference_model = ModelInference(model=model, tokenizer=tokenizer, processor=processor, context_len=context_len)

# Our pipeline: decode the video into frames, then run inference on the frame directory
frame_dir, _ = os.path.splitext(video_path)
decode2frame(video_path, frame_dir, verbose=True)
question = "What is the evident theme in the video?"
response = inference_model.generate(
    question=question,
    modal_path=frame_dir,
    temperature=0,
)
print(response)

# Alternative: pass the video file directly and decode it with decord
response = inference_model.generate(
    question=question,
    modal_path=video_path,
    temperature=0,
    video_decode_backend="decord",
)
print(response)
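The snippet above derives the frame directory by stripping the extension from the video path. To apply the same convention to a folder of videos, a small helper like the one below could be used. It is a sketch: `iter_video_jobs` and the `VIDEO_EXTS` set are hypothetical and not part of this repository.

```python
import os
from typing import Iterator, Tuple

# Assumed set of video extensions; adjust to match your data.
VIDEO_EXTS = {".mp4", ".avi", ".mkv", ".webm"}


def iter_video_jobs(root: str) -> Iterator[Tuple[str, str]]:
    """Yield (video_path, frame_dir) pairs for every video file in `root`.

    frame_dir follows the convention of the snippet above: the video path
    with its extension removed.
    """
    for name in sorted(os.listdir(root)):
        _, ext = os.path.splitext(name)
        if ext.lower() not in VIDEO_EXTS:
            continue
        video_path = os.path.join(root, name)
        frame_dir, _ = os.path.splitext(video_path)
        yield video_path, frame_dir


if __name__ == "__main__":
    # "./data/videos" is the directory used in the snippet above.
    if os.path.isdir("./data/videos"):
        for video_path, frame_dir in iter_video_jobs("./data/videos"):
            print(video_path, "->", frame_dir)
```

Inside the loop, each pair can be passed to `decode2frame(video_path, frame_dir)` and `inference_model.generate(..., modal_path=frame_dir)` exactly as in the single-video example.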

🚩 Training

To train VistaDPO, refer to the setup and training instructions, then run:

bash dpo_scripts/train_dpo.sh

📝 Citation

Please consider citing our paper if you find our code and benchmark useful:

@article{huang2025vistadpo,
  title={VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models},
  author={Huang, Haojian and Chen, Haodong and Wu, Shengqiong and Luo, Meng and Fu, Jinlan and Du, Xinya and Zhang, Hanwang and Fei, Hao},
  journal={arXiv preprint arXiv:2504.13122},
  year={2025}
}

🍗 Acknowledgement

Our VistaDPO is developed based on the codebases of VideoLLaVA and PLLaVA, and we would like to thank the developers of both.

📪 Contact

If you have any questions, feel free to email [email protected] or [email protected].
