Haojian Huang1*, Haodong Chen2*, Shengqiong Wu3, Meng Luo3, Jinlan Fu3, Xinya Du4, Hanwang Zhang5, Hao Fei3†
*Equal Contribution, †Corresponding Author
1HKU, 2HKUST, 3NUS, 4UTD, 5NTU
Large Video Models (LVMs) built upon Large Language Models (LLMs) have shown promise in video understanding but often suffer from misalignment with human intuition and video hallucination issues. To address these challenges, we introduce VistaDPO, a novel framework for Video Hierarchical Spatial-Temporal Direct Preference Optimization. VistaDPO enhances text-video preference alignment across three hierarchical levels: i) Instance Level, aligning overall video content with responses; ii) Temporal Level, aligning video temporal semantics with event descriptions; and iii) Perceptive Level, aligning spatial objects with language tokens. Given the lack of datasets for fine-grained video-language preference alignment, we construct VistaDPO-7k, a dataset of 7.2K QA pairs annotated with chosen and rejected responses, along with spatial-temporal grounding information such as timestamps, keyframes, and bounding boxes. Extensive experiments on video hallucination, video QA, and captioning benchmarks demonstrate that VistaDPO significantly improves the performance of existing LVMs, effectively mitigating video-language misalignment and hallucination.
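As an intuition for how preference alignment across the three levels can be combined, below is a minimal, illustrative sketch of a standard DPO objective applied per level and summed with scalar weights. The per-level log-likelihoods, the weights, and the simple summation are assumptions made for illustration only; the actual VistaDPO losses are defined in the paper.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss from policy and frozen-reference log-likelihoods of chosen/rejected responses."""
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

def hierarchical_dpo_loss(levels, weights=(1.0, 1.0, 1.0), beta=0.1):
    """Sum instance-, temporal-, and perceptive-level DPO terms with (hypothetical) scalar weights."""
    return sum(w * dpo_loss(*lvl, beta=beta) for w, lvl in zip(weights, levels))

# Toy usage: three levels, each with (logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected)
# for a batch of 4 preference pairs, filled with random numbers.
levels = [tuple(torch.randn(4) for _ in range(4)) for _ in range(3)]
print(hierarchical_dpo_loss(levels))
```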
- [2025.05.01]: Our VistaDPO has been accepted to ICML'25!
- [2025.04.18]: Released the VistaDPO paper.
- [2025.04.03]: Initialized this GitHub repository and released the training & inference code of VistaDPO on Video-LLaVA.
- Release Paper.
- Release VistaDPO-7K.
- Release VistaDPO model weights.
- Release code of VistaDPO on PLLaVA.
We use our proposed VistaDPO-7k for training, which can be found on HuggingFace. In this repo, we provide a subset of samples in `data` for reference.
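To make the annotation format concrete, below is a hypothetical example of what a single VistaDPO-7k record could look like. The field names (`question`, `chosen`, `rejected`, `timestamps`, `keyframes`, `bboxes`) and the values are illustrative assumptions, not the dataset's actual schema; please refer to the released files on HuggingFace for the exact format.

```python
# Hypothetical VistaDPO-7k record; field names and values are illustrative only.
sample = {
    "video": "videos/_GTwKEPmB-U_5183.mp4",               # source video clip
    "question": "What is the evident theme in the video?",
    "chosen": "A person is skateboarding in a park.",      # preferred (chosen) response
    "rejected": "A person is riding a bicycle indoors.",   # dispreferred (rejected) response
    "timestamps": [3.2, 7.8],                              # start/end of the grounded event (seconds)
    "keyframes": [96, 234],                                # indices of annotated keyframes
    "bboxes": {"234": [110, 45, 320, 290]},                # frame index -> [x1, y1, x2, y2] of the grounded object
}
```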
The evaluation datasets used in our work are listed below:
- Video Hallucination: VideoHallucer, EventHallusion.
- Video QA: MSVD, MSR-VTT, TGIF, ActivityNet, MVBench.
- Video Captioning: VideoChatGPT Bench.
- Clone this repository and navigate to the source folder

```bash
cd VistaDPO
```

- Build Environment

```bash
echo "Creating conda environment"
conda create -n VistaDPO python=3.10
conda activate VistaDPO

echo "Installing dependencies"
pip install -r requirements.txt
```

After installation, you can run inference with VistaDPO on Video-LLaVA as follows:

```python
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path
from inference.inference_utils import ModelInference, decode2frame
import os
import torch
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
video_path = "./data/videos/_GTwKEPmB-U_5183.mp4"
# CACHE_DIR="/data/VistaDPO/cache"
model_path = "./checkpoints/VistaDPO"
model_name = get_model_name_from_path(model_path)
tokenizer, model, processor, context_len = load_pretrained_model(model_path, model_base=None, device=device, model_name=model_name)
inference_model = ModelInference(model=model, tokenizer=tokenizer, processor=processor, context_len=context_len)
# our pipeline: decode the video into frames and run inference on the frame directory
frame_dir, _ = os.path.splitext(video_path)
decode2frame(video_path, frame_dir, verbose=True)
question="What is the evident theme in the video?"
response = inference_model.generate(
question=question,
modal_path=frame_dir,
temperature=0,
)
print(response)
# alternatively, decode the video on the fly with the decord backend
response = inference_model.generate(
question=question,
modal_path=video_path,
temperature=0,
video_decode_backend="decord",
)
print(response)
```

For VistaDPO training, refer to the setup above and run the training script:

```bash
bash dpo_scripts/train_dpo.sh
```

Please consider citing our paper if our code and benchmark are useful:

```bibtex
@article{huang2025vistadpo,
title={VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models},
author={Huang, Haojian and Chen, Haodong and Wu, Shengqiong and Luo, Meng and Fu, Jinlan and Du, Xinya and Zhang, Hanwang and Fei, Hao},
journal={arXiv preprint arXiv:2504.13122},
year={2025}
}
```

Our VistaDPO is developed based on the codebases of VideoLLaVA and PLLaVA, and we would like to thank the developers of both.
For any questions, feel free to email [email protected] or [email protected].
