This repository provides the official PyTorch implementation of "Universal Video Temporal Grounding with Generative Multi-modal Large Language Models" (NeurIPS 2025).
🌐 Project Page
- [2025.10] Released the code for data construction, training, and evaluation.
- [2025.09] UniTime accepted to NeurIPS 2025!
- [2025.06] Released the inference code.
- [2025.06] Preprint available on arXiv.
```bash
conda create -n UniTime python=3.10
conda activate UniTime
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
```
Download Model Checkpoints
- Obtain the pretrained checkpoints from Qwen2-VL-7B and UniTime.
- Set `model_local_path` to your local path for Qwen2-VL-7B, and `model_finetune_path` to your UniTime checkpoint.
Prepare Input Data
- Create a JSON file for inference as `data/test.json`, and specify its path via the `data_path` argument.
Run Inference
- Execute the following command to perform inference. The output results will be saved in the `results/` directory.

```bash
export CUDA_VISIBLE_DEVICES=0
python inference.py --model_local_path path_to_qwen2vl7B \
    --model_finetune_path ckpt/unitime \
    --data_path data/test.json \
    --output_dir ./results/test \
    --nf_short 128
```
- Download the video and annotation files for each dataset from the corresponding source links.
- Create the input file following the format below:

```json
[
    {
        "qid": 0,
        "id": "3MSZA",
        "annos": [
            {
                "query": "person turn a light on.",
                "window": [[24.3, 30.4]]
            }
        ],
        "duration": 30.96,
        "video_path": "./videos/3MSZA.mp4",
        "mode": "mr"
    }
]
```

  Example construction code for Ego4D-NLQ can be found in `datasets/data_ego4d.py` (see the `load_data_to_dict()` function). Modify it as needed for other datasets.
- (Optional) You may also download preprocessed annotations for each dataset from UniTime-Data.
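The annotation format above can also be generated programmatically. Below is a minimal sketch that builds one entry with the field values from the example and writes it out; the output path and the video path are placeholders to replace with your own, and it is not the repository's own construction code (see `datasets/data_ego4d.py` for that).

```python
import json

# One annotation entry, following the format shown above.
# Field values are taken from the Charades-style example; replace
# "video_path" and the output filename with your own paths.
entry = {
    "qid": 0,
    "id": "3MSZA",
    "annos": [
        {"query": "person turn a light on.", "window": [[24.3, 30.4]]}
    ],
    "duration": 30.96,
    "video_path": "./videos/3MSZA.mp4",
    "mode": "mr",  # task mode string as in the example above
}

# The input file is a JSON list of such entries.
with open("test.json", "w") as f:
    json.dump([entry], f, indent=2)
```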
Execute the following commands in sequence:

```bash
# Feature Extraction
bash scripts/feature.sh
# Training
bash scripts/train.sh
# Evaluation
bash scripts/eval.sh
# Metrics
python eval_metrics.py --res ./results/RUN_NAME/results.json
```

Note: Modify the arguments marked with `ToModify` in the code according to the following definitions:
| Argument | Description |
|---|---|
| `path_to_qwen2vl7B` | Path to the Qwen2-VL-7B model directory |
| `path_to_feature_root` | Root directory containing features for all datasets |
| `path_to_video_root` | Root directory containing all video files |
| `path_to_train_data` | Path to the training-set annotation file generated by `datasets/data_ego4d.py` |
| `path_to_val_data` | Path to the validation-set annotation file generated by `datasets/data_ego4d.py` |
| `path_to_test_data` | Path to the test-set annotation file generated by `datasets/data_ego4d.py` |
| `path_to_feature_folder` | Subfolder under `path_to_feature_root` for a specific dataset |
| `RUN_NAME` | Experiment identifier/name for this training run |
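Temporal grounding is typically scored by recall at temporal-IoU thresholds. The exact internals of `eval_metrics.py` are not reproduced here; the sketch below only illustrates the underlying IoU computation, with illustrative function names and thresholds.

```python
# Sketch of temporal IoU and recall@IoU, the standard metric family for
# video temporal grounding. Function names and the 0.5 threshold are
# illustrative, not necessarily what eval_metrics.py uses internally.

def temporal_iou(pred, gt):
    """IoU between two [start, end] windows, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_iou(preds, gts, threshold=0.5):
    """Fraction of queries whose top-1 prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)

preds = [[24.0, 31.0], [5.0, 9.0]]
gts = [[24.3, 30.4], [50.0, 60.0]]
print(recall_at_iou(preds, gts, threshold=0.5))  # 0.5: first window hits, second misses
```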
If you use this code and data for your research or project, please cite:

```bibtex
@inproceedings{unitime2025,
  title={Universal Video Temporal Grounding with Generative Multi-modal Large Language Models},
  author={Li, Zeqian and Di, Shangzhe and Zhai, Zhonghua and Huang, Weilin and Wang, Yanfeng and Xie, Weidi},
  booktitle={NeurIPS},
  year={2025}
}
```
This project builds upon several excellent open-source efforts:
For questions, please contact: [email protected].
