Our paper has been accepted to ICCV 2025. The latest version of the paper is available here: link.
- Clone this repository and navigate to the UV-CoT folder, or download the code.

```bash
git clone https://github.com/UV-CoT
cd UV-CoT
```

- Install the package
```bash
conda create -n uv-cot python=3.10 -y
conda activate uv-cot
pip install -e .
```

- Install the required spaCy model
```bash
wget https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.7.3/en_core_web_trf-3.7.3.tar.gz
pip install en_core_web_trf-3.7.3.tar.gz
```
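To confirm the spaCy model installed correctly, you can run a quick optional check. This is only a minimal sketch that loads the model and parses one sentence:

```python
import spacy

# Load the transformer-based English pipeline installed above.
nlp = spacy.load("en_core_web_trf")

# Parse a short sentence and print its noun chunks as a sanity check.
doc = nlp("A dog is chasing a red ball in the park.")
print([chunk.text for chunk in doc.noun_chunks])
```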
- Environment Setup

Please download the fine-tuned Llama3 8B models: the split model and the question transformation model, and store them in the ./models/llama3_split and ./models/llama3_changeq folders, respectively.
- Model Feedback
The following script demonstrates using the LLaVA-v1.5-7b model to generate candidate answers and the OmniLMM 12B model to provide feedback.
```bash
mkdir ./results
bash ./script/data_gen/run_data_pipeline_llava15_omni.sh
```

If you want to evaluate according to the final answers, please refer to:

```bash
bash ./script/data_gen/run_data_pipeline_llava15_omni_next.sh
```

If you have multi-step CoT, please refer to:

```bash
bash ./script/data_gen/run_data_pipeline_llava15_omni_divide.sh
```

If you want to use the self-evaluated method, please refer to:

```bash
bash ./script/data_gen/run_data_pipeline_llava15_self_evaluated.sh
```
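After a pipeline run finishes, you may want a quick count of how many preference records were produced. The sketch below simply counts non-empty lines in any .jsonl files it finds; the assumption that the pipeline writes its jsonl output under ./results is ours, so adjust the path to wherever your run stores its output.

```python
import glob

# Assumption: the data pipeline writes .jsonl files somewhere under ./results.
for path in sorted(glob.glob("./results/**/*.jsonl", recursive=True)):
    with open(path) as f:
        num_records = sum(1 for line in f if line.strip())
    print(f"{path}: {num_records} records")
```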
- A Toy Example

We provide a toy example in the cot_one folder. Process your instruction set into the same format before generating the preference data.
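To compare your instruction set against the toy example, one option is to print the fields of its first record. This is a minimal sketch: the file name toy_example.jsonl is a placeholder (point it at the actual file inside cot_one), and it assumes the data is stored as jsonl.

```python
import json

# Placeholder path: point this at the actual file inside the cot_one folder.
path = "./cot_one/toy_example.jsonl"

with open(path) as f:
    first_record = json.loads(next(f))

# Print each field name with a truncated preview of its value.
for key, value in first_record.items():
    print(f"{key}: {str(value)[:80]}")
```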
- Prepare data

Please download the following datasets:

- GQA: images
- TextVQA: train_val_images
- Visual7W: repo
- Flickr30k: homepage
- DocVQA: homepage
- InfographicsVQA: homepage
- VSR: images
- DUDE: images
- SROIE: homepage
- V* Bench: homepage
After downloading all of them, organize the data in ./playground/data as follows:
```
├── coco
│   ├── train2017
│   └── train2014
├── gqa
│   └── images
├── ocr_vqa
│   └── images
├── textvqa
│   └── train_images
├── v7w
│   └── images
├── flickr30k
│   └── images
└── cot
    ├── flickr30k
    ├── docvqa
    ├── gqa
    ├── infographicsvqa
    ├── textvqa
    ├── vsr
    ├── dude
    ├── sroie
    └── vstar
```
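As a quick sanity check before training, you can verify that the expected folders exist. This is only a convenience sketch based on the layout above:

```python
import os

# Expected sub-folders under ./playground/data, following the layout above.
expected = [
    "coco/train2017", "coco/train2014", "gqa/images", "ocr_vqa/images",
    "textvqa/train_images", "v7w/images", "flickr30k/images",
    "cot/flickr30k", "cot/docvqa", "cot/gqa", "cot/infographicsvqa",
    "cot/textvqa", "cot/vsr", "cot/dude", "cot/sroie", "cot/vstar",
]

root = "./playground/data"
missing = [d for d in expected if not os.path.isdir(os.path.join(root, d))]
print("All folders found." if not missing else f"Missing folders: {missing}")
```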
- Training
Here, we provide a training script to train the model for one iteration. The max_step parameter should be adjusted according to the amount of your data.
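As a rough guide for setting max_step, you can derive it from your dataset size and the effective batch size. The numbers below (per-device batch size, gradient accumulation, GPU count, sample count) are placeholders rather than the values used in the paper; substitute the settings from your own run and training script.

```python
# Placeholder values; substitute the settings from your run and training script.
num_samples = 20000          # number of preference pairs in your generated data
per_device_batch_size = 8
grad_accum_steps = 2
num_gpus = 4
epochs = 4                   # the iterative-alignment recipe below uses 4 epochs per iteration

effective_batch = per_device_batch_size * grad_accum_steps * num_gpus
steps_per_epoch = num_samples // effective_batch
max_step = steps_per_epoch * epochs          # value for --max_step
save_steps = steps_per_epoch // 4            # value for --save_steps (every 1/4 epoch)

print(f"max_step={max_step}, save_steps={save_steps}")
```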
Run the following command to start full fine-tuning.

```bash
bash ./script/train/llava15_train.sh
```

- Iterative alignment
To reproduce the iterative training process in the paper, repeat the following steps 4 times:
- S1. Data generation.

  Follow the instructions in Preference Data Curation to generate preference pairs for the base model. Convert the generated jsonl file to a Hugging Face parquet file (see the conversion sketch after this list).

- S2. Change training config.

  In the dataset code, replace data_path here with your data path.
  In the training script, replace --data_dir with a new directory, replace --model_name_or_path with the base model path, set --max_step to the number of steps for 4 epochs, and set --save_steps to the number of steps for 1/4 epoch.

- S3. Do DPO training.

  Run the training script to train the base model.

- S4. Choose the base model for the next iteration.
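For the jsonl-to-parquet conversion in S1, one straightforward option is the Hugging Face datasets library. This is a minimal sketch, and the file paths are placeholders:

```python
from datasets import load_dataset

# Placeholder paths: point these at your generated preference data.
jsonl_path = "./results/preference_pairs.jsonl"
parquet_path = "./results/preference_pairs.parquet"

# Read the jsonl file as a Hugging Face dataset, then write it out as parquet.
dataset = load_dataset("json", data_files=jsonl_path, split="train")
dataset.to_parquet(parquet_path)
print(dataset)
```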
- Inference on both training datasets and zero-shot datasets

UV-CoT can be changed to other model names saved in ./checkpoints/.

```bash
bash scripts/v1_5/eval/cot_benchmark.sh UV-CoT
```

- Inference for the ablation study

```bash
bash scripts/v1_5/eval/cot_benchmark_ablations.sh UV-CoT
```

- Obtain the score using GPT-4o; the API key needs to be set in llava/eval/eval_cot_score.py.

```bash
bash scripts/v1_5/eval/cot_score.sh UV-CoT
```

If our work assists your research, feel free to give us a star ⭐ or cite us using:
```bibtex
@misc{zhao2025unsupervisedvisualchainofthoughtreasoning,
      title={Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization},
      author={Kesen Zhao and Beier Zhu and Qianru Sun and Hanwang Zhang},
      year={2025},
      eprint={2504.18397},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.18397},
}
```