Our paper has been accepted to ICCV 2025. The latest version of the paper is available here: link.
- Clone this repository and navigate to the UV-CoT folder, or download the code.

```bash
git clone https://github.com/UV-CoT
cd UV-CoT
```

- Install the package
```bash
conda create -n uv-cot python=3.10 -y
conda activate uv-cot
pip install -e .
```

- Install the required spaCy model
```bash
wget https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.7.3/en_core_web_trf-3.7.3.tar.gz
pip install en_core_web_trf-3.7.3.tar.gz
```
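To confirm the spaCy model installed correctly, you can run a quick optional check. This is only a minimal sketch that loads the model and parses one sentence:

```python
import spacy

# Load the transformer-based English pipeline installed above.
nlp = spacy.load("en_core_web_trf")

# Parse a short sentence and print its noun chunks as a sanity check.
doc = nlp("A dog is chasing a red ball in the park.")
print([chunk.text for chunk in doc.noun_chunks])
```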
- Environment Setup

Please download the fine-tuned Llama3 8B models: the split model and the question transformation model, and store them in the ./models/llama3_split and ./models/llama3_changeq folders, respectively.
- Model Feedback
The following script demonstrates using the LLaVA-v1.5-7b model to generate candidate answers and the OmniLMM 12B model to provide feedback.
```bash
mkdir ./results
bash ./script/data_gen/run_data_pipeline_llava15_omni.sh
```

If you want to evaluate according to the final answers, please refer to:

```bash
bash ./script/data_gen/run_data_pipeline_llava15_omni_next.sh
```

If you have multi-step CoT, please refer to:

```bash
bash ./script/data_gen/run_data_pipeline_llava15_omni_divide.sh
```

If you want to use the self-evaluated method, please refer to:

```bash
bash ./script/data_gen/run_data_pipeline_llava15_self_evaluated.sh
```
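After a pipeline run finishes, you may want a quick count of how many preference records were produced. The sketch below simply counts non-empty lines in any .jsonl files it finds; the assumption that the pipeline writes its jsonl output under ./results is ours, so adjust the path to wherever your run stores its output.

```python
import glob

# Assumption: the data pipeline writes .jsonl files somewhere under ./results.
for path in sorted(glob.glob("./results/**/*.jsonl", recursive=True)):
    with open(path) as f:
        num_records = sum(1 for line in f if line.strip())
    print(f"{path}: {num_records} records")
```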
- A Toy Example

We provide a toy example in the cot_one folder. Process your instruction set into the same format before generating the preference data.
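To compare your instruction set against the toy example, one option is to print the fields of its first record. This is a minimal sketch: the file name toy_example.jsonl is a placeholder (point it at the actual file inside cot_one), and it assumes the data is stored as jsonl.

```python
import json

# Placeholder path: point this at the actual file inside the cot_one folder.
path = "./cot_one/toy_example.jsonl"

with open(path) as f:
    first_record = json.loads(next(f))

# Print each field name with a truncated preview of its value.
for key, value in first_record.items():
    print(f"{key}: {str(value)[:80]}")
```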
- Prepare data

Please download the following datasets:

- GQA: images
- TextVQA: train_val_images
- Visual7W: repo
- Flickr30k: homepage
- DocVQA: homepage
- InfographicsVQA: homepage
- VSR: images
- DUDE: images
- SROIE: homepage
- V* Bench: homepage
After downloading all of them, organize the data in ./playground/data as follows:
```
├── coco
│   ├── train2017
│   └── train2014
├── gqa
│   └── images
├── ocr_vqa
│   └── images
├── textvqa
│   └── train_images
├── v7w
│   └── images
├── flickr30k
│   └── images
└── cot
    ├── flickr30k
    ├── docvqa
    ├── gqa
    ├── infographicsvqa
    ├── textvqa
    ├── vsr
    ├── dude
    ├── sroie
    └── vstar
```
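As a quick sanity check before training, you can verify that the expected folders exist. This is only a convenience sketch based on the layout above:

```python
import os

# Expected sub-folders under ./playground/data, following the layout above.
expected = [
    "coco/train2017", "coco/train2014", "gqa/images", "ocr_vqa/images",
    "textvqa/train_images", "v7w/images", "flickr30k/images",
    "cot/flickr30k", "cot/docvqa", "cot/gqa", "cot/infographicsvqa",
    "cot/textvqa", "cot/vsr", "cot/dude", "cot/sroie", "cot/vstar",
]

root = "./playground/data"
missing = [d for d in expected if not os.path.isdir(os.path.join(root, d))]
print("All folders found." if not missing else f"Missing folders: {missing}")
```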
- Training
Here, we provide a training script to train the model for one iteration. The max_step parameter should be adjusted according to the amount of your data.
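As a rough guide for setting max_step, you can derive it from your dataset size and the effective batch size. The numbers below (per-device batch size, gradient accumulation, GPU count, sample count) are placeholders rather than the values used in the paper; substitute the settings from your own run and training script.

```python
# Placeholder values; substitute the settings from your run and training script.
num_samples = 20000          # number of preference pairs in your generated data
per_device_batch_size = 8
grad_accum_steps = 2
num_gpus = 4
epochs = 4                   # the iterative-alignment recipe below uses 4 epochs per iteration

effective_batch = per_device_batch_size * grad_accum_steps * num_gpus
steps_per_epoch = num_samples // effective_batch
max_step = steps_per_epoch * epochs          # value for --max_step
save_steps = steps_per_epoch // 4            # value for --save_steps (every 1/4 epoch)

print(f"max_step={max_step}, save_steps={save_steps}")
```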
Run the following command to start full fine-tuning.

```bash
bash ./script/train/llava15_train.sh
```

- Iterative alignment
To reproduce the iterative training process in the paper, repeat the following steps 4 times:
- S1. Data generation.

  Follow the instructions in Preference Data Curation to generate preference pairs for the base model. Convert the generated jsonl file to a Hugging Face parquet file (see the conversion sketch after this list).

- S2. Change training config.

  In the dataset code, replace data_path here with your data path.
  In the training script, replace --data_dir with a new directory, replace --model_name_or_path with the base model path, set --max_step to the number of steps for 4 epochs, and set --save_steps to the number of steps for 1/4 epoch.

- S3. Do DPO training.

  Run the training script to train the base model.

- S4. Choose the base model for the next iteration.
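For the jsonl-to-parquet conversion in S1, one straightforward option is the Hugging Face datasets library. This is a minimal sketch, and the file paths are placeholders:

```python
from datasets import load_dataset

# Placeholder paths: point these at your generated preference data.
jsonl_path = "./results/preference_pairs.jsonl"
parquet_path = "./results/preference_pairs.parquet"

# Read the jsonl file as a Hugging Face dataset, then write it out as parquet.
dataset = load_dataset("json", data_files=jsonl_path, split="train")
dataset.to_parquet(parquet_path)
print(dataset)
```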
- Inference on both training datasets and zero-shot datasets

UV-CoT can be changed to other model names saved in ./checkpoints/.

```bash
bash scripts/v1_5/eval/cot_benchmark.sh UV-CoT
```

- Inference for the ablation study

```bash
bash scripts/v1_5/eval/cot_benchmark_ablations.sh UV-CoT
```

- Obtain the score using GPT-4o; the API key needs to be set in llava/eval/eval_cot_score.py.

```bash
bash scripts/v1_5/eval/cot_score.sh UV-CoT
```

If our work assists your research, feel free to give us a star ⭐ or cite us using:
```bibtex
@misc{zhao2025unsupervisedvisualchainofthoughtreasoning,
      title={Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization},
      author={Kesen Zhao and Beier Zhu and Qianru Sun and Hanwang Zhang},
      year={2025},
      eprint={2504.18397},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.18397},
}
```