PreSel: Pre-Instruction Data Selection
for Visual Instruction Tuning

🌟 CVPR 2025 Highlight Paper 🌟

Bardia Safaei · Faizan Siddiqui · Jiacong Xu · Vishal M. Patel · Shao-Yuan Lo

Johns Hopkins University, Honda Research Institute USA


Release Notes

  • [06/08/2025]: 🔥 The PreSel codebase is released. The 15% selected data and the models fine-tuned on it can be downloaded now.

Contents

  • Installation
  • Dataset Preparation
  • Usage
  • Running on the Vision-FLAN Dataset
  • Finetuned Models & Selected Data (15%)
  • Evaluation
  • Citation

Installation

1. Prepare the Environment

Please first install LLaVA:

cd PreSel
git clone https://github.com/haotian-liu/LLaVA.git

Then prepare the environment for LLaVA here.

Dataset Preparation

1. Download the Datasets

LLaVA-665K Dataset

For the LLaVA dataset, please download the LLaVA-665K dataset following the instructions in the LLaVA GitHub repository. This dataset is used for visual instruction tuning and contains a diverse set of vision-language examples.

Vision-FLAN Dataset

For the Vision-FLAN dataset, please download the data from the Vision-FLAN website. This dataset provides a comprehensive collection of vision-language tasks for instruction tuning.

After downloading the datasets, please place all data files in the /datasets directory.

2. Preprocess the Dataset

We first add a unique index to each instruction in the original dataset so that every sample can be identified:

python data_process/preprocess.py \
    --raw_annotation_path datasets/your_dataset.json \
    --new_annotation_save_path datasets/processed_dataset.json

This script adds a unique identifier (unique_idx) to each sample in your dataset, which the data selection process relies on. The processed dataset is saved to the specified path; the rest of the code expects the JSON files that include this unique_idx.

Please note that, as stated in the paper, we remove the text-only instructions from the LLaVA-1.5 dataset, since our method focuses on selecting images. You can either remove them yourself or use the already-processed JSON file here.
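
For reference, the preprocessing amounts to roughly the following sketch. This is an illustration of the idea rather than the repo's preprocess.py, and it assumes the LLaVA-style annotation format: a JSON list of dicts in which image-based samples carry an "image" key.

import json

def preprocess(raw_annotation_path, new_annotation_save_path, drop_text_only=True):
    # Load the annotation list (assumed LLaVA-style format).
    with open(raw_annotation_path) as f:
        samples = json.load(f)

    if drop_text_only:
        # The paper drops text-only instructions for LLaVA-1.5, since PreSel selects images.
        samples = [s for s in samples if "image" in s]

    # Attach a unique index so every instruction can be tracked in later steps.
    for idx, sample in enumerate(samples):
        sample["unique_idx"] = idx

    with open(new_annotation_save_path, "w") as f:
        json.dump(samples, f)

preprocess("datasets/your_dataset.json", "datasets/processed_dataset.json")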

3. Task Splits

For our method, we need to split the dataset into different tasks. We provide the task splits used in our experiments:

Place the downloaded and unzipped task split files in the data/ directory.

4. Reference Model Training

To estimate task importance values, we need a reference model trained on a small randomly selected reference dataset. You have two options:

Option 1: Use Our Pre-selected Reference Datasets

For LLaVA-1.5 and Vision-FLAN datasets, you can directly use our randomly selected reference datasets (5% of images and their corresponding instructions from each task):

  • LLaVA-1.5 reference data (randomly selected 5% images with instructions): Download JSON
  • Vision-FLAN reference data (randomly selected 5% images with instructions): Download JSON

Place the downloaded JSON files in the data/ directory.

Option 2: Create Your Own Reference Dataset

For custom datasets, you'll need to create a reference dataset by randomly sampling 5% of images along with their corresponding instructions from each task.
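
A minimal sketch of such per-task 5% sampling, assuming each task's samples are available as a list of annotation dicts (for example, loaded from the task split files above):

import json
import random

def sample_reference_set(task_splits, fraction=0.05, seed=0):
    # task_splits: dict mapping task name -> list of annotation dicts.
    rng = random.Random(seed)
    reference = []
    for task_name, samples in task_splits.items():
        if not samples:
            continue
        # Take 5% of the images (and their instructions) from every task.
        k = max(1, int(len(samples) * fraction))
        reference.extend(rng.sample(samples, k))
    return reference

# Example usage (file names are placeholders):
# task_splits = {name: json.load(open(path)) for name, path in task_split_paths.items()}
# json.dump(sample_reference_set(task_splits), open("data/my_reference_5pct.json", "w"))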

After preparing the reference dataset, fine-tune a LLaVA-7B model on it to obtain the reference model. For this step:

Fine-tune the LLaVA-7B model (available on Hugging Face) with LoRA, following the training script provided here.

This reference model will be used in later steps to estimate task-importance values.

Usage

1. Loss/Perplexity Calculations

First, process the reference data to remove the question parts of the instructions:

python data_process/remove_instruction.py \
    --input_path /data/round1_665k_notext.json \
    --output_path /data/round1_665k_notext_img_token.json

This will create a new file (/data/round1_665k_notext_img_token.json).
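
Conceptually, this step replaces the question text in every human turn with the image token alone, so the later loss computation conditions only on the image. A hedged sketch, assuming the LLaVA-style "conversations" format (the actual remove_instruction.py may handle the format differently):

import json

def strip_questions(input_path, output_path, image_token="<image>"):
    with open(input_path) as f:
        samples = json.load(f)
    for sample in samples:
        # Assumes LLaVA-style "conversations": alternating human/gpt turns.
        for turn in sample.get("conversations", []):
            if turn["from"] == "human":
                # Keep only the image placeholder; drop the question text.
                turn["value"] = image_token if image_token in turn["value"] else ""
    with open(output_path, "w") as f:
        json.dump(samples, f)

strip_questions("/data/round1_665k_notext.json", "/data/round1_665k_notext_img_token.json")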


Then run the loss/perplexity calculation twice, once on each of the two files:

python presel/loss_ppl_calc.py \
    --data_path /data/round1_665k_notext.json \
    --model_path /PATH/TO/REFERENCE_MODEL \
    --image_folder /datasets \
    --output_file /data/loss_ppl_round1_665k_notext.json
python presel/loss_ppl_calc.py \
    --data_path /data/round1_665k_notext_img_token.json \
    --model_path /PATH/TO/REFERENCE_MODEL \
    --image_folder /datasets \
    --output_file /data/loss_ppl_round1_665k_notext_img_token.json
  • Replace /PATH/TO/REFERENCE_MODEL with the path to your reference model checkpoint.
  • Adjust --image_folder and --output_file as needed for your setup.
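
In the usual convention (assumed here for the script's output), perplexity is simply the exponential of the mean token-level loss, so either quantity carries the same information:

import math

def perplexity_from_mean_loss(mean_token_loss):
    # Perplexity = exp(mean token-level cross-entropy loss).
    return math.exp(mean_token_loss)

print(perplexity_from_mean_loss(2.0))  # ~7.39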

2. Task Importance Estimation

Run the following to get the estimated task-importance values required for our data selection approach:

python presel/llava_task_importance.py \
    --data_w_path /data/loss_ppl_round1_665k_notext.json \
    --data_wo_path /data/loss_ppl_round1_665k_notext_img_token.json \
    --reference_data_path /data/round1_665k_notext.json \
    --task_files_dir /data \
    --output_dir /data
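
As a purely illustrative reading of these inputs (the exact formula is in presel/llava_task_importance.py and the paper), a task's importance can be thought of as an aggregate of the gap between the reference model's loss with the full instruction and its loss with the image token only, normalized across tasks:

def toy_task_importance(loss_with, loss_without, task_to_indices):
    # Illustrative only; not the paper's exact formula.
    # loss_with / loss_without: dict mapping unique_idx -> mean token loss.
    # task_to_indices: dict mapping task name -> list of unique_idx values.
    scores = {}
    for task, indices in task_to_indices.items():
        gaps = [loss_without[i] - loss_with[i]
                for i in indices if i in loss_with and i in loss_without]
        scores[task] = sum(gaps) / max(len(gaps), 1)
    total = sum(scores.values()) or 1.0
    return {task: s / total for task, s in scores.items()}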

3. Pre-Instruction Data Selection

First, we extract visual features with the DINOv2 model for each task (tasks 1 to 10 for the LLaVA dataset):

python data_process/extract_feats_665_dino.py --task_num TASK_NUM

Then run k-means clustering and sample selection:

python data_process/kmeans_clust.py --method typical
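
For orientation, the generic pattern behind this step is to cluster each task's DINOv2 features and prefer samples close to the cluster centres as the "typical" ones. Below is a sketch using scikit-learn; the actual kmeans_clust.py, its cluster counts, and its exact typicality criterion may differ:

import numpy as np
import torch
from sklearn.cluster import KMeans

def select_typical(feats_path, num_clusters=100, per_cluster=5, seed=0):
    # Features saved by the extraction step: assumed to be an (N, D) tensor of DINOv2 embeddings.
    feats = torch.load(feats_path).float().numpy()
    km = KMeans(n_clusters=num_clusters, n_init=10, random_state=seed).fit(feats)
    selected = []
    for c in range(num_clusters):
        members = np.where(km.labels_ == c)[0]
        if len(members) == 0:
            continue
        # Distance of each member to its cluster centre; closest = most "typical".
        dists = np.linalg.norm(feats[members] - km.cluster_centers_[c], axis=1)
        selected.extend(members[np.argsort(dists)[:per_cluster]].tolist())
    return selected  # row indices into the feature matrix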

Finally, run the following command to fine-tune the model on the selected data. Make sure to set the BASE_DIR value appropriately. This code implements multi-round training, where each round has a budget of 5% of the total data. Note that the results reported in the main paper correspond to round 3 (a 15% budget).

python presel/data_selection.py \
    --base_dir BASE_DIR \
    --method presel \
    --dataset_type llava
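
As a small illustration of the 5%-per-round budget mentioned above, a proportional per-task split based on the estimated task-importance weights could look like the following (the repo's data_selection.py handles the actual allocation, which may differ):

def split_round_budget(num_images_total, task_importance, round_fraction=0.05):
    # One round's budget is 5% of all images; here it is split across tasks
    # in proportion to the estimated task-importance weights.
    budget = int(round_fraction * num_images_total)
    return {task: int(round(budget * w)) for task, w in task_importance.items()}

# After three rounds the cumulative budget is 3 x 5% = 15% of the data,
# which is the setting reported in the main paper.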

Running on the Vision-FLAN Dataset

For the Vision-FLAN dataset, the steps are similar to those for the LLaVA-1.5 dataset mentioned above. For "Loss/Perplexity Calculations", you can follow the same steps, but make sure to adjust the code to match the Vision-FLAN data format (e.g., JSON files, reference set, image folder, etc.).

For "Task Importance Estimation", you can directly download the estimated task importance values here and place it in /data directory.

For "Pre-Instruction Data Selection", first use the same script, data_process/extract_feats_665_dino.py, to extract VF features. Save the output as /data/dino_feats_vf/dino_feats_all_vf.pt. Then, run

python data_process/kmeans_clust_vf.py --method typical

Finally, run the following command to fine-tune the model on the selected Vision-FLAN data:

python presel/data_selection.py \
    --base_dir BASE_DIR \
    --method presel \
    --dataset_type vision_flan \
    --file_path /datasets/annotation_191-task_1k_add_idx.json

Finetuned Models & Selected Data (15%)

You can find the 15% data subset selected by PreSel, as well as the models fine-tuned on it, here:

Dataset        15% Selected Data by PreSel (JSON)    Fine-tuned LLaVA-7B Model
LLaVA-1.5      Download                              Download
Vision-FLAN    Download                              Download

Evaluation

Please follow the original LLaVA page and VLMEvalKit to evaluate models.

Citation

If you find this codebase useful for your research, please cite our paper:

@inproceedings{safaei2025filter,
  title={Filter images first, generate instructions later: Pre-instruction data selection for visual instruction tuning},
  author={Safaei, Bardia and Siddiqui, Faizan and Xu, Jiacong and Patel, Vishal M and Lo, Shao-Yuan},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={14247--14256},
  year={2025}
}
