PreSel: Pre-Instruction Data Selection
for Visual Instruction Tuning

🌟 CVPR 2025 Highlight Paper 🌟

Bardia Safaei · Faizan Siddiqui · Jiacong Xu · Vishal M. Patel · Shao-Yuan Lo

Johns Hopkins University, Honda Research Institute USA


Release Notes

  • [06/08/2025]: 🔥 The PreSel codebase is released. The 15% selected data and the models fine-tuned on it can be downloaded now.

Contents

  • Installation
  • Dataset Preparation
  • Usage
  • Running on the Vision-FLAN Dataset
  • Finetuned Models & Selected Data (15%)
  • Evaluation
  • Citation

Installation

1. Prepare the Environment

Please first install LLaVA:

cd PreSel
git clone https://github.com/haotian-liu/LLaVA.git

Then prepare the environment for LLaVA here.

Dataset Preparation

1. Download the Datasets

LLaVA-665K Dataset

For the LLaVA dataset, please download the LLaVA-665K dataset following the instructions in the LLaVA GitHub repository. This dataset is used for visual instruction tuning and contains a diverse set of vision-language examples.

Vision-FLAN Dataset

For the Vision-FLAN dataset, please download the data from the Vision-FLAN website. This dataset provides a comprehensive collection of vision-language tasks for instruction tuning.

After downloading the datasets, please place all data files in the /datasets directory.

2. Preprocess the Dataset

We first add a unique index to each instruction in the original dataset so that every sample can be identified:

python data_process/preprocess.py \
    --raw_annotation_path datasets/your_dataset.json \
    --new_annotation_save_path datasets/processed_dataset.json

This script adds a unique identifier (unique_idx) to each sample in your dataset, which the data selection process relies on. The processed dataset is saved to the specified path; the rest of the code expects the JSON files that include this unique_idx.

Please note that, as stated in the paper, we remove the text-only instructions from the LLaVA-1.5 dataset, since our method focuses on selecting images. You can either remove them yourself or use the already-processed JSON file here.
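
For reference, the preprocessing amounts to roughly the following sketch. This is an illustration of the idea rather than the repo's preprocess.py, and it assumes the LLaVA-style annotation format: a JSON list of dicts in which image-based samples carry an "image" key.

import json

def preprocess(raw_annotation_path, new_annotation_save_path, drop_text_only=True):
    # Load the annotation list (assumed LLaVA-style format).
    with open(raw_annotation_path) as f:
        samples = json.load(f)

    if drop_text_only:
        # The paper drops text-only instructions for LLaVA-1.5, since PreSel selects images.
        samples = [s for s in samples if "image" in s]

    # Attach a unique index so every instruction can be tracked in later steps.
    for idx, sample in enumerate(samples):
        sample["unique_idx"] = idx

    with open(new_annotation_save_path, "w") as f:
        json.dump(samples, f)

preprocess("datasets/your_dataset.json", "datasets/processed_dataset.json")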

3. Task Splits

For our method, we need to split the dataset into different tasks. We provide the task splits used in our experiments:

Place the downloaded and unzipped task split files in the data/ directory.

4. Reference Model Training

To estimate task importance values, we need a reference model trained on a small randomly selected reference dataset. You have two options:

Option 1: Use Our Pre-selected Reference Datasets

For LLaVA-1.5 and Vision-FLAN datasets, you can directly use our randomly selected reference datasets (5% of images and their corresponding instructions from each task):

  • LLaVA-1.5 reference data (randomly selected 5% images with instructions): Download JSON
  • Vision-FLAN reference data (randomly selected 5% images with instructions): Download JSON

Place the downloaded JSON files in the data/ directory.

Option 2: Create Your Own Reference Dataset

For custom datasets, you'll need to create a reference dataset by randomly sampling 5% of images along with their corresponding instructions from each task.
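
A minimal sketch of such per-task 5% sampling, assuming each task's samples are available as a list of annotation dicts (for example, loaded from the task split files above):

import json
import random

def sample_reference_set(task_splits, fraction=0.05, seed=0):
    # task_splits: dict mapping task name -> list of annotation dicts.
    rng = random.Random(seed)
    reference = []
    for task_name, samples in task_splits.items():
        if not samples:
            continue
        # Take 5% of the images (and their instructions) from every task.
        k = max(1, int(len(samples) * fraction))
        reference.extend(rng.sample(samples, k))
    return reference

# Example usage (file names are placeholders):
# task_splits = {name: json.load(open(path)) for name, path in task_split_paths.items()}
# json.dump(sample_reference_set(task_splits), open("data/my_reference_5pct.json", "w"))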

After preparing the reference dataset, fine-tune a LLaVA-7B model on it to obtain the reference model. For this step:

Fine-tune the LLaVA-7B model (available on Hugging Face) with LoRA, following the training script provided here.

This reference model will be used in later steps to estimate task-importance values.

Usage

1. Loss/Perplexity Calculations

First, process the reference data to remove the question parts of the instructions:

python data_process/remove_instruction.py \
    --input_path /data/round1_665k_notext.json \
    --output_path /data/round1_665k_notext_img_token.json

This will create a new file (/data/round1_665k_notext_img_token.json).
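
Conceptually, this step replaces the question text in every human turn with the image token alone, so the later loss computation conditions only on the image. A hedged sketch, assuming the LLaVA-style "conversations" format (the actual remove_instruction.py may handle the format differently):

import json

def strip_questions(input_path, output_path, image_token="<image>"):
    with open(input_path) as f:
        samples = json.load(f)
    for sample in samples:
        # Assumes LLaVA-style "conversations": alternating human/gpt turns.
        for turn in sample.get("conversations", []):
            if turn["from"] == "human":
                # Keep only the image placeholder; drop the question text.
                turn["value"] = image_token if image_token in turn["value"] else ""
    with open(output_path, "w") as f:
        json.dump(samples, f)

strip_questions("/data/round1_665k_notext.json", "/data/round1_665k_notext_img_token.json")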


Then run the loss/perplexity calculation twice, once on each of the two files:

python presel/loss_ppl_calc.py \
    --data_path /data/round1_665k_notext.json \
    --model_path /PATH/TO/REFERENCE_MODEL \
    --image_folder /datasets \
    --output_file /data/loss_ppl_round1_665k_notext.json
python presel/loss_ppl_calc.py \
    --data_path /data/round1_665k_notext_img_token.json \
    --model_path /PATH/TO/REFERENCE_MODEL \
    --image_folder /datasets \
    --output_file /data/loss_ppl_round1_665k_notext_img_token.json
  • Replace /PATH/TO/REFERENCE_MODEL with the path to your reference model checkpoint.
  • Adjust --image_folder and --output_file as needed for your setup.
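
In the usual convention (assumed here for the script's output), perplexity is simply the exponential of the mean token-level loss, so either quantity carries the same information:

import math

def perplexity_from_mean_loss(mean_token_loss):
    # Perplexity = exp(mean token-level cross-entropy loss).
    return math.exp(mean_token_loss)

print(perplexity_from_mean_loss(2.0))  # ~7.39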

2. Task Importance Estimation

Run the following to get the estimated task-importance values required for our data selection approach:

python presel/llava_task_importance.py \
    --data_w_path /data/loss_ppl_round1_665k_notext.json \
    --data_wo_path /data/loss_ppl_round1_665k_notext_img_token.json \
    --reference_data_path /data/round1_665k_notext.json \
    --task_files_dir /data \
    --output_dir /data
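
As a purely illustrative reading of these inputs (the exact formula is in presel/llava_task_importance.py and the paper), a task's importance can be thought of as an aggregate of the gap between the reference model's loss with the full instruction and its loss with the image token only, normalized across tasks:

def toy_task_importance(loss_with, loss_without, task_to_indices):
    # Illustrative only; not the paper's exact formula.
    # loss_with / loss_without: dict mapping unique_idx -> mean token loss.
    # task_to_indices: dict mapping task name -> list of unique_idx values.
    scores = {}
    for task, indices in task_to_indices.items():
        gaps = [loss_without[i] - loss_with[i]
                for i in indices if i in loss_with and i in loss_without]
        scores[task] = sum(gaps) / max(len(gaps), 1)
    total = sum(scores.values()) or 1.0
    return {task: s / total for task, s in scores.items()}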

3. Pre-Instruction Data Selection

First, we extract visual features with the DINOv2 model for each task (tasks 1 to 10 for the LLaVA dataset):

python data_process/extract_feats_665_dino.py --task_num TASK_NUM

Then run k-means clustering and sample selection:

python data_process/kmeans_clust.py --method typical
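
For orientation, the generic pattern behind this step is to cluster each task's DINOv2 features and prefer samples close to the cluster centres as the "typical" ones. Below is a sketch using scikit-learn; the actual kmeans_clust.py, its cluster counts, and its exact typicality criterion may differ:

import numpy as np
import torch
from sklearn.cluster import KMeans

def select_typical(feats_path, num_clusters=100, per_cluster=5, seed=0):
    # Features saved by the extraction step: assumed to be an (N, D) tensor of DINOv2 embeddings.
    feats = torch.load(feats_path).float().numpy()
    km = KMeans(n_clusters=num_clusters, n_init=10, random_state=seed).fit(feats)
    selected = []
    for c in range(num_clusters):
        members = np.where(km.labels_ == c)[0]
        if len(members) == 0:
            continue
        # Distance of each member to its cluster centre; closest = most "typical".
        dists = np.linalg.norm(feats[members] - km.cluster_centers_[c], axis=1)
        selected.extend(members[np.argsort(dists)[:per_cluster]].tolist())
    return selected  # row indices into the feature matrix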

Finally, run the following command to fine-tune the model on the selected data. Make sure to set the BASE_DIR value appropriately. This code implements multi-round training, where each round has a budget of 5% of the total data. Note that the results reported in the main paper correspond to round 3 (a 15% budget).

python presel/data_selection.py \
    --base_dir BASE_DIR \
    --method presel \
    --dataset_type llava
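
As a small illustration of the 5%-per-round budget mentioned above, a proportional per-task split based on the estimated task-importance weights could look like the following (the repo's data_selection.py handles the actual allocation, which may differ):

def split_round_budget(num_images_total, task_importance, round_fraction=0.05):
    # One round's budget is 5% of all images; here it is split across tasks
    # in proportion to the estimated task-importance weights.
    budget = int(round_fraction * num_images_total)
    return {task: int(round(budget * w)) for task, w in task_importance.items()}

# After three rounds the cumulative budget is 3 x 5% = 15% of the data,
# which is the setting reported in the main paper.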

Running on the Vision-FLAN Dataset

For the Vision-FLAN dataset, the steps are similar to those for the LLaVA-1.5 dataset mentioned above. For "Loss/Perplexity Calculations", you can follow the same steps, but make sure to adjust the code to match the Vision-FLAN data format (e.g., JSON files, reference set, image folder, etc.).

For "Task Importance Estimation", you can directly download the estimated task importance values here and place it in /data directory.

For "Pre-Instruction Data Selection", first use the same script, data_process/extract_feats_665_dino.py, to extract VF features. Save the output as /data/dino_feats_vf/dino_feats_all_vf.pt. Then, run

python data_process/kmeans_clust_vf.py --method typical

Finally, run the following command to fine-tune the model on the selected Vision-FLAN data:

python presel/data_selection.py \
    --base_dir BASE_DIR \
    --method presel \
    --dataset_type vision_flan \
    --file_path /datasets/annotation_191-task_1k_add_idx.json

Finetuned Models & Selected Data (15%)

You can find the 15% data subset selected by PreSel, as well as the models fine-tuned on it, here:

Dataset        15% Selected Data by PreSel (JSON)    Fine-tuned LLaVA-7B Model
LLaVA-1.5      Download                              Download
Vision-FLAN    Download                              Download

Evaluation

Please follow the original LLaVA page and VLMEvalKit to evaluate models.

Citation

If you find this codebase useful for your research, please cite our paper:

@inproceedings{safaei2025filter,
  title={Filter images first, generate instructions later: Pre-instruction data selection for visual instruction tuning},
  author={Safaei, Bardia and Siddiqui, Faizan and Xu, Jiacong and Patel, Vishal M and Lo, Shao-Yuan},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={14247--14256},
  year={2025}
}
