Prepare Datasets

As reported in our paper, we use two different datasets: the LLaVA dataset and the ShareGPT4V dataset. This section details how to prepare the data for training. For the evaluation datasets, please see the instructions in the Evaluation section.

LLaVA dataset

ShareGPT4V dataset
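
Both sets of annotation files are distributed through the projects' official release channels. As a hedged illustration only: the ShareGPT4V captions are published on the Hugging Face Hub, and a sketch like the one below fetches a single annotation file with huggingface_hub. The repo id and destination directory here are assumptions; confirm both against the official download instructions before use.

```python
from huggingface_hub import hf_hub_download

# Assumed repo id and destination -- verify both against the official
# ShareGPT4V download instructions before relying on this sketch.
path = hf_hub_download(
    repo_id="Lin-Chen/ShareGPT4V",          # assumption
    filename="share-captioner_coco_lcs_sam_1246k_1107.json",
    repo_type="dataset",
    local_dir="path/to/your/dataset/text_files",
)
print(f"downloaded annotation file to {path}")
```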

Organize Data

Organize the image files and annotation files as follows in path/to/your/dataset:

dataset
├── llava
│   ├── llava_pretrain
│   │   ├── images
├── coco
│   ├── train2017
├── sam
│   ├── images
├── gqa
│   ├── images
├── ocr_vqa
│   ├── images
├── textvqa
│   ├── train_images
├── vg
│   ├── VG_100K
│   ├── VG_100K_2
├── share_textvqa
│   ├── images
├── web-celebrity
│   ├── images
├── web-landmark
│   ├── images
├── wikiart
│   ├── images
├── text_files
│   ├── blip_laion_cc_sbu_558k.json
│   ├── llava_v1_5_mix665k.json
│   ├── share-captioner_coco_lcs_sam_1246k_1107.json
│   ├── really_cleaned_share-captioner_coco_lcs_sam_1246k_1107.json
│   ├── sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json
│   ├── cleaned_sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json
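
To catch layout mistakes before launching training, it can help to sanity-check the tree above. The following is a minimal Python sketch, not part of the training code: the expected directory and file names are copied from the tree, and the assumption that each annotation record stores its image path relative to the dataset root in an "image" field follows the LLaVA-style annotation format.

```python
import json
from pathlib import Path

# Replace with your actual dataset location.
DATASET_ROOT = Path("path/to/your/dataset")

# Directories expected by the layout shown above.
EXPECTED_DIRS = [
    "llava/llava_pretrain/images",
    "coco/train2017",
    "sam/images",
    "gqa/images",
    "ocr_vqa/images",
    "textvqa/train_images",
    "vg/VG_100K",
    "vg/VG_100K_2",
    "share_textvqa/images",
    "web-celebrity/images",
    "web-landmark/images",
    "wikiart/images",
]

# Annotation files expected under text_files/.
EXPECTED_JSONS = [
    "blip_laion_cc_sbu_558k.json",
    "llava_v1_5_mix665k.json",
    "share-captioner_coco_lcs_sam_1246k_1107.json",
    "really_cleaned_share-captioner_coco_lcs_sam_1246k_1107.json",
    "sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json",
    "cleaned_sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json",
]

def check_layout(root: Path) -> bool:
    """Report any expected directory or annotation file that is missing."""
    ok = True
    for rel in EXPECTED_DIRS:
        if not (root / rel).is_dir():
            print(f"missing directory: {rel}")
            ok = False
    for name in EXPECTED_JSONS:
        if not (root / "text_files" / name).is_file():
            print(f"missing annotation file: text_files/{name}")
            ok = False
    return ok

def check_image_paths(root: Path, json_name: str, sample: int = 1000) -> None:
    # Assumes LLaVA-style records: an optional "image" field holding a
    # path relative to the dataset root. Spot-check the first `sample`
    # records so a broken layout fails fast.
    records = json.loads((root / "text_files" / json_name).read_text())
    missing = [r["image"] for r in records[:sample]
               if "image" in r and not (root / r["image"]).is_file()]
    print(f"{json_name}: {len(missing)} of first {sample} image paths missing")

if __name__ == "__main__":
    if check_layout(DATASET_ROOT):
        check_image_paths(DATASET_ROOT, "llava_v1_5_mix665k.json")
```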