Prepare Datasets

As reported in our paper, we use two different datasets: the LLaVA dataset and the ShareGPT4V dataset. This section details how to prepare the data for training. For the evaluation datasets, please see the instructions in the Evaluation section.

LLaVA dataset

ShareGPT4V dataset
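
Both sets of annotation files are distributed through the projects' official release channels. As a hedged illustration only: the ShareGPT4V captions are published on the Hugging Face Hub, and a sketch like the one below fetches a single annotation file with huggingface_hub. The repo id and destination directory here are assumptions; confirm both against the official download instructions before use.

```python
from huggingface_hub import hf_hub_download

# Assumed repo id and destination -- verify both against the official
# ShareGPT4V download instructions before relying on this sketch.
path = hf_hub_download(
    repo_id="Lin-Chen/ShareGPT4V",          # assumption
    filename="share-captioner_coco_lcs_sam_1246k_1107.json",
    repo_type="dataset",
    local_dir="path/to/your/dataset/text_files",
)
print(f"downloaded annotation file to {path}")
```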

Organize Data

Organize the image files and annotation files as follows in path/to/your/dataset:

dataset
├── llava
│   ├── llava_pretrain
│   │   ├── images
├── coco
│   ├── train2017
├── sam
│   ├── images
├── gqa
│   ├── images
├── ocr_vqa
│   ├── images
├── textvqa
│   ├── train_images
├── vg
│   ├── VG_100K
│   ├── VG_100K_2
├── share_textvqa
│   ├── images
├── web-celebrity
│   ├── images
├── web-landmark
│   ├── images
├── wikiart
│   ├── images
├── text_files
│   ├── blip_laion_cc_sbu_558k.json
│   ├── llava_v1_5_mix665k.json
│   ├── share-captioner_coco_lcs_sam_1246k_1107.json
│   ├── really_cleaned_share-captioner_coco_lcs_sam_1246k_1107.json
│   ├── sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json
│   ├── cleaned_sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json
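
To catch layout mistakes before launching training, it can help to sanity-check the tree above. The following is a minimal Python sketch, not part of the training code: the expected directory and file names are copied from the tree, and the assumption that each annotation record stores its image path relative to the dataset root in an "image" field follows the LLaVA-style annotation format.

```python
import json
from pathlib import Path

# Replace with your actual dataset location.
DATASET_ROOT = Path("path/to/your/dataset")

# Directories expected by the layout shown above.
EXPECTED_DIRS = [
    "llava/llava_pretrain/images",
    "coco/train2017",
    "sam/images",
    "gqa/images",
    "ocr_vqa/images",
    "textvqa/train_images",
    "vg/VG_100K",
    "vg/VG_100K_2",
    "share_textvqa/images",
    "web-celebrity/images",
    "web-landmark/images",
    "wikiart/images",
]

# Annotation files expected under text_files/.
EXPECTED_JSONS = [
    "blip_laion_cc_sbu_558k.json",
    "llava_v1_5_mix665k.json",
    "share-captioner_coco_lcs_sam_1246k_1107.json",
    "really_cleaned_share-captioner_coco_lcs_sam_1246k_1107.json",
    "sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json",
    "cleaned_sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json",
]

def check_layout(root: Path) -> bool:
    """Report any expected directory or annotation file that is missing."""
    ok = True
    for rel in EXPECTED_DIRS:
        if not (root / rel).is_dir():
            print(f"missing directory: {rel}")
            ok = False
    for name in EXPECTED_JSONS:
        if not (root / "text_files" / name).is_file():
            print(f"missing annotation file: text_files/{name}")
            ok = False
    return ok

def check_image_paths(root: Path, json_name: str, sample: int = 1000) -> None:
    # Assumes LLaVA-style records: an optional "image" field holding a
    # path relative to the dataset root. Spot-check the first `sample`
    # records so a broken layout fails fast.
    records = json.loads((root / "text_files" / json_name).read_text())
    missing = [r["image"] for r in records[:sample]
               if "image" in r and not (root / r["image"]).is_file()]
    print(f"{json_name}: {len(missing)} of first {sample} image paths missing")

if __name__ == "__main__":
    if check_layout(DATASET_ROOT):
        check_image_paths(DATASET_ROOT, "llava_v1_5_mix665k.json")
```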