Prepare Datasets
As reported in our paper, we use two different datasets: the LLaVA_dataset and the ShareGPT4V_dataset. This section details data preparation for training. For the evaluation datasets, please see the instructions in the Evaluation section.
LLaVA dataset
Pretraining Images: The pretraining images of LLaVA are from the 558K subset of the LAION-CC-SBU dataset. Download as follows.
LAION-CC-SBU-558K: images.zip.
Pretraining Annotations: The pretraining annotations of LLaVA. Download as follows.
pretraining annotations: blip_laion_cc_sbu_558k.json.
SFT Images: The SFT images of LLaVA. Download as follows.
LAION-CC-SBU-558K: Already downloaded as “LAION-CC-SBU-558K” in Pretraining Images.
COCO: This dataset is from the COCO2017_challenge. Download: train2017.
GQA: GQA_project_page. Download: gqa_images.
OCR-VQA: OCR-VQA_project_page. Download: download_script. We save all files as .jpg.
TextVQA: TextVQA_project_page. Download: trainval_images.
VisualGenome: VisualGenome_project_page. Download: part1, part2.
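The OCR-VQA note above says all files are saved as .jpg. One way to get there after downloading is sketched below. It assumes the images sit in a single flat folder and that it is acceptable to leave the file contents unchanged and only normalize the extension; the function name `normalize_to_jpg` is hypothetical and not part of this repository.

```python
from pathlib import Path


def normalize_to_jpg(image_dir: str) -> int:
    """Rename files in image_dir so each ends in .jpg; return how many were renamed.

    Assumption: only the extension is changed, the image bytes are not re-encoded.
    """
    renamed = 0
    # sorted() materializes the listing first, so renaming does not disturb iteration.
    for path in sorted(Path(image_dir).iterdir()):
        if not path.is_file() or path.name.startswith("."):
            continue  # skip sub-folders and hidden files
        if path.suffix == ".jpg":
            continue  # already has the expected extension
        target = path.with_suffix(".jpg")
        if target.exists():
            continue  # never overwrite an existing image with the same stem
        path.rename(target)
        renamed += 1
    return renamed
```

Calling `normalize_to_jpg("path/to/your/dataset/ocr_vqa/images")` would rename, for example, a downloaded `.gif` or `.png` file to the same stem with a `.jpg` suffix.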
SFT Annotations: The SFT annotations of LLaVA. Download as follows.
SFT annotations: llava_v1_5_mix665k.json.
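Before training, it can help to see which image folders an annotation file actually refers to. The sketch below assumes the annotation file is a JSON list of records, each with an optional `image` field holding a path relative to the dataset root (for example `coco/train2017/...`). That layout is an assumption and is not stated in this section; `count_images_by_source` is a hypothetical helper.

```python
import json
from collections import Counter


def count_images_by_source(annotation_path: str) -> Counter:
    """Count annotation records per top-level image folder.

    Assumption: the file is a JSON list of dicts with an optional "image" key
    whose value is a relative path such as "coco/train2017/xxx.jpg".
    """
    with open(annotation_path, encoding="utf-8") as f:
        records = json.load(f)

    counts = Counter()
    for record in records:
        image = record.get("image")
        if image is None:
            counts["<no image>"] += 1  # record without an image reference
        else:
            counts[image.split("/", 1)[0]] += 1  # first path component
    return counts
```

Comparing the keys of the returned counter against the image folders you downloaded is a quick way to spot a missing dataset.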
ShareGPT4V dataset
Pretraining and SFT Images: The images of ShareGPT4V. Download as follows.
LAION-CC-SBU-558K: Already downloaded as “LAION-CC-SBU-558K” in LLaVA’s Pretraining Images.
COCO: Already downloaded as “COCO” in LLaVA’s SFT Images.
WebData & Share_TextVQA: This dataset is curated by the ShareGPT4V_project. Download: images. For academic use only.
SAM: This dataset is collected by Meta. Download: sam_images. We only use 000000~000050.tar for now. If you just want to use ShareGPT4V for SFT, you can quickly download 9K_images.
GQA: Already downloaded as “GQA” in LLaVA’s SFT Images.
OCR-VQA: Already downloaded as “OCR-VQA” in LLaVA’s SFT Images.
TextVQA: Already downloaded as “TextVQA” in LLaVA’s SFT Images.
VisualGenome: Already downloaded as “VisualGenome” in LLaVA’s SFT Images.
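For the SAM entry in the list above, only the archives numbered 000000 to 000050 are used. A small sketch for enumerating that range is shown below. Only the zero-padded index and the `.tar` suffix come from the text; the real archive names may carry an additional prefix, so the `prefix` parameter and the function name `sam_shard_names` are assumptions for illustration.

```python
from typing import List


def sam_shard_names(first: int = 0, last: int = 50, width: int = 6,
                    prefix: str = "") -> List[str]:
    """Return archive names for the inclusive index range [first, last].

    Assumption: names are a zero-padded index plus ".tar"; pass `prefix`
    if the downloaded archives use one.
    """
    return [f"{prefix}{i:0{width}d}.tar" for i in range(first, last + 1)]
```

With the defaults this yields 51 names, from `000000.tar` through `000050.tar`, which can be used to check that every expected archive was downloaded.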
Pretraining Annotations: The pretraining annotations of ShareGPT4V. Download as follows.
pretraining annotations: share-captioner_coco_lcs_sam_1246k_1107.json or really_cleaned_share-captioner_coco_lcs_sam_1246k_1107.json.
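SFT Annotations: The SFT annotations of ShareGPT4V. Download as follows.
SFT annotations: sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json or cleaned_sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json.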
Organize Data
Organize the image files and annotation files under path/to/your/dataset as follows:
dataset
├── llava
│ ├── llava_pretrain
│ │ ├── images
├── coco
│ ├── train2017
├── sam
│ ├── images
├── gqa
│ ├── images
├── ocr_vqa
│ ├── images
├── textvqa
│ ├── train_images
├── vg
│ ├── VG_100K
│ ├── VG_100K_2
├── share_textvqa
│ ├── images
├── web-celebrity
│ ├── images
├── web-landmark
│ ├── images
├── wikiart
│ ├── images
├── text_files
│ ├── blip_laion_cc_sbu_558k.json
│ ├── llava_v1_5_mix665k.json
│ ├── share-captioner_coco_lcs_sam_1246k_1107.json
│ ├── really_cleaned_share-captioner_coco_lcs_sam_1246k_1107.json
│ ├── sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json
│ ├── cleaned_sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json
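After organizing the files, a quick check that every folder from the tree above exists can catch a misplaced download early. The sketch below only tests for the presence of directories, not their contents; `EXPECTED_DIRS` is transcribed from the tree, and `missing_dirs` is a hypothetical helper rather than a script shipped with this repository.

```python
from pathlib import Path
from typing import List

# Directories taken from the layout above, relative to the dataset root.
EXPECTED_DIRS = [
    "llava/llava_pretrain/images",
    "coco/train2017",
    "sam/images",
    "gqa/images",
    "ocr_vqa/images",
    "textvqa/train_images",
    "vg/VG_100K",
    "vg/VG_100K_2",
    "share_textvqa/images",
    "web-celebrity/images",
    "web-landmark/images",
    "wikiart/images",
    "text_files",
]


def missing_dirs(root: str) -> List[str]:
    """Return the expected directories that do not exist under root."""
    base = Path(root)
    return [d for d in EXPECTED_DIRS if not (base / d).is_dir()]
```

An empty result from `missing_dirs("path/to/your/dataset")` means every expected folder is in place; any returned entries name the folders still to be created or filled.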