- Base: Pink_Base
- Base_Object365: Pink_Object365
- Base_RefCOCO: Pink_Refcoco
The pretraining dataset used in this release is the same as in LLaVA, which is a subset of the CC-3M dataset. Please see here for a detailed description of the dataset structure and how to download the images.
The datasets listed below need to be downloaded manually.
- COCO: train2017
- VisualGenome: part1, part2, objects, relationships, region descriptions
- Object365: Object365
- A-OKVQA: A-OKVQA
- LLaVA-158K: LLaVA-158K
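Since all of these datasets are fetched by hand, it can help to sanity-check the local layout before training. The sketch below does that; the directory names are assumptions for illustration, not the layout this repository requires, so adjust them to match your setup.

```python
import os

# Hypothetical expected sub-directories after manual download.
# These names are assumptions -- adapt them to your local layout.
EXPECTED = [
    "coco/train2017",
    "visual_genome/VG_100K",
    "visual_genome/VG_100K_2",
    "object365",
    "aokvqa",
    "llava_158k",
]

def missing_datasets(root):
    """Return the expected sub-directories that are absent under `root`."""
    return [p for p in EXPECTED if not os.path.isdir(os.path.join(root, p))]
```

Running `missing_datasets("/path/to/data")` before kicking off a training script gives an early, readable failure instead of a mid-epoch file-not-found error.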
We also provide the converted datasets used for instruction tuning:
Our model is based on Llama-2-7b-chat-hf. You need to download the weights manually.
- Llama-2-7b-chat-hf: Llama-2-7b-chat-hf
- Install Package
conda create -n pink python=3.10 -y
conda activate pink
pip install --upgrade pip # enable PEP 660 support
pip install -e .
Please refer to scripts/stage1.sh.
Please refer to scripts/stage2.sh.
Please refer to scripts/stage2_with_object365.sh.
We convert the *.json annotation files of Object365. Please refer to dataset_generation/object365_detection.py.
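The conversion step above turns detection annotations into instruction-tuning records. As a rough illustration, here is a minimal sketch of converting one COCO-style annotation; the conversation template and the coordinate convention (boxes normalized to [0, 1], two decimal places) are assumptions for illustration, not necessarily the exact format produced by object365_detection.py.

```python
def to_instruction_sample(image_info, annotation, category_name):
    """Convert one COCO-style detection annotation into an
    instruction-tuning style record.

    Assumed conventions (not confirmed by the repo): boxes are
    emitted as text, normalized to [0, 1] with two decimals.
    """
    w, h = image_info["width"], image_info["height"]
    x, y, bw, bh = annotation["bbox"]  # COCO boxes are [x, y, width, height]
    box = [x / w, y / h, (x + bw) / w, (y + bh) / h]
    box_text = "[{:.2f},{:.2f},{:.2f},{:.2f}]".format(*box)
    return {
        "image": image_info["file_name"],
        "conversations": [
            {"from": "human", "value": f"Where is the {category_name} in the image?"},
            {"from": "gpt", "value": box_text},
        ],
    }
```

Emitting coordinates as normalized text keeps the target sequence independent of image resolution, which is why many referential-comprehension pipelines adopt a similar convention.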
Please refer to scripts/object365_generate.sh.
Please refer to scripts/object365_filter.py.
Please refer to inference.ipynb and scripts/eval_refcoco.sh.
To launch a Gradio web demo, use the following command.
python demo.py --checkpoint-path /path/to/pink --llama-path /path/to/llama2
If you find Pink useful for your research and applications, please cite using this BibTeX:
@article{xuan2023pink,
title={Pink: Unveiling the power of referential comprehension for multi-modal {LLMs}},
author={Xuan, Shiyu and Guo, Qingpei and Yang, Ming and Zhang, Shiliang},
journal={arXiv preprint arXiv:2310.00582},
year={2023}
}
