Skip to content

Yu-xm/Unicorn

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

28 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Unicorn: Text-Only Data Synthesis for Vision Language Model Training

📄 Paper | 🤗 Data

News

  • [2025/06/10] 🔥 We release the training code of the Unicorn. Try training!
  • [2025/04/15] 🔥 Release Unicorn-1.2M & Unicorn-Instruction-471K Datasets. [HF]

Training

Our code is based on Bunny.

Env

Create a conda virtual environment and activate it:

conda create -n bunny python=3.10
conda activate bunny

Basic requirements

pip install --upgrade pip  # enable PEP 660 support
pip install transformers
pip install torch torchvision xformers --index-url https://download.pytorch.org/whl/cu118

Install apex

# https://github.com/NVIDIA/apex#from-source
pip install ninja
git clone https://github.com/NVIDIA/apex
cd apex
# if pip >= 23.1 (ref: https://pip.pypa.io/en/stable/news/#v23-1) which supports multiple `--config-settings` with the same key...
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
# otherwise
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" ./

Install flash-attention

# https://github.com/Dao-AILab/flash-attention?tab=readme-ov-file#installation-and-features
pip install packaging
pip install flash-attn --no-build-isolation

Install bunny and other requirements

git clone https://github.com/BAAI-DCAI/Bunny.git
cd Bunny
pip install -e .

Data

Download the Unicorn-1.2M & Unicorn-Instruction-471K Datasets. [HF]

Embed the captions from Unicorn-1.2M to get text embeddings

python image_embed.py

Mean shift to get synthetic image embeddings

python embed_mean.py

Then, change the embedding path in data_utils.py

folder_path = ''

Note: the same pkl file is used in both the pretraining and instruction-tuning stages

Pretrain

sh script/train/pretrain.sh

Instruction Tuning

sh script/train/finetune_full.sh

Citation

If you find this repository helpful, please cite the paper below.

@article{yu2025unicorn,
  title={Unicorn: Text-only data synthesis for vision language model training},
  author={Yu, Xiaomin and Ding, Pengxiang and Zhang, Wenjie and Huang, Siteng and Gao, Songyang and Qin, Chengwei and Wu, Kejian and Fan, Zhaoxin and Qiao, Ziyue and Wang, Donglin},
  journal={arXiv preprint arXiv:2503.22655},
  year={2025}
}

Contact

If you have any questions, please contact: [email protected]

About

Text-Only Data Synthesis for Vision Language Model Training

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published