- [2025/06/10] 🔥 We release the training code of Unicorn. Try training!
- [2025/04/15] 🔥 We release the Unicorn-1.2M & Unicorn-Instruction-471K datasets. [HF]
Our code is based on [Bunny](https://github.com/BAAI-DCAI/Bunny).
Create a conda virtual environment and activate it:
```shell
conda create -n bunny python=3.10
conda activate bunny
```

Basic requirements:

```shell
pip install --upgrade pip  # enable PEP 660 support
pip install transformers
pip install torch torchvision xformers --index-url https://download.pytorch.org/whl/cu118
```

Install apex:

```shell
# https://github.com/NVIDIA/apex#from-source
pip install ninja
git clone https://github.com/NVIDIA/apex
cd apex
# if pip >= 23.1 (ref: https://pip.pypa.io/en/stable/news/#v23-1) which supports multiple `--config-settings` with the same key...
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
# otherwise
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```

Install flash-attention:

```shell
# https://github.com/Dao-AILab/flash-attention?tab=readme-ov-file#installation-and-features
pip install packaging
pip install flash-attn --no-build-isolation
```

Install bunny and other requirements:

```shell
git clone https://github.com/BAAI-DCAI/Bunny.git
cd Bunny
pip install -e .
```

Download the Unicorn-1.2M & Unicorn-Instruction-471K datasets. [HF]
Embed the captions from Unicorn-1.2M to get text embeddings:

```shell
python image_embed.py
```
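For orientation, caption embedding usually amounts to running each caption through a frozen text encoder and saving the resulting vectors. The following is a minimal sketch, not the repo's `image_embed.py`: the CLIP checkpoint, the `captions.txt` input file, the batch size, and the output path are all assumptions.

```python
# Sketch of caption embedding with a CLIP-style text encoder (assumed setup).
import pickle
import torch
from transformers import CLIPTextModelWithProjection, CLIPTokenizer

model_name = "openai/clip-vit-large-patch14"  # assumed encoder checkpoint
tokenizer = CLIPTokenizer.from_pretrained(model_name)
model = CLIPTextModelWithProjection.from_pretrained(model_name).eval()

with open("captions.txt") as f:  # hypothetical input: one caption per line
    captions = [line.strip() for line in f if line.strip()]

chunks = []
with torch.no_grad():
    for i in range(0, len(captions), 64):
        batch = tokenizer(captions[i:i + 64], padding=True,
                          truncation=True, return_tensors="pt")
        chunks.append(model(**batch).text_embeds)  # one vector per caption

with open("text_embeddings.pkl", "wb") as f:  # hypothetical output path
    pickle.dump(torch.cat(chunks).cpu(), f)
```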
Mean shift to get synthetic image embeddings:

```shell
python embed_mean.py
```
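The exact transformation lives in `embed_mean.py`; as one plausible reading of "mean shift", the sketch below centers each text embedding by subtracting the corpus mean and renormalizes, a common way to reduce the modality gap between text and image embeddings. The file names and the centering-only formulation are assumptions, not the repo's definitive method.

```python
# Sketch of a mean-shift step (assumed formulation, not necessarily embed_mean.py).
import pickle
import torch

with open("text_embeddings.pkl", "rb") as f:  # output of the previous sketch
    text_emb = pickle.load(f)  # tensor of shape (N, D)

mean = text_emb.mean(dim=0, keepdim=True)      # shared corpus-wide component
shifted = text_emb - mean                      # remove it from every embedding
synthetic = shifted / shifted.norm(dim=-1, keepdim=True)  # back to unit norm

with open("synthetic_image_embeddings.pkl", "wb") as f:  # hypothetical output
    pickle.dump(synthetic, f)
```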
Then, change the embedding path in `data_utils.py`:

```python
folder_path = ''
```

Note: the same `.pkl` file is used in both the pretraining and instruction-tuning stages.
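Illustratively, the training code would then read the embeddings from that folder; the file name and structure below are assumptions, since `data_utils.py` defines the actual layout.

```python
# Hypothetical illustration of loading the shared embeddings at train time.
import os
import pickle

folder_path = "/path/to/embeddings"  # assumed value; point this at your pkl folder
with open(os.path.join(folder_path, "synthetic_image_embeddings.pkl"), "rb") as f:
    embeddings = pickle.load(f)  # reused unchanged in both training stages
```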
Pretraining:

```shell
sh script/train/pretrain.sh
```

Instruction tuning:

```shell
sh script/train/finetune_full.sh
```
If you find this repository helpful, please cite the paper below.
```bibtex
@article{yu2025unicorn,
  title={Unicorn: Text-only data synthesis for vision language model training},
  author={Yu, Xiaomin and Ding, Pengxiang and Zhang, Wenjie and Huang, Siteng and Gao, Songyang and Qin, Chengwei and Wu, Kejian and Fan, Zhaoxin and Qiao, Ziyue and Wang, Donglin},
  journal={arXiv preprint arXiv:2503.22655},
  year={2025}
}
```

If you have any questions, please contact: [email protected]