This repo is the official implementation of "WeGen: A Unified Model for Interactive Multimodal Generation as We Chat", by Zhipeng Huang, Shaobin Zhuang, Canmiao Fu, Binxin Yang, Ying Zhang, Chong Sun, Zhizheng Zhang, Yali Wang, Chen Li, Zheng-Jun Zha
WeGen is a unified framework that integrates multimodal understanding and generation, enabling users to achieve various visual generation goals through natural conversation. It excels at generating diverse results with high creativity for less detailed instructions and can progressively refine prior generation results while maintaining consistency with user references.
- Unified Framework: Seamlessly integrates diverse capabilities including text-to-image generation, subject-driven generation, condition-driven generation, image restoration, and style transfer
- Dynamic Instance Identity Consistency (DIIC): Maintains instance identity consistency while allowing natural variations in generated contents
Coming soon.
- Clone the repository:

```bash
git clone https://github.com/hzphzp/WeGen.git
cd WeGen
```
- Prepare the base environment: we use Ubuntu 20 and Python 3.8, with H20 or 910B GPUs.
- Install the required packages:

```bash
bash env.sh
```

- Download the pre-trained models from here and organize the pretrained model folder as follows:
```
WeGen
└── wegen_mllm_ckpt
    ├── pretrained
    │   ├── CLIPScore_eval
    │   ├── EVA-CLIP
    │   ├── SEED-X
    │   ├── meta-llama
    │   │   └── Llama-2-7b-chat-hf
    │   └── stable-diffusion-xl-base-1.0
    ├── pytorch_model.bin
    ├── stage1_final
    │   └── unet
    └── stage2_final
        └── checkpoint-30000
```

The DIIC dataset is coming soon.
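After downloading the models, a quick sanity check can confirm the folder matches the expected layout. This helper is not part of the repo — it is a minimal sketch, with the required paths copied from the tree above; adjust the list if your layout differs.

```python
# Sanity-check the pretrained checkpoint layout (run from the WeGen/ repo root).
# Not part of the official codebase; paths are taken from the tree above.
from pathlib import Path

REQUIRED = [
    "wegen_mllm_ckpt/pretrained/CLIPScore_eval",
    "wegen_mllm_ckpt/pretrained/EVA-CLIP",
    "wegen_mllm_ckpt/pretrained/SEED-X",
    "wegen_mllm_ckpt/pretrained/meta-llama/Llama-2-7b-chat-hf",
    "wegen_mllm_ckpt/pretrained/stable-diffusion-xl-base-1.0",
    "wegen_mllm_ckpt/pytorch_model.bin",
    "wegen_mllm_ckpt/stage1_final/unet",
    "wegen_mllm_ckpt/stage2_final/checkpoint-30000",
]

def missing_paths(root: str = ".") -> list:
    """Return the required checkpoint paths that do not exist under root."""
    return [p for p in REQUIRED if not (Path(root) / p).exists()]

if __name__ == "__main__":
    missing = missing_paths()
    if missing:
        print("Missing paths:")
        for p in missing:
            print("  " + p)
    else:
        print("Checkpoint layout looks complete.")
```

If anything is reported missing, re-check the download step before starting training or inference.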
Run the following commands to train the model on a cluster of 128 H20/910B GPUs:
```bash
# stage1
bash scripts/wegen_mllm_stage1.sh
# stage2
bash scripts/wegen_mllm_stage2.sh
# stage3
bash scripts/wegen_mllm_stage3.sh
```

Run the following command to evaluate the model on a node with 8 H20/910B GPUs:
```bash
bash scripts/inference.sh
```

If you find this code and work useful, please consider citing the following paper and starring this repo. Thank you very much!
```bibtex
@article{huang2025wegen,
  title={WeGen: A Unified Model for Interactive Multimodal Generation as We Chat},
  author={Huang, Zhipeng and Zhuang, Shaobin and Fu, Canmiao and Yang, Binxin and Zhang, Ying and Sun, Chong and Zhang, Zhizheng and Wang, Yali and Li, Chen and Zha, Zheng-Jun},
  journal={arXiv preprint arXiv:2503.01115},
  year={2025}
}
```