Ruiyan Han*, Zhen Fang*, Xinyu Sun*, Yuchen Ma, Ziheng Wang, Yu Zeng, Zehui Chen, Lin Chen, Wenxuan Huang, Wei-Jie Xu, Yi Cao, and Feng Zhao
contact: [email protected]
While Unified Multimodal Models (UMMs) have achieved remarkable success in cross-modal comprehension, a significant gap persists in their ability to leverage such internal knowledge for high-quality generation. We formalize this discrepancy as Conduction Aphasia, a phenomenon where models accurately interpret multimodal inputs but struggle to translate that understanding into faithful and controllable synthesis. To address this, we propose UniCorn, a simple yet elegant self-improvement framework that eliminates the need for external data or teacher supervision. By partitioning a single UMM into three collaborative roles (Proposer, Solver, and Judge), UniCorn generates high-quality interactions via self-play and employs cognitive pattern reconstruction to distill latent understanding into explicit generative signals. To validate the restoration of multimodal coherence, we introduce UniCycle, a cycle-consistency benchmark based on a Text-to-Image-to-Text reconstruction loop. Extensive experiments demonstrate that UniCorn achieves comprehensive and substantial improvements over the base model across six general image generation benchmarks. Notably, it achieves SOTA performance on TIIF (73.8), DPG (86.8), CompBench (88.5), and UniCycle, while further delivering substantial gains of +5.0 on WISE and +6.5 on OneIG. These results highlight that our method significantly enhances T2I generation while maintaining robust comprehension, demonstrating the scalability of fully self-supervised refinement for unified multimodal intelligence.
We sincerely thank all contributors from the open community for their valuable support.
- Jan. 12, 2026: We released our checkpoint. Welcome to download and try it!
- Jan. 7, 2026: We released the official report for UniCorn.
This list tracks the progress of our open-source development and model optimization:
- Release the code.
- Release the ckpt.
We appreciate the support from our contributors and the open-source community.
Following BAGEL's original settings, pay attention to the following:
About Inference Hyperparameters:
- `cfg_text_scale`: Controls how strongly the model follows the text prompt. `1.0` disables text guidance. Typical range: `4.0`–`8.0`.
- `cfg_image_scale`: Controls how much the model preserves input image details. `1.0` disables image guidance. Typical range: `1.0`–`2.0`.
- `cfg_interval`: Fraction of denoising steps where CFG is applied. Later steps can skip CFG to reduce computation. Typical: `[0.4, 1.0]`.
- `timestep_shift`: Shifts the distribution of denoising steps. Higher values allocate more steps at the start (affects layout); lower values allocate more at the end (improves details).
- `num_timesteps`: Total denoising steps. Typical: `50`.
- `cfg_renorm_min`: Minimum value for CFG-Renorm. `1.0` disables renorm. Typical: `0`.
- `cfg_renorm_type`: CFG-Renorm method:
  - `global`: Normalize over all tokens and channels (default for T2I).
  - `channel`: Normalize across channels for each token.
  - `text_channel`: Like `channel`, but only applied to the text condition (good for editing, may cause blur).
- If edited images appear blurry, try `global` CFG-Renorm, decrease `cfg_renorm_min`, or decrease `cfg_scale`.
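The hyperparameter guidance above can be collected into a small config builder. This is a minimal sketch, not part of the released API: the parameter names and typical values come from this README, while the `make_t2i_config` helper itself is hypothetical.

```python
# Hypothetical helper that bundles the inference hyperparameters described
# above into one dict, with editing-friendly defaults when requested.

def make_t2i_config(editing: bool = False) -> dict:
    """Return a typical hyperparameter set for T2I generation or editing."""
    return {
        "cfg_text_scale": 6.0,        # text guidance; 1.0 disables (typical 4.0-8.0)
        "cfg_image_scale": 1.5 if editing else 1.0,  # image guidance (typical 1.0-2.0)
        "cfg_interval": [0.4, 1.0],   # fraction of denoising steps where CFG runs
        "timestep_shift": 3.0,        # higher -> more early steps (layout emphasis)
        "num_timesteps": 50,          # total denoising steps
        "cfg_renorm_min": 0.0,        # 1.0 disables CFG-Renorm
        # text_channel suits editing but may blur; global is the T2I default.
        "cfg_renorm_type": "text_channel" if editing else "global",
    }


if __name__ == "__main__":
    print(make_t2i_config())
    print(make_t2i_config(editing=True))
```

If edited images come out blurry with this editing config, the tip above applies: switch `cfg_renorm_type` back to `global` or lower `cfg_renorm_min`.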
This project is built upon several excellent open-source projects: BAGEL, IRG, and SRUM. We sincerely thank the authors for their contributions:
We are grateful to the broader research community for their open-source spirit and collaborative efforts.
@article{han2026unicorn,
title={UniCorn: Towards Self-Improving Unified Multimodal Models through Self-Generated Supervision},
  author={Han, Ruiyan and Fang, Zhen and Sun, Xinyu and Ma, Yuchen and Wang, Ziheng and Zeng, Yu and Chen, Zehui and Chen, Lin and Huang, Wenxuan and Xu, Wei-Jie and others},
journal={arXiv preprint arXiv:2601.03193},
year={2026},
}

UniCorn is licensed under the Apache License 2.0.


