Ruiyan Han*, Zhen Fang*, Xinyu Sun*, Yuchen Ma, Ziheng Wang, Yu Zeng, Zehui Chen, Lin Chen, Wenxuan Huang, Wei-Jie Xu, Yi Cao, and Feng Zhao
contact: [email protected]
While Unified Multimodal Models (UMMs) have achieved remarkable success in cross-modal comprehension, a significant gap persists in their ability to leverage such internal knowledge for high-quality generation. We formalize this discrepancy as Conduction Aphasia, a phenomenon where models accurately interpret multimodal inputs but struggle to translate that understanding into faithful and controllable synthesis. To address this, we propose UniCorn, a simple yet elegant self-improvement framework that eliminates the need for external data or teacher supervision. By partitioning a single UMM into three collaborative roles (Proposer, Solver, and Judge), UniCorn generates high-quality interactions via self-play and employs cognitive pattern reconstruction to distill latent understanding into explicit generative signals. To validate the restoration of multimodal coherence, we introduce UniCycle, a cycle-consistency benchmark based on a Text-to-Image-to-Text reconstruction loop. Extensive experiments demonstrate that UniCorn achieves comprehensive and substantial improvements over the base model across six general image generation benchmarks. Notably, it achieves SOTA performance on TIIF (73.8), DPG (86.8), CompBench (88.5), and UniCycle, while further delivering substantial gains of +5.0 on WISE and +6.5 on OneIG. These results highlight that our method significantly enhances T2I generation while maintaining robust comprehension, demonstrating the scalability of fully self-supervised refinement for unified multimodal intelligence.
We sincerely thank all contributors from the open community for their valuable support.
- Jan. 12, 2026: We released our checkpoint. Welcome to download and try it!
- Jan. 7, 2026: We released the official report for UniCorn.
This list tracks the progress of our open-source development and model optimization:
- Release the code.
- Release the ckpt.
We appreciate the support from our contributors and the open-source community.
Following BAGEL's original settings, pay attention to the following:
About Inference Hyperparameters:
- `cfg_text_scale`: Controls how strongly the model follows the text prompt. `1.0` disables text guidance. Typical range: `4.0`–`8.0`.
- `cfg_image_scale`: Controls how much the model preserves input image details. `1.0` disables image guidance. Typical range: `1.0`–`2.0`.
- `cfg_interval`: Fraction of denoising steps where CFG is applied. Later steps can skip CFG to reduce computation. Typical: `[0.4, 1.0]`.
- `timestep_shift`: Shifts the distribution of denoising steps. Higher values allocate more steps at the start (affects layout); lower values allocate more at the end (improves details).
- `num_timesteps`: Total denoising steps. Typical: `50`.
- `cfg_renorm_min`: Minimum value for CFG-Renorm. `1.0` disables renorm. Typical: `0`.
- `cfg_renorm_type`: CFG-Renorm method:
  - `global`: Normalize over all tokens and channels (default for T2I).
  - `channel`: Normalize across channels for each token.
  - `text_channel`: Like `channel`, but only applied to the text condition (good for editing, may cause blur).
- If edited images appear blurry, try `global` CFG-Renorm, decrease `cfg_renorm_min`, or decrease `cfg_scale`.
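The hyperparameter guidance above can be collected into a small config builder. This is a minimal sketch, not part of the released API: the parameter names and typical values come from this README, while the `make_t2i_config` helper itself is hypothetical.

```python
# Hypothetical helper that bundles the inference hyperparameters described
# above into one dict, with editing-friendly defaults when requested.

def make_t2i_config(editing: bool = False) -> dict:
    """Return a typical hyperparameter set for T2I generation or editing."""
    return {
        "cfg_text_scale": 6.0,        # text guidance; 1.0 disables (typical 4.0-8.0)
        "cfg_image_scale": 1.5 if editing else 1.0,  # image guidance (typical 1.0-2.0)
        "cfg_interval": [0.4, 1.0],   # fraction of denoising steps where CFG runs
        "timestep_shift": 3.0,        # higher -> more early steps (layout emphasis)
        "num_timesteps": 50,          # total denoising steps
        "cfg_renorm_min": 0.0,        # 1.0 disables CFG-Renorm
        # text_channel suits editing but may blur; global is the T2I default.
        "cfg_renorm_type": "text_channel" if editing else "global",
    }


if __name__ == "__main__":
    print(make_t2i_config())
    print(make_t2i_config(editing=True))
```

If edited images come out blurry with this editing config, the tip above applies: switch `cfg_renorm_type` back to `global` or lower `cfg_renorm_min`.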
This project is built upon several excellent open-source projects: BAGEL, IRG, and SRUM. We sincerely thank the authors for their contributions:
We are grateful to the broader research community for their open-source spirit and collaborative efforts.
@article{han2026unicorn,
title={UniCorn: Towards Self-Improving Unified Multimodal Models through Self-Generated Supervision},
  author={Han, Ruiyan and Fang, Zhen and Sun, Xinyu and Ma, Yuchen and Wang, Ziheng and Zeng, Yu and Chen, Zehui and Chen, Lin and Huang, Wenxuan and Xu, Wei-Jie and others},
journal={arXiv preprint arXiv:2601.03193},
year={2026},
}

UniCorn is licensed under the Apache License 2.0.


