
[CVPRW 2025] UniToken is an auto-regressive generation model that encodes visual inputs with a combination of discrete and continuous representations, enabling seamless integration of visual understanding and image generation tasks.


UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding

Yang Jiao1,2,   Haibo Qiu3,   Zequn Jie3,   Shaoxiang Chen3,   Jingjing Chen1,2,  
Lin Ma3,   Yu-Gang Jiang1,2

1Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University  
2Shanghai Collaborative Innovation Center on Intelligent Visual Computing  
3Meituan

UniToken 

📣 News

  • [2025-04-02] 🎉🎉🎉 UniToken paper is accepted to CVPR 2025 workshop! 🎉🎉🎉
  • [2025-04-01] 🎉🎉🎉 We release the recaptioned text prompts of GenEval and T2I-Compbench! 🎉🎉🎉
  • [2025-04-01] 🎉🎉🎉 UniToken paper and training codes are released! 🎉🎉🎉

🛠️ Installation

See INSTALL.md for detailed instructions.

🎓 Training

See unitoken/TRAIN.md for detailed instructions.

🤖 Inference

Preparation

Download the original VQ-VAE weights, Lumina-mGPT-512, and SigLIP, and place them in the following directory structure:

UniToken
- unitoken/
    - ckpts/
        - chameleon/
            - tokenizer/
                - text_tokenizer.json
                - vqgan.yaml
                - vqgan.ckpt
        - Lumina-mGPT-7B-512/
        - SigLIP/
- xllmx/
- ...
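The layout above can be scaffolded before downloading; a minimal sketch (the folder names follow the tree above, while the tokenizer files and the Lumina-mGPT-7B-512 / SigLIP weights must still be downloaded and placed inside):

```python
import os

# Create the checkpoint directory layout expected by UniToken.
# This only creates empty folders; text_tokenizer.json, vqgan.yaml,
# vqgan.ckpt, and the model weights are downloaded separately.
for d in [
    "unitoken/ckpts/chameleon/tokenizer",
    "unitoken/ckpts/Lumina-mGPT-7B-512",
    "unitoken/ckpts/SigLIP",
]:
    os.makedirs(d, exist_ok=True)
```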

Simple Inference

A minimal example of UniToken inference:

from inference_solver_anyres import FlexARInferenceSolverAnyRes
from PIL import Image

# ******************** Image Generation ********************
inference_solver = FlexARInferenceSolverAnyRes(
    model_path="OceanJay/UniToken-AnyRes-StageII",
    precision="bf16",
    target_size=512,
)

q1 = "Generate an image according to the following prompt:\n" \
     "A majestic phoenix with fiery wings soaring above a tranquil mountain lake, casting shimmering reflections on the water. Sparks and embers trail behind it as the sky glows with hues of orange and gold."

# generated: tuple of (generated response, list of generated images)
generated = inference_solver.generate_img(
    images=[],
    qas=[[q1, None]],
    max_gen_len=1536,
    temperature=1.0,
    logits_processor=inference_solver.create_logits_processor(cfg=3.0, image_top_k=4000),
)

a1, new_image = generated[0], generated[1][0]


# ******************* Image Understanding ******************
inference_solver = FlexARInferenceSolverAnyRes(
    model_path="OceanJay/UniToken-AnyRes-StageII",
    precision="bf16",
    target_size=512,
)

# The "<|image|>" placeholder is replaced with a sequence of image tokens before being fed to the LLM
q1 = "<|image|>Please describe the details of the image as much as possible."

images = [Image.open("../assets/1.png").convert('RGB')]
qas = [[q1, None]]

# `len(images)` should equal the number of occurrences of "<|image|>" in qas
generated = inference_solver.generate(
    images=images,
    qas=qas,
    max_gen_len=512,
    temperature=1.0,
    logits_processor=inference_solver.create_logits_processor(cfg=4.0, image_top_k=2000),
)

a1 = generated[0]
# generated[1], the list of newly generated images, is typically empty for understanding tasks.
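The constraint noted in the comment above (each `"<|image|>"` placeholder must be matched by one input image) can be validated cheaply before calling `generate`. The helper below, `check_image_placeholders`, is a hypothetical name for illustration and is not part of the UniToken API:

```python
def check_image_placeholders(qas, images):
    """Verify that the number of "<|image|>" placeholders across all
    questions in `qas` matches the number of provided images.

    Raises ValueError on mismatch; returns the placeholder count otherwise.
    """
    n_placeholders = sum(q.count("<|image|>") for q, _ in qas)
    if n_placeholders != len(images):
        raise ValueError(
            f"qas contain {n_placeholders} <|image|> placeholder(s) "
            f"but {len(images)} image(s) were provided"
        )
    return n_placeholders
```

Running it on the understanding example above returns 1; a mismatch raises a ValueError before any model computation is spent.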

🤗 Checkpoints

Model                     Huggingface
UniToken-base-StageI      OceanJay/UniToken-base-StageI
UniToken-base-StageII     OceanJay/UniToken-base-StageII
UniToken-AnyRes-StageI    OceanJay/UniToken-AnyRes-StageI
UniToken-AnyRes-StageII   OceanJay/UniToken-AnyRes-StageII

📚 Datasets

We've observed that existing text-to-image generation models struggle with the short text prompts used in benchmarks such as GenEval and T2I-CompBench++. To address this, we revised these prompts to be more descriptive and have released the enhanced version on Hugging Face. We encourage you to try it out and see the improvements with your own model!

🙏 Acknowledgement

We sincerely appreciate Lumina-mGPT for providing high-quality training codes, as well as Emu3 and Janus for releasing pretrained checkpoints for evaluation.

📄 Citation

@article{jiao2025unitoken,
  title={UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding},
  author={Jiao, Yang and Qiu, Haibo and Jie, Zequn and Chen, Shaoxiang and Chen, Jingjing and Ma, Lin and Jiang, Yu-Gang},
  journal={arXiv preprint arXiv:2504.04423},
  year={2025}
}
