
[CVPRW 2025] UniToken is an auto-regressive generation model that encodes visual inputs with a combination of discrete and continuous representations, enabling seamless integration of visual understanding and image generation tasks.


UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding

Yang Jiao1,2,   Haibo Qiu3,   Zequn Jie3,   Shaoxiang Chen3,   Jingjing Chen1,2,  
Lin Ma3,   Yu-Gang Jiang1,2

1Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University  
2Shanghai Collaborative Innovation Center on Intelligent Visual Computing  
3Meituan

UniToken 

📣 News

  • [2025-04-02] 🎉🎉🎉 UniToken paper is accepted to CVPR 2025 workshop! 🎉🎉🎉
  • [2025-04-01] 🎉🎉🎉 We release the recaptioned text prompts of GenEval and T2I-Compbench! 🎉🎉🎉
  • [2025-04-01] 🎉🎉🎉 UniToken paper and training codes are released! 🎉🎉🎉

🛠️ Installation

See INSTALL.md for detailed instructions.

🎓 Training

See unitoken/TRAIN.md for detailed instructions.

🤖 Inference

Preparation

Download the original VQ-VAE weights, Lumina-mGPT-512, and SigLIP, and place them in the following directory structure:

UniToken
- unitoken/
    - ckpts/
        - chameleon/
            - tokenizer/
                - text_tokenizer.json
                - vqgan.yaml
                - vqgan.ckpt
        - Lumina-mGPT-7B-512/
        - SigLIP/
- xllmx/
- ...
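The layout above can be scaffolded before downloading; a minimal sketch (the folder names follow the tree above, while the tokenizer files and the Lumina-mGPT-7B-512 / SigLIP weights must still be downloaded and placed inside):

```python
import os

# Create the checkpoint directory layout expected by UniToken.
# This only creates empty folders; text_tokenizer.json, vqgan.yaml,
# vqgan.ckpt, and the model weights are downloaded separately.
for d in [
    "unitoken/ckpts/chameleon/tokenizer",
    "unitoken/ckpts/Lumina-mGPT-7B-512",
    "unitoken/ckpts/SigLIP",
]:
    os.makedirs(d, exist_ok=True)
```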

Simple Inference

A minimal example of UniToken inference:

from inference_solver_anyres import FlexARInferenceSolverAnyRes
from PIL import Image

# ******************** Image Generation ********************
inference_solver = FlexARInferenceSolverAnyRes(
    model_path="OceanJay/UniToken-AnyRes-StageII",
    precision="bf16",
    target_size=512,
)

q1 = "Generate an image according to the following prompt:\n" \
     "A majestic phoenix with fiery wings soaring above a tranquil mountain lake, casting shimmering reflections on the water. Sparks and embers trail behind it as the sky glows with hues of orange and gold."

# generated: tuple of (generated response, list of generated images)
generated = inference_solver.generate_img(
    images=[],
    qas=[[q1, None]],
    max_gen_len=1536,
    temperature=1.0,
    logits_processor=inference_solver.create_logits_processor(cfg=3.0, image_top_k=4000),
)

a1, new_image = generated[0], generated[1][0]


# ******************* Image Understanding ******************
inference_solver = FlexARInferenceSolverAnyRes(
    model_path="OceanJay/UniToken-AnyRes-StageII",
    precision="bf16",
    target_size=512,
)

# The "<|image|>" placeholder is replaced with a sequence of image tokens before being fed to the LLM
q1 = "<|image|>Please describe the details of the image as much as possible."

images = [Image.open("../assets/1.png").convert('RGB')]
qas = [[q1, None]]

# `len(images)` should equal the number of occurrences of "<|image|>" in qas
generated = inference_solver.generate(
    images=images,
    qas=qas,
    max_gen_len=512,
    temperature=1.0,
    logits_processor=inference_solver.create_logits_processor(cfg=4.0, image_top_k=2000),
)

a1 = generated[0]
# generated[1], the list of newly generated images, is typically empty for understanding tasks.
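The constraint noted in the comment above (each `"<|image|>"` placeholder must be matched by one input image) can be validated cheaply before calling `generate`. The helper below, `check_image_placeholders`, is a hypothetical name for illustration and is not part of the UniToken API:

```python
def check_image_placeholders(qas, images):
    """Verify that the number of "<|image|>" placeholders across all
    questions in `qas` matches the number of provided images.

    Raises ValueError on mismatch; returns the placeholder count otherwise.
    """
    n_placeholders = sum(q.count("<|image|>") for q, _ in qas)
    if n_placeholders != len(images):
        raise ValueError(
            f"qas contain {n_placeholders} <|image|> placeholder(s) "
            f"but {len(images)} image(s) were provided"
        )
    return n_placeholders
```

Running it on the understanding example above returns 1; a mismatch raises a ValueError before any model computation is spent.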

🤗 Checkpoints

Model                     Huggingface
UniToken-base-StageI      OceanJay/UniToken-base-StageI
UniToken-base-StageII     OceanJay/UniToken-base-StageII
UniToken-AnyRes-StageI    OceanJay/UniToken-AnyRes-StageI
UniToken-AnyRes-StageII   OceanJay/UniToken-AnyRes-StageII

📚 Datasets

We've observed that existing text-to-image generation models struggle with the short text prompts used in benchmarks such as GenEval and T2I-CompBench++. To address this, we revised these prompts to be more descriptive and have released the enhanced version on Hugging Face. We encourage you to try it out and see the improvements with your own model!

🙏 Acknowledgement

We sincerely appreciate Lumina-mGPT for providing high-quality training codes, as well as Emu3 and Janus for releasing pretrained checkpoints for evaluation.

📄 Citation

@article{jiao2025unitoken,
  title={UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding},
  author={Jiao, Yang and Qiu, Haibo and Jie, Zequn and Chen, Shaoxiang and Chen, Jingjing and Ma, Lin and Jiang, Yu-Gang},
  journal={arXiv preprint arXiv:2504.04423},
  year={2025}
}
