Employing LLMs for visual generation has recently become a research focus. However, existing methods primarily transfer the LLM architecture to visual generation without accounting for the fundamental differences between language and vision. We present IAR, an Improved AutoRegressive visual generation method that leverages the correlation among visual embeddings to enhance training efficiency and generation quality. We propose a Codebook Rearrangement strategy based on balanced k-means clustering, together with a Cluster-oriented Cross-entropy Loss, which significantly improve robustness and quality. Extensive experiments demonstrate that IAR improves performance while halving training time across different model sizes.
Figure 1. Model framework: 1) Codebook Rearrangement: we first use a balanced k-means clustering method to rearrange the codebook, dividing it into n clusters such that the image codes within each cluster share high similarity. 2) Cluster-oriented Constraint: during training, we first quantize the image patches using the rearranged codebook, and then apply a cluster-oriented constraint so that the predicted token falls in the correct cluster with high probability.
Figure 2. When an autoregressive model predicts a wrong token, previous methods may output an irrelevant token that causes artifacts. Our IAR alleviates this issue by ensuring a high probability that the predicted token falls in the correct cluster.
We introduce a Codebook Rearrangement strategy that uses a balanced k-means clustering algorithm to rearrange the image codebook, ensuring high similarity among the image embeddings within each cluster (a sketch follows below).
We propose a Cluster-oriented Cross-entropy Loss that relaxes the original token-oriented cross-entropy loss: even if the model predicts the wrong token index, there is still a high probability that the predicted token lies in the correct cluster, which in turn yields high-quality images (see the loss sketch below).
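To make the rearrangement step concrete, below is a minimal sketch of balanced k-means over a VQ codebook, assuming a codebook tensor of shape (V, d) whose size V divides evenly into n clusters. The greedy capacity-constrained assignment and all function names here are illustrative, not the paper's released implementation.

```python
import torch


def balanced_kmeans(codebook: torch.Tensor, n_clusters: int, n_iters: int = 20) -> torch.Tensor:
    """Assign each codebook entry to one of `n_clusters` equal-size clusters."""
    V, _ = codebook.shape
    capacity = V // n_clusters  # assumes V % n_clusters == 0
    centroids = codebook[torch.randperm(V)[:n_clusters]].clone()
    assign = torch.zeros(V, dtype=torch.long)
    for _ in range(n_iters):
        dist = torch.cdist(codebook, centroids)  # (V, n_clusters) pairwise distances
        # Greedy balanced assignment: visit (code, cluster) pairs in order of
        # increasing distance, respecting each cluster's capacity.
        counts = torch.zeros(n_clusters, dtype=torch.long)
        placed = torch.zeros(V, dtype=torch.bool)
        for flat in dist.flatten().argsort():
            v, c = divmod(int(flat), n_clusters)
            if not placed[v] and counts[c] < capacity:
                assign[v], placed[v] = c, True
                counts[c] += 1
        # Recompute each centroid as the mean of its assigned codes.
        for c in range(n_clusters):
            centroids[c] = codebook[assign == c].mean(dim=0)
    return assign


def rearrange_codebook(codebook: torch.Tensor, n_clusters: int):
    """Permute the codebook so codes in the same cluster occupy a contiguous
    index range; also return the old-index -> new-index map for retokenizing."""
    assign = balanced_kmeans(codebook, n_clusters)
    perm = torch.argsort(assign, stable=True)
    old_to_new = torch.empty_like(perm)
    old_to_new[perm] = torch.arange(len(perm))
    return codebook[perm], old_to_new
```

After rearrangement, every cluster occupies a contiguous block of `capacity` indices, so a token's cluster is simply `index // capacity`, which is what the loss sketch below relies on.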
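And here is one way the relaxed loss could be realized on top of the rearranged codebook, assuming contiguous clusters of size `cluster_size` and an illustrative weighting coefficient `lambda_cluster`; treat this as a sketch of the idea rather than the paper's exact loss.

```python
import torch
import torch.nn.functional as F


def cluster_oriented_loss(
    logits: torch.Tensor,         # (N, V) next-token logits over the rearranged codebook
    targets: torch.Tensor,        # (N,) ground-truth token indices
    cluster_size: int,
    lambda_cluster: float = 1.0,  # illustrative weight, not from the paper
) -> torch.Tensor:
    # Standard token-oriented cross-entropy.
    token_ce = F.cross_entropy(logits, targets)

    # Because each cluster occupies a contiguous index range, cluster i owns
    # indices [i * cluster_size, (i + 1) * cluster_size). Summing token
    # probabilities within each range gives the cluster probabilities.
    N, V = logits.shape
    n_clusters = V // cluster_size
    probs = logits.softmax(dim=-1)
    cluster_probs = probs.view(N, n_clusters, cluster_size).sum(dim=-1)
    cluster_targets = targets // cluster_size

    # Cross-entropy on cluster probabilities: the model is rewarded for putting
    # mass anywhere inside the correct cluster, not only on the exact token.
    cluster_ce = F.nll_loss(torch.log(cluster_probs + 1e-9), cluster_targets)
    return token_ce + lambda_cluster * cluster_ce
```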
We compare our model with GAN-based, diffusion-based, mask-prediction-based, and autoregressive methods on ImageNet.
Table 1. Comparison of different types of image generation models on the class-conditional ImageNet 256×256 benchmark, reported in FID, IS, Precision, and Recall.
We further compare our model with LlamaGen across different numbers of image tokens and model sizes.
Table 2. Comparison with LlamaGen across different numbers of image tokens and model sizes. Following LlamaGen, we train the XL and XXL versions on 16×16 tokens for only 50 epochs.
We also conduct experiments to analyze how classifier-free guidance (CFG) scales affect generation quality, how model size influences performance, and how training speed compares across models.
Figure 3. (a) Performance of IAR-B and IAR-L across different CFG scales; (b) performance across model sizes (111M to 3B parameters) compared to LlamaGen; and (c) performance across training epochs compared to LlamaGen.
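For context on panel (a): classifier-free guidance in LLM-style visual generation typically runs the model with and without the class condition and mixes the two sets of next-token logits. The sketch below shows the common formulation, not code from this paper; a scale of 1.0 recovers the purely conditional model, and larger scales usually trade diversity (Recall) for fidelity (FID/IS).

```python
import torch


def cfg_logits(cond_logits: torch.Tensor, uncond_logits: torch.Tensor, scale: float) -> torch.Tensor:
    """Standard classifier-free guidance applied to next-token logits."""
    return uncond_logits + scale * (cond_logits - uncond_logits)
```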
@article{hu2025improving,
title={Improving Autoregressive Visual Generation with Cluster-Oriented Token Prediction},
author={Hu, Teng and Zhang, Jiangning and Yi, Ran and Weng, Jieyu and Wang, Yabiao and Zeng, Xianfang and Xue, Zhucun and Ma, Lizhuang},
journal={arXiv preprint arXiv:2501.00880},
year={2025}
}