Employing LLMs for visual generation has recently become a research focus. However, existing methods primarily transfer the LLM architecture to visual generation without accounting for the fundamental differences between language and vision. We present IAR, an Improved AutoRegressive visual generation method that leverages the correlation among visual embeddings to enhance training efficiency and generation quality. We propose a Codebook Rearrangement strategy based on balanced k-means clustering, together with a Cluster-oriented Cross-entropy Loss, which significantly improve robustness and quality. Extensive experiments demonstrate that IAR improves performance while halving training time across different model sizes.
Figure 1. Model framework: 1) Codebook Rearrangement: we first use a balanced k-means clustering method to rearrange the codebook, dividing it into n clusters such that the image codes within each cluster share high similarity. 2) Cluster-oriented Constraint: during training, we first quantize the image patches using the rearranged codebook, and then apply a cluster-oriented constraint so that the predicted token falls in the correct cluster with high probability.
Figure 2. When an autoregressive model predicts a wrong token, previous methods may output an irrelevant token that causes artifacts. Our IAR alleviates this issue by ensuring a high probability that the predicted token falls in the correct cluster.
We introduce a Codebook Rearrangement strategy that uses a balanced k-means clustering algorithm to rearrange the image codebook, ensuring high similarity among the image embeddings within each cluster (a sketch follows below).
We propose a Cluster-oriented Cross-entropy Loss that relaxes the original token-oriented cross-entropy loss: even if the model predicts the wrong token index, there is still a high probability that the predicted token lies in the correct cluster, which in turn yields high-quality images (see the loss sketch below).
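To make the rearrangement step concrete, below is a minimal sketch of balanced k-means over a VQ codebook, assuming a codebook tensor of shape (V, d) whose size V divides evenly into n clusters. The greedy capacity-constrained assignment and all function names here are illustrative, not the paper's released implementation.

```python
import torch


def balanced_kmeans(codebook: torch.Tensor, n_clusters: int, n_iters: int = 20) -> torch.Tensor:
    """Assign each codebook entry to one of `n_clusters` equal-size clusters."""
    V, _ = codebook.shape
    capacity = V // n_clusters  # assumes V % n_clusters == 0
    centroids = codebook[torch.randperm(V)[:n_clusters]].clone()
    assign = torch.zeros(V, dtype=torch.long)
    for _ in range(n_iters):
        dist = torch.cdist(codebook, centroids)  # (V, n_clusters) pairwise distances
        # Greedy balanced assignment: visit (code, cluster) pairs in order of
        # increasing distance, respecting each cluster's capacity.
        counts = torch.zeros(n_clusters, dtype=torch.long)
        placed = torch.zeros(V, dtype=torch.bool)
        for flat in dist.flatten().argsort():
            v, c = divmod(int(flat), n_clusters)
            if not placed[v] and counts[c] < capacity:
                assign[v], placed[v] = c, True
                counts[c] += 1
        # Recompute each centroid as the mean of its assigned codes.
        for c in range(n_clusters):
            centroids[c] = codebook[assign == c].mean(dim=0)
    return assign


def rearrange_codebook(codebook: torch.Tensor, n_clusters: int):
    """Permute the codebook so codes in the same cluster occupy a contiguous
    index range; also return the old-index -> new-index map for retokenizing."""
    assign = balanced_kmeans(codebook, n_clusters)
    perm = torch.argsort(assign, stable=True)
    old_to_new = torch.empty_like(perm)
    old_to_new[perm] = torch.arange(len(perm))
    return codebook[perm], old_to_new
```

After rearrangement, every cluster occupies a contiguous block of `capacity` indices, so a token's cluster is simply `index // capacity`, which is what the loss sketch below relies on.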
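And here is one way the relaxed loss could be realized on top of the rearranged codebook, assuming contiguous clusters of size `cluster_size` and an illustrative weighting coefficient `lambda_cluster`; treat this as a sketch of the idea rather than the paper's exact loss.

```python
import torch
import torch.nn.functional as F


def cluster_oriented_loss(
    logits: torch.Tensor,         # (N, V) next-token logits over the rearranged codebook
    targets: torch.Tensor,        # (N,) ground-truth token indices
    cluster_size: int,
    lambda_cluster: float = 1.0,  # illustrative weight, not from the paper
) -> torch.Tensor:
    # Standard token-oriented cross-entropy.
    token_ce = F.cross_entropy(logits, targets)

    # Because each cluster occupies a contiguous index range, cluster i owns
    # indices [i * cluster_size, (i + 1) * cluster_size). Summing token
    # probabilities within each range gives the cluster probabilities.
    N, V = logits.shape
    n_clusters = V // cluster_size
    probs = logits.softmax(dim=-1)
    cluster_probs = probs.view(N, n_clusters, cluster_size).sum(dim=-1)
    cluster_targets = targets // cluster_size

    # Cross-entropy on cluster probabilities: the model is rewarded for putting
    # mass anywhere inside the correct cluster, not only on the exact token.
    cluster_ce = F.nll_loss(torch.log(cluster_probs + 1e-9), cluster_targets)
    return token_ce + lambda_cluster * cluster_ce
```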
We compare our model with GAN-based, diffusion-based, mask-prediction-based, and autoregressive methods on ImageNet.
Table 1. Comparison of different types of image generation models on the class-conditional ImageNet 256×256 benchmark, reported in FID, IS, Precision, and Recall.
We further compare our model with LlamaGen across different numbers of image tokens and model sizes.
Table 2. Comparison with LlamaGen across different numbers of image tokens and model sizes. Following LlamaGen, we train the XL and XXL versions on 16×16 tokens for only 50 epochs.
We also conduct experiments to analyze how classifier-free guidance (CFG) scales affect generation quality, how model size influences performance, and how training speed compares across models.
Figure 3. (a) Performance of IAR-B and IAR-L across different CFG scales; (b) performance across model sizes (111M to 3B parameters) compared to LlamaGen; and (c) performance across training epochs compared to LlamaGen.
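For context on panel (a): classifier-free guidance in LLM-style visual generation typically runs the model with and without the class condition and mixes the two sets of next-token logits. The sketch below shows the common formulation, not code from this paper; a scale of 1.0 recovers the purely conditional model, and larger scales usually trade diversity (Recall) for fidelity (FID/IS).

```python
import torch


def cfg_logits(cond_logits: torch.Tensor, uncond_logits: torch.Tensor, scale: float) -> torch.Tensor:
    """Standard classifier-free guidance applied to next-token logits."""
    return uncond_logits + scale * (cond_logits - uncond_logits)
```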
@article{hu2025improving,
title={Improving Autoregressive Visual Generation with Cluster-Oriented Token Prediction},
author={Hu, Teng and Zhang, Jiangning and Yi, Ran and Weng, Jieyu and Wang, Yabiao and Zeng, Xianfang and Xue, Zhucun and Ma, Lizhuang},
journal={arXiv preprint arXiv:2501.00880},
year={2025}
}