TUNA: Taming Unified Visual Representations for
Native Unified Multimodal Models

¹Meta BizAI  ²HKU  ³University of Waterloo  ⁴KAUST
Joint first authors, listed alphabetically by last name · Core contributors · *Joint project lead

Introducing TUNA, a family of native unified multimodal models

  • TUNA leverages unified visual representations to support image/video understanding, image/video generation, and image editing within a single framework (see the illustrative interface sketch after this list).
  • Extensive experiments show that TUNA's unified visual representation is highly effective, achieving state-of-the-art performance across multiple multimodal understanding and generation tasks.
  • Comprehensive ablation studies show that our unified visual representation design outperforms both prior methods built on unified representations and models that employ decoupled representations.
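
To make the "single framework" idea concrete, the sketch below shows what a one-model interface spanning these tasks might look like. It is purely illustrative: the class name UnifiedModel and its methods are hypothetical placeholders, not TUNA's released API.

# Hypothetical sketch of a single-model interface for understanding,
# generation, and editing. All names here (UnifiedModel, understand,
# generate_image, generate_video, edit) are illustrative placeholders
# and do not correspond to TUNA's actual code.
class UnifiedModel:
    """One model, one unified visual representation, many tasks (illustrative)."""

    def understand(self, media_path: str, question: str) -> str:
        """Image/video understanding: answer a question about the input."""
        raise NotImplementedError  # placeholder

    def generate_image(self, prompt: str):
        """Text-to-image generation from the same unified representation."""
        raise NotImplementedError  # placeholder

    def generate_video(self, prompt: str, num_frames: int = 48):
        """Text-to-video generation (e.g., 4 s at 12 fps -> 48 frames)."""
        raise NotImplementedError  # placeholder

    def edit(self, image_path: str, instruction: str):
        """Instruction-based image editing within the same framework."""
        raise NotImplementedError  # placeholder

The point of the sketch is the shared entry point: in a native unified model, all of these tasks consume and produce the same unified visual representation, rather than routing through separate, decoupled encoders and decoders per task.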

Text-to-Video Generation

All videos have a resolution of 384×672 and a frame rate of 12 fps.
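
As a self-contained illustration of these output settings, the snippet below writes a clip at the stated 384×672 resolution and 12 fps using imageio; the random frames are stand-ins for model-generated output, and the filename is arbitrary.

# Minimal sketch: writing a clip at the stated output settings
# (384x672 resolution, 12 fps). Requires `pip install imageio imageio-ffmpeg`.
# The random frames below are placeholders for model-generated frames.
import imageio.v2 as imageio
import numpy as np

HEIGHT, WIDTH, FPS = 384, 672, 12
num_frames = 4 * FPS  # a 4-second clip

frames = [
    np.random.randint(0, 256, size=(HEIGHT, WIDTH, 3), dtype=np.uint8)
    for _ in range(num_frames)
]
imageio.mimsave("sample_clip.mp4", frames, fps=FPS)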




Citation

If you find our work helpful, please cite our paper:

@article{liu2025tuna,
  title={TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models},
  author={Liu, Zhiheng and Ren, Weiming and Liu, Haozhe and Zhou, Zijian and Chen, Shoufa and Qiu, Haonan and Huang, Xiaoke and An, Zhaochong and Yang, Fanny and Patel, Aditya and others},
  journal={arXiv preprint arXiv:2512.02014},
  year={2025}
}