
Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model


🚀 Introduction

Muddit is the second-generation Meissonic, built upon discrete diffusion for unified and efficient multimodal generation.

Unlike traditional autoregressive methods, Muddit leverages discrete diffusion with an absorbing kernel (a.k.a. MaskGIT-style masking) as its core mechanism, enabling fast, parallel decoding across modalities.
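To make the absorbing kernel concrete, here is a minimal sketch of the forward corruption step: each token is independently replaced by a special [MASK] token with a probability set by a noise schedule. The `MASK_ID` value and the cosine schedule below are illustrative assumptions, not Muddit's actual configuration.

```python
import math

import torch

MASK_ID = 8255  # hypothetical mask-token id; the real vocabulary differs


def absorb_mask(tokens: torch.Tensor, t: float) -> torch.Tensor:
    """Corrupt a token sequence with an absorbing kernel: each token is
    independently replaced by [MASK] with a probability given by a cosine
    schedule, so t=0 keeps every token and t=1 masks everything."""
    mask_prob = 1.0 - math.cos(t * math.pi / 2)  # monotone in t on [0, 1]
    mask = torch.rand(tokens.shape) < mask_prob
    return torch.where(mask, torch.full_like(tokens, MASK_ID), tokens)
```

The model is then trained to predict the original tokens at the masked positions, which is what allows many positions to be filled in parallel at inference time.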

While most unified models are still rooted in language priors, Muddit is developed from a visual-first perspective for scalable and flexible generation. This aligns with Jim Fan's notion of "The Second Pre-training Paradigm."

Muddit (512) and Muddit Plus (1024) aim to handle diverse tasks across modalities, such as text generation, image generation, and vision-language reasoning, within a single architecture and decoding paradigm.

Figure: Tracing the Evolution of Unified Generation Foundation Models. (Data as of July 2025.)


Highlights

  • Unified discrete diffusion for both text and image generation under a shared MM-DiT backbone.
  • Built on Meissonic visual priors, instead of training a unified model from scratch.
  • One model, three tasks: text-to-image, image captioning, and VQA.
  • 1B parameters with competitive performance against much larger autoregressive multimodal models.
  • Fast parallel decoding with strong real-world throughput on both image and text generation.

Supported Tasks

  • Text-to-Image: Generate high-quality images from text prompts.

  • Image-to-Text: Generate descriptive captions from input images.

  • Visual Question Answering: Answer questions conditioned on both image and text.

Benchmark Snapshot

Main results

| Model | Params | GenEval | MS-COCO CIDEr | VQAv2 | MME | GQA | MMMU |
|---|---|---|---|---|---|---|---|
| Muddit (512) | 1B | 0.61 | 59.9 | 68.2 | 1107.4 | 57.5 | 27.6 |
| Muddit (1024) | 1B | - | 60.1 | 70.2 | 1139.2 | 57.8 | 28.7 |

Efficiency

| Model | Resolution | Steps | Text-to-Image (img/s) | Image-to-Text (token/s) |
|---|---|---|---|---|
| UniDisc | 512 | 32 | 0.89 | 79.36 |
| D-DiT | 512 | 28 | 0.62 | 26.89 |
| Muddit | 512 | 32 | 1.00 | 99.98 |

Muddit achieves both the highest text-to-image throughput and the fastest image-to-text decoding among the non-autoregressive baselines in this comparison.

💡 Inference

Installation

```shell
git clone https://github.com/M-E-AGI-Lab/Muddit.git
cd Muddit
pip install -r requirements.txt
```

Gradio Web UI

Please refer to https://huggingface.co/spaces/MeissonFlow/muddit/blob/main/app.py

Text-to-Image

```shell
bash inference_t2i.sh
```

Image-to-Text/VQA

```shell
bash inference_i2t.sh
```

🎓 Training

To train Muddit, follow these steps:

1. Install dependencies:

   ```shell
   pip install -r requirements.txt
   ```

2. Prepare your dataset and dataset class, following the format in dataset_utils.py and train_meissonic.py, and modify train.sh with your dataset path.

3. Start training:

   ```shell
   bash train/train_unified.sh
   ```

Note: For custom datasets, you'll likely need to implement your own dataset class.
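As a starting point, a custom dataset class for paired image-caption data might look like the following sketch. The JSONL manifest format, field names (`image`, `caption`), and class interface here are assumptions for illustration, not the actual API in dataset_utils.py.

```python
import json
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset


class PairedTextImageDataset(Dataset):
    """Yields (image, caption) pairs from a JSONL manifest whose lines
    contain 'image' (a file path) and 'caption' (a string) fields.

    This is an assumed format -- adapt it to match dataset_utils.py.
    """

    def __init__(self, manifest_path: str, transform=None):
        lines = Path(manifest_path).read_text().splitlines()
        self.records = [json.loads(ln) for ln in lines if ln.strip()]
        self.transform = transform

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        rec = self.records[idx]
        image = Image.open(rec["image"]).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, rec["caption"]
```

A class like this can then be wrapped in a standard `torch.utils.data.DataLoader` and pointed at by train.sh.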

Technical Overview

Muddit uses a shared MM-DiT generator for both text and image tokens. During training, one modality is randomly masked and reconstructed conditioned on the other. During inference, Muddit starts from fully masked tokens and iteratively denoises them in parallel.

This gives a genuinely unified framework where the same diffusion process supports:

  • text → image
  • image → text
  • image + question → answer
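The iterative parallel denoising described above can be sketched as a MaskGIT-style confidence-based unmasking loop: at each step the model predicts every masked position, the highest-confidence predictions are committed, and the rest are re-masked for the next step. The `MASK_ID` value, the cosine schedule, and the `model` interface below are illustrative stand-ins, not Muddit's actual implementation.

```python
import math

import torch

MASK_ID = 8255  # hypothetical mask-token id


def parallel_decode(model, seq_len: int, steps: int = 32) -> torch.Tensor:
    """Decode a sequence by starting fully masked and iteratively committing
    the highest-confidence predictions while re-masking the rest."""
    tokens = torch.full((1, seq_len), MASK_ID, dtype=torch.long)
    for step in range(steps):
        still_masked = tokens == MASK_ID
        if not still_masked.any():
            break
        logits = model(tokens)                 # (1, seq_len, vocab_size)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        # Cosine schedule: fraction of tokens left masked after this step.
        frac = math.cos((step + 1) / steps * math.pi / 2)
        n_keep_masked = int(frac * seq_len)
        # Already-committed tokens stay committed: give them infinite confidence.
        conf = conf.masked_fill(~still_masked, float("inf"))
        tokens = torch.where(still_masked, pred, tokens)
        if n_keep_masked > 0:
            # Re-mask the lowest-confidence predictions for another pass.
            idx = conf.topk(n_keep_masked, largest=False).indices
            tokens.scatter_(1, idx, MASK_ID)
    return tokens
```

Because many tokens are committed per step, the number of forward passes is fixed by `steps` rather than by sequence length, which is the source of the throughput advantage over token-by-token autoregressive decoding.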

📝 Meissonic Updates and Family Papers

📚 Citation

If you find this work helpful, please consider citing:

@article{shi2025muddit,
  title={Muddit: Liberating generation beyond text-to-image with a unified discrete diffusion model},
  author={Shi, Qingyu and Bai, Jinbin and Zhao, Zhuoran and Chai, Wenhao and Yu, Kaidong and Wu, Jianzong and Song, Shuangyong and Tong, Yunhai and Li, Xiangtai and Li, Xuelong and others},
  journal={arXiv preprint arXiv:2505.23606},
  year={2025}
}


Made with ❤️ by MeissonFlow Research.

See MeissonFlow Research (Organization Card) for more about our vision.

About

[ICLR 2026] Official Implementation of Muddit [Meissonic II]: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model.
