Muddit is the second-generation Meissonic, built upon discrete diffusion for unified and efficient multimodal generation.
Unlike traditional autoregressive methods, Muddit leverages discrete diffusion with an absorbing kernel (a.k.a. MaskGIT-style masking) as its core mechanism, enabling fast, parallel decoding across modalities.
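To make the mechanism concrete, here is a minimal sketch of the absorbing ("mask") kernel: at noise level `t`, each token is independently replaced by a special `[MASK]` token with a schedule-dependent probability. `MASK_ID` and the cosine schedule below are illustrative choices, not Muddit's exact configuration.

```python
# Forward process of absorbing-state discrete diffusion: tokens are absorbed
# into [MASK] with probability cos(pi/2 * (1 - t)); t = 1 masks everything.
import math

import torch

MASK_ID = 8255  # hypothetical [MASK] token id in the codebook


def corrupt(tokens: torch.Tensor, t: float) -> torch.Tensor:
    p = math.cos(0.5 * math.pi * (1.0 - t))                      # mask probability at level t
    masked = torch.rand(tokens.shape, device=tokens.device) < p  # i.i.d. Bernoulli mask
    return torch.where(masked, torch.full_like(tokens, MASK_ID), tokens)
```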
While most unified models are still rooted in language priors, Muddit is developed from a visual-first perspective for scalable and flexible generation. This aligns with Jim Fan's "The Second Pre-training Paradigm."
Muddit (512) and Muddit Plus (1024) aim to handle diverse tasks across modalities, such as text generation, image generation, and vision-language reasoning, within a single architecture and decoding paradigm.
- Unified discrete diffusion for both text and image generation under a shared MM-DiT backbone.
- Built on Meissonic's visual priors instead of training a unified model from scratch.
- One model, three tasks: text-to-image, image captioning, and VQA.
- 1B parameters with competitive performance against much larger autoregressive multimodal models.
- Fast parallel decoding with strong real-world throughput on both image and text generation.
- Text-to-Image: Generate high-quality images from text prompts.
- Image-to-Text: Generate descriptive captions from input images.
- Visual Question Answering: Answer questions conditioned on both image and text.
| Model | Params | GenEval | MS-COCO CIDEr | VQAv2 | MME | GQA | MMMU |
|---|---|---|---|---|---|---|---|
| Muddit (512) | 1B | 0.61 | 59.9 | 68.2 | 1107.4 | 57.5 | 27.6 |
| Muddit (1024) | 1B | - | 60.1 | 70.2 | 1139.2 | 57.8 | 28.7 |
| Model | Resolution | Steps | Text-to-Image (img/s) | Image-to-Text (token/s) |
|---|---|---|---|---|
| UniDisc | 512 | 32 | 0.89 | 79.36 |
| D-DiT | 512 | 28 | 0.62 | 26.89 |
| Muddit | 512 | 32 | 1.00 | 99.98 |
Muddit achieves the highest text-to-image throughput in this comparison and the fastest image-to-text decoding among the listed non-autoregressive baselines.
```bash
git clone https://github.com/M-E-AGI-Lab/Muddit.git
cd Muddit
pip install -r requirements.txt
```

Please refer to https://huggingface.co/spaces/MeissonFlow/muddit/blob/main/app.py.
```bash
bash inference_t2i.sh
bash inference_i2t.sh
```

To train Muddit, follow these steps:
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Prepare your own dataset and dataset class following the format in `dataset_utils.py` and `train_meissonic.py` (see the sketch after the note below).
- Modify `train.sh` with your dataset path.
- Start training:

  ```bash
  bash train/train_unified.sh
  ```
Note: For custom datasets, you'll likely need to implement your own dataset class.
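As a starting point, here is a minimal image-text dataset sketch. The file layout (`annotations.json`) and field names (`image`, `caption`) are assumptions for illustration; adapt them to the format actually expected by `dataset_utils.py` and `train_meissonic.py`.

```python
# Hypothetical image-text dataset: a root folder containing annotations.json
# (a list of {"image": <relative path>, "caption": <str>}) plus the image files.
import json
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms


class ImageTextDataset(Dataset):
    def __init__(self, root: str, resolution: int = 512):
        self.root = Path(root)
        self.records = json.loads((self.root / "annotations.json").read_text())
        self.transform = transforms.Compose([
            transforms.Resize(resolution),      # resize the shorter edge
            transforms.CenterCrop(resolution),  # square crop at target resolution
            transforms.ToTensor(),              # HWC uint8 -> CHW float in [0, 1]
        ])

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        rec = self.records[idx]
        image = Image.open(self.root / rec["image"]).convert("RGB")
        return {"image": self.transform(image), "caption": rec["caption"]}
```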
Muddit uses a shared MM-DiT generator for both text and image tokens. During training, one modality is randomly masked and reconstructed conditioned on the other. During inference, Muddit starts from fully masked tokens and iteratively denoises them in parallel.
This gives a genuinely unified framework where the same diffusion process supports:
- text → image
- image → text
- image + question → answer
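For intuition, below is a minimal sketch of the confidence-based parallel decoding loop described above. The `model(tokens, cond)` call stands in for the MM-DiT token predictor and is not Muddit's actual API; greedy argmax is used for brevity where practical implementations typically sample with a temperature.

```python
# MaskGIT-style iterative parallel decoding: start fully masked, commit the most
# confident predictions each step, re-mask the rest on a cosine schedule.
import math

import torch


@torch.no_grad()
def parallel_decode(model, cond, seq_len: int, mask_id: int, steps: int = 32):
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long)  # fully masked start
    for step in range(steps):
        logits = model(tokens, cond)             # (1, seq_len, vocab_size), assumed signature
        conf, pred = logits.softmax(-1).max(-1)  # per-token confidence and argmax
        # Tokens committed in earlier steps stay fixed and keep full confidence.
        pred = torch.where(tokens == mask_id, pred, tokens)
        conf = torch.where(tokens == mask_id, conf, torch.ones_like(conf))
        # Cosine schedule: fraction of tokens left masked after this step.
        num_masked = int(seq_len * math.cos(0.5 * math.pi * (step + 1) / steps))
        if num_masked == 0:
            return pred                          # everything committed
        # Re-mask the least confident positions; commit the rest in parallel.
        cutoff = conf.topk(num_masked, largest=False).values.max()
        tokens = torch.where(conf <= cutoff, torch.full_like(pred, mask_id), pred)
    return tokens
```

In this framing, the three tasks above differ only in which token stream starts masked and what `cond` carries: image tokens masked with text as condition (text → image), text tokens masked with image as condition (image → text), or answer tokens masked with image and question as condition (VQA).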
- MaskGIT: Masked Generative Image Transformer [CVPR 2022]
- Muse: Text-To-Image Generation via Masked Generative Transformers [ICML 2023]
- [🌟] Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis [ICLR 2025]
- Bag of Design Choices for Inference of High-Resolution Masked Generative Transformer
- Di[𝙼]O: Distilling Masked Diffusion Models into One-step Generator [ICCV 2025]
- [🌟] Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model [ICLR 2026]
- DC-AR: Efficient Masked Autoregressive Image Generation with Deep Compression Hybrid Tokenizer [ICCV 2025]
- MDNS: Masked Diffusion Neural Sampler via Stochastic Optimal Control
- Lavida-O: Elastic Large Masked Diffusion Models for Unified Multimodal Understanding and Generation
- [🌟] Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding
- Token Painter: Training-Free Text-Guided Image Inpainting via Mask Autoregressive Models
- TR2-D2: Tree Search Guided Trajectory-Aware Fine-Tuning for Discrete Diffusion
- OneFlow: Concurrent Mixed-Modal and Interleaved Generation with Edit Flows
- Diffuse Everything: Multimodal Diffusion Models on Arbitrary State Spaces [ICML 2025]
- Towards Better & Faster Autoregressive Image Generation: From the Perspective of Entropy [NeurIPS 2025]
- [🌟] From Masks to Worlds: A Hitchhiker's Guide to World Models
- Soft-Di[M]O: Improving One-Step Discrete Image Generation with Soft Embeddings
- Accelerating Inference of Masked Image Generators via Reinforcement Learning
- Sparse-LaViDa: Sparse Multimodal Discrete Diffusion Language Models
- Reinforcement Learning Meets Masked Generative Models: Mask-GRPO for Text-to-Image Generation
- MaskFocus: Focusing Policy Optimization on Critical Steps for Masked Image Generation
- Co-GRPO: Co-Optimized Group Relative Policy Optimization for Masked Diffusion Model
- More papers are coming soon! See MeissonFlow Research (Organization Card) for more about our vision.
If you find this work helpful, please consider citing:
```bibtex
@article{shi2025muddit,
  title={Muddit: Liberating generation beyond text-to-image with a unified discrete diffusion model},
  author={Shi, Qingyu and Bai, Jinbin and Zhao, Zhuoran and Chai, Wenhao and Yu, Kaidong and Wu, Jianzong and Song, Shuangyong and Tong, Yunhai and Li, Xiangtai and Li, Xuelong and others},
  journal={arXiv preprint arXiv:2505.23606},
  year={2025}
}
```

Made with ❤️ by MeissonFlow Research.
