Jinjie Ni†, Qian Liu, Chao Du, Longxu Dou, Hang Yan, Zili Wang, Tianyu Pang, Michael Qizhe Shieh
†Correspondence to: Jinjie Ni <[email protected]>
[2025-10-27] We release the codebase, all training checkpoints, and logs. The codebase is highly optimized and is industry-level in terms of scalability and efficiency.
[2025-10-03] The full paper is out! Check it out here!
The codebase is released here. It is highly optimized for training DLMs at any scale.
You can also use the code under the mega-dlms folder of this repo, which might not be actively maintained.
We open-source all model checkpoints and training logs mentioned in the paper. All of them can be downloaded at https://huggingface.co/collections/jinjieni/mdga.
The easiest way to download a folder is to use this script (set up the variables properly):
python utils/hf_download_folder.py
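If you would rather not edit the script, a single folder can likely be fetched with `snapshot_download` from the `huggingface_hub` package instead. The sketch below is an assumption about how one might do that, not the repository's own script: the helper names `folder_patterns` and `download_folder` and the `./downloads` default are hypothetical, and the repo id and folder in the usage comment are taken from the example URL in this section.

```python
from typing import List


def folder_patterns(folder: str) -> List[str]:
    """Build glob patterns matching every file under `folder` in a Hub repo."""
    folder = folder.strip("/")
    # Hub filtering uses fnmatch-style globs, where `*` also matches "/",
    # so this single pattern covers nested subfolders as well.
    return [f"{folder}/*"]


def download_folder(repo_id: str, folder: str,
                    local_dir: str = "./downloads",
                    repo_type: str = "dataset") -> str:
    """Download only `folder` from `repo_id` and return the local path."""
    # Imported lazily so folder_patterns() works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        repo_type=repo_type,
        allow_patterns=folder_patterns(folder),
        local_dir=local_dir,
    )


# Example call (requires network access and `pip install huggingface_hub`):
# download_folder("MDGA-1/quokka_logs", "batch_size1/1024_96b_1e_1b_ar")
```

Restricting the download with `allow_patterns` avoids pulling the whole repository when only one run's logs are needed.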
Alternatively, you can also use wget to directly download individual files from the folder, e.g.:
wget https://huggingface.co/datasets/MDGA-1/quokka_logs/resolve/main/batch_size1/1024_96b_1e_1b_ar/tensorboard/events.out.tfevents.1756378168.7837477398
The name pattern for Compute-constrained scaling laws: {FLOPs}_{model_size (M)}_{training_tokens}
The name pattern for Data-constrained scaling laws: {model_size (M)}_{training_tokens}_{repetition}_{anneal_steps}_{warmup_steps}_{micro_batch_size}_{global_batch_size}_{learning_rate}
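To make the naming patterns concrete, here is a small sketch that splits a run name into labelled fields. It assumes that fields are separated by single underscores and that no field contains an underscore itself, which this README does not guarantee. The function `parse_run_name`, the field labels, and the example names in the comments are hypothetical and are not real checkpoint names.

```python
from typing import Dict, Tuple

# Field labels mirror the two name patterns described in this README.
COMPUTE_FIELDS: Tuple[str, ...] = ("flops", "model_size_m", "training_tokens")
DATA_FIELDS: Tuple[str, ...] = (
    "model_size_m", "training_tokens", "repetition", "anneal_steps",
    "warmup_steps", "micro_batch_size", "global_batch_size", "learning_rate",
)


def parse_run_name(name: str, fields: Tuple[str, ...]) -> Dict[str, str]:
    """Map each underscore-separated part of `name` to its field label.

    Values are kept as strings because the exact value formats
    (e.g. how token counts are abbreviated) are not specified here.
    """
    parts = name.split("_")
    if len(parts) != len(fields):
        raise ValueError(
            f"expected {len(fields)} fields, got {len(parts)} in {name!r}"
        )
    return dict(zip(fields, parts))


# Made-up names, for illustration only:
# parse_run_name("1e19_100_2b", COMPUTE_FIELDS)
# parse_run_name("100_2b_4_1000_500_8_256_2e-4", DATA_FIELDS)
```

A name with the wrong number of parts raises `ValueError`, which helps catch folders that follow a different convention.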
We link the related resources below:
- [ckpt][log] Compute-constrained scaling laws
- [ckpt][log] Data-constrained scaling laws
- Key Modeling and Optimization Choices
- Masked and uniform transition kernels
- Diffusion schedules
- Uniform 𝑡 vs. clean-to-noisy 𝑡 sampling
- Principled diffusion loss and MaskGIT loss
- Batch size transferability from AR models to DLMs
- Learning rate transferability from AR models to DLMs
- The impact of weight decay on AR models in single-epoch scenarios
- The impact of weight decay on DLMs in single-epoch scenarios
- The impact of weight decay on AR models in multi-epoch scenarios
- The impact of weight decay on DLMs in multi-epoch scenarios
You can refer to this script to run inference with the Hugging Face checkpoints. Because there are so many of them, most of the small checkpoints above are still in Megatron format. You may refer to this script to convert them (you will need to tweak the conversion scripts).
@article{ni2025training,
title={Training Optimal Large Diffusion Language Models},
author={Ni, Jinjie and Liu, Qian and Du, Chao and Dou, Longxu and Yan, Hang and Wang, Zili and Pang, Tianyu and Shieh, Michael Qizhe},
journal={arXiv preprint arXiv:2510.03280},
year={2025}
}
