Training Optimal Large Diffusion Language Models

Jinjie Ni†, Qian Liu, Chao Du, Longxu Dou, Hang Yan, Zili Wang, Tianyu Pang, Michael Qizhe Shieh

†Correspondence to: Jinjie Ni <[email protected]>

Large-Scale Scaling Laws for Diffusion Language Models.

News

[2025-10-27] We release the codebase, all training checkpoints, and logs. The codebase is highly optimized and industry-grade in terms of scalability and efficiency.

[2025-10-03] The full paper is out (arXiv:2510.03280)!

Code

The codebase is released here. It is a highly optimized codebase for training DLMs at any scale.

You can also use the code under the mega-dlms folder of this repo, though that copy may not be actively maintained.

Resources

We open-source all model checkpoints and training logs mentioned in the paper. All of them can be downloaded from https://huggingface.co/collections/jinjieni/mdga.

The easiest way to download a folder is to use this script (set the variables in it first):

python utils/hf_download_folder.py
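
If you prefer not to edit the script, a folder can also be fetched with the `huggingface_hub` library. The sketch below is not the repo's `utils/hf_download_folder.py`; it is an independent example that assumes `huggingface_hub` is installed and uses the repo ID and folder path that appear in the wget example in this README. The helper names `folder_pattern` and `download_folder` are made up for illustration.

```python
def folder_pattern(folder: str) -> str:
    """Build a glob that matches every file under `folder` in a Hub repo."""
    # Strip stray slashes so "a/b/" and "/a/b" both become "a/b/*".
    return folder.strip("/") + "/*"


def download_folder(repo_id: str, folder: str, local_dir: str,
                    repo_type: str = "dataset") -> str:
    """Download one folder of a Hub repo and return the local path."""
    # Imported lazily so folder_pattern works without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        repo_type=repo_type,  # assumption: the logs live in a dataset repo
        allow_patterns=[folder_pattern(folder)],
        local_dir=local_dir,
    )


# Example call (needs network access, so it is left commented out):
# download_folder("MDGA-1/quokka_logs",
#                 "batch_size1/1024_96b_1e_1b_ar",
#                 "./quokka_logs")
```

Restricting the snapshot with `allow_patterns` avoids pulling the whole repository when only one run is needed.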

Alternatively, you can use wget to download individual files from the folder directly, e.g.:

wget https://huggingface.co/datasets/MDGA-1/quokka_logs/resolve/main/batch_size1/1024_96b_1e_1b_ar/tensorboard/events.out.tfevents.1756378168.7837477398
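
The file in the example above is a TensorBoard event file, so once downloaded it can be read programmatically. The following is a minimal sketch, assuming the `tensorboard` package is installed. The scalar tag names stored in these logs are not documented here, so the sketch lists them first instead of assuming one. The function names are illustrative only.

```python
from typing import List, Tuple


def _load(event_path: str):
    # Lazy import: requires the `tensorboard` package.
    from tensorboard.backend.event_processing.event_accumulator import (
        EventAccumulator,
    )

    # size_guidance of 0 keeps every scalar point instead of subsampling.
    acc = EventAccumulator(event_path, size_guidance={"scalars": 0})
    acc.Reload()
    return acc


def list_scalar_tags(event_path: str) -> List[str]:
    """Return the names of all scalar series in an event file."""
    return _load(event_path).Tags()["scalars"]


def load_scalars(event_path: str, tag: str) -> List[Tuple[int, float]]:
    """Return (step, value) pairs for one scalar tag."""
    return [(e.step, e.value) for e in _load(event_path).Scalars(tag)]


def value_at_last_step(points: List[Tuple[int, float]]) -> float:
    """Pick the value recorded at the largest step."""
    return max(points, key=lambda p: p[0])[1]
```

A typical flow would be to call `list_scalar_tags` on the downloaded file, pick the tag of interest, and pass the result of `load_scalars` to `value_at_last_step` or to a plotting library.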

Name pattern for the compute-constrained scaling-law runs: {FLOPs}_{model_size (M)}_{training_tokens}

Name pattern for the data-constrained scaling-law runs: {model_size (M)}_{training_tokens}_{repetition}_{anneal_steps}_{warmup_steps}_{micro_batch_size}_{global_batch_size}_{learning_rate}
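
The two name patterns above can be split into labeled fields when scanning many runs. This is a small sketch, assuming that fields are separated by single underscores and that no field contains an underscore itself. The field keys and the example run name in the comment are hypothetical, and values are kept as strings because the exact encoding of each field (units, suffixes) is not specified here.

```python
from typing import Dict, Sequence

# Field orders copied from the two name patterns; the key names are our own.
COMPUTE_FIELDS = ("flops", "model_size_m", "training_tokens")
DATA_FIELDS = (
    "model_size_m", "training_tokens", "repetition", "anneal_steps",
    "warmup_steps", "micro_batch_size", "global_batch_size", "learning_rate",
)


def parse_run_name(name: str, fields: Sequence[str]) -> Dict[str, str]:
    """Split a run folder name into a {field: raw string} mapping."""
    parts = name.split("_")
    if len(parts) != len(fields):
        raise ValueError(
            f"expected {len(fields)} fields, got {len(parts)} in {name!r}"
        )
    return dict(zip(fields, parts))


# Hypothetical name, for illustration only:
# parse_run_name("1e19_100_2b", COMPUTE_FIELDS)
```

Raising on a field-count mismatch makes it obvious when a folder follows a different convention, rather than silently mislabeling its fields.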

We link the related resources below:

- [ckpt][log] Compute-constrained scaling laws
- [ckpt][log] Data-constrained scaling laws

Key Modeling and Optimization Choices

- Masked and uniform transition kernels
  - [ckpt][log] masked
  - [ckpt][log] uniform
- Diffusion schedules
  - [ckpt][log] cosine
  - [ckpt][log] linear
  - [ckpt][log] poly2
- Uniform 𝑡 vs. clean-to-noisy 𝑡 sampling
  - [ckpt][log] default
  - [ckpt][log] moving gaussian
- Principled diffusion loss and MaskGIT loss
  - [ckpt][log] diffusion
  - [ckpt][log] maskgit
- Batch size transferability from AR models to DLMs
  - 256
    - [ckpt][log] AR
    - [ckpt][log] DLM
  - 1024
    - [ckpt][log] AR
    - [ckpt][log] DLM
  - 4096
    - [ckpt][log] AR
    - [ckpt][log] DLM
- Learning rate transferability from AR models to DLMs
  - 1e-4
    - [ckpt][log] AR
    - [ckpt][log] DLM
  - 2e-4
    - [ckpt][log] AR
    - [ckpt][log] DLM
  - 4e-4
    - [ckpt][log] AR
    - [ckpt][log] DLM
- The impact of weight decay on AR models in single epoch scenarios
  - [ckpt][log] with
  - [ckpt][log] without
- The impact of weight decay on DLMs in single epoch scenarios
  - [ckpt][log] with
  - [ckpt][log] without
- The impact of weight decay on AR models in multi-epoch scenarios
  - [ckpt][log] with
  - [ckpt][log] without
- The impact of weight decay on DLMs in multi-epoch scenarios
  - [ckpt][log] with
  - [ckpt][log] without

You can refer to this script to run inference with the Hugging Face checkpoints. Because there are so many of them, most of the small checkpoints above are still in Megatron format. You can refer to this script to convert them (you will need to tweak the conversion scripts).

Citation

@article{ni2025training,
  title={Training Optimal Large Diffusion Language Models},
  author={Ni, Jinjie and Liu, Qian and Du, Chao and Dou, Longxu and Yan, Hang and Wang, Zili and Pang, Tianyu and Shieh, Michael Qizhe},
  journal={arXiv preprint arXiv:2510.03280},
  year={2025}
}
