
Training Optimal Large Diffusion Language Models

Jinjie Ni†, Qian Liu, Chao Du, Longxu Dou, Hang Yan, Zili Wang, Tianyu Pang, Michael Qizhe Shieh

†Correspondence to: Jinjie Ni <[email protected]>


Large-Scale Scaling Laws for Diffusion Language Models.


News

[2025-10-27] We release the codebase, all training checkpoints, and logs. The codebase is highly optimized and industry-grade in terms of scalability and efficiency.

[2025-10-03] The full paper is out! Check it out here!


Code

The codebase is released here. It is a highly optimized codebase for training DLMs at any scale.

Alternatively, you can use the code under the mega-dlms folder of this repo, though it may not be actively maintained.


Resources

We open-source all model checkpoints and training logs mentioned in the paper. All of them can be downloaded at https://huggingface.co/collections/jinjieni/mdga.

The easiest way to download a folder is with this script (set the variables appropriately first):

python utils/hf_download_folder.py
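
If you'd rather call the Hugging Face Hub API directly, below is a minimal sketch using huggingface_hub's snapshot_download. The repo_id and folder are taken from the wget example further down; adjust them to the resource you need:

# Minimal sketch: download a single folder from a Hugging Face dataset repo.
# Requires `pip install huggingface_hub`. The repo_id/folder mirror the wget
# example below; point them at whichever checkpoint or log folder you want.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="MDGA-1/quokka_logs",
    repo_type="dataset",
    allow_patterns="batch_size1/1024_96b_1e_1b_ar/*",  # only fetch this folder
    local_dir="./quokka_logs",
)
print(f"Files downloaded under {local_path}")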

Alternatively, you can use wget to download individual files from a folder directly, e.g.:

wget https://huggingface.co/datasets/MDGA-1/quokka_logs/tree/main/batch_size1/1024_96b_1e_1b_ar/tensorboard/events.out.tfevents.1756378168.7837477398

Runs for the compute-constrained scaling laws follow the name pattern {FLOPs}_{model_size (M)}_{training_tokens}.

Runs for the data-constrained scaling laws follow the name pattern {model_size (M)}_{training_tokens}_{repetition}_{anneal_steps}_{warmup_steps}_{micro_batch_size}_{global_batch_size}_{learning_rate}.
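
If you need to script over many runs, below is a minimal sketch that splits a run name into the fields above. The sample name is hypothetical; only the field order comes from the patterns listed here:

# Minimal sketch: map a run name onto the fields listed above.
# The sample name is hypothetical; only the field order follows the README.
COMPUTE_FIELDS = ["flops", "model_size_m", "training_tokens"]
DATA_FIELDS = ["model_size_m", "training_tokens", "repetition", "anneal_steps",
               "warmup_steps", "micro_batch_size", "global_batch_size",
               "learning_rate"]

def parse_run_name(name, fields):
    parts = name.split("_")
    if len(parts) != len(fields):
        raise ValueError(f"expected {len(fields)} fields, got {len(parts)}")
    return dict(zip(fields, parts))

print(parse_run_name("1e20_150_3e9", COMPUTE_FIELDS))  # hypothetical name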

We link the related resources below:

  • [ckpt][log] Compute-constrained scaling laws
  • [ckpt][log] Data-constrained scaling laws
  • Key Modeling and Optimization Choices
    • Masked and uniform transition kernels
    • Diffusion schedules
    • Uniform 𝑡 vs. clean-to-noisy 𝑡 sampling
    • Principled diffusion loss and MaskGIT loss
    • Batch size transferability from AR models to DLMs
    • Learning rate transferability from AR models to DLMs
    • The impact of weight decay on AR models in single epoch scenarios
    • The impact of weight decay on DLMs in single epoch scenarios
    • The impact of weight decay on AR models in multi-epoch scenarios
    • The impact of weight decay on DLMs in multi-epoch scenarios

You can refer to this script for inference with the Hugging Face checkpoints. Due to their large number, most of the small checkpoints above are still in Megatron format. You may refer to this script to convert them (the conversion scripts need some tweaking).
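
As a starting point, below is a minimal sketch for loading one of the converted Hugging Face checkpoints with transformers. The repo id is illustrative and trust_remote_code=True is an assumption (in case a checkpoint ships custom modeling code); the actual diffusion sampling loop lives in the linked inference script, not in stock transformers:

# Minimal sketch: load a converted Hugging Face checkpoint.
# The repo id is illustrative; trust_remote_code=True is an assumption in
# case the checkpoint ships custom modeling code. DLM sampling itself is
# handled by the repo's inference script, not by a stock generate() call.
import torch
from transformers import AutoModel, AutoTokenizer

repo_id = "jinjieni/some-quokka-checkpoint"  # hypothetical; pick one from the collection
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    repo_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
model.eval()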


Citation

@article{ni2025training,
  title={Training Optimal Large Diffusion Language Models},
  author={Ni, Jinjie and Liu, Qian and Du, Chao and Dou, Longxu and Yan, Hang and Wang, Zili and Pang, Tianyu and Shieh, Michael Qizhe},
  journal={arXiv preprint arXiv:2510.03280},
  year={2025}
}

About

The official GitHub repo for "Training Optimal Large Diffusion Language Models", the first-ever large-scale scaling-law study for diffusion language models.
