Haolei Bai1, Lingcheng Kong1,2, Xueyi Chen1, Jiamian Wang3, Zhiqiang Tao3, Huan Wang1
1Westlake University, 2The Hong Kong University of Science and Technology, 3Rochester Institute of Technology
## News

- 2026.03.03: We release our training and evaluation code!
- 2026.02.13: We release DICE-1.7B, DICE-4B, and DICE-8B on Hugging Face!
- 2026.02.13: The paper is on arXiv!
## Installation

```bash
conda env create -f environment.yml
```

## Data

We provide the curated CuKe dataset for SFT in `DICE/training/sft/llama_factory_sdar/data/CuKe_dataset.json`, and the data for the two stages of BiC-RL in the folder `DICE/rl_data`.

## Training

### SFT

We follow the training process of SDAR; you may check here for more instructions.

```bash
cd DICE/training/sft/llama_factory_sdar
torchrun --nnodes 1 --node_rank 0 --nproc_per_node 8 --master_addr 127.0.0.1 --master_port 12345 ./src/llamafactory/launcher.py ./examples/train_full_sdar/sdar_8b_full.yaml
```

### BiC-RL

```bash
cd DICE/training/rl

# kernel infilling stage
python rl.py config=configs/rl_sdar_kernel_infilling-8b.yaml

# end-to-end kernel generation stage
python rl.py config=configs/rl_sdar_kernel_final-8b.yaml
```

## Evaluation

We evaluate all models on KernelBench. You can train the SDAR-series models with the provided training scripts, or directly download the DICE-series models from Hugging Face.
```bash
cd DICE/evaluation

# generation
python scripts/generate_samples.py run_name=DICE_8b_level_1 dataset_src=huggingface level=1 use_local_model=True local_model_path="/path/to/DICE-8B/" gen_length=4096
python scripts/generate_samples.py run_name=DICE_8b_level_2 dataset_src=huggingface level=2 use_local_model=True local_model_path="/path/to/DICE-8B/" gen_length=4096
python scripts/generate_samples.py run_name=DICE_8b_level_3 dataset_src=huggingface level=3 use_local_model=True local_model_path="/path/to/DICE-8B/" gen_length=4096

# evaluation
python scripts/eval_from_generations.py run_name=DICE_8b_level_1 dataset_src=local level=1 timeout=300
```
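A KernelBench-style harness judges a generated kernel on two axes: numerical correctness against the reference PyTorch module, and wall-clock speedup over a baseline measured on your hardware. The sketch below illustrates those two checks with made-up numbers; the function names, tolerances, and measurements are ours for illustration, not KernelBench's actual API.

```python
# Minimal sketch of the two checks a KernelBench-style harness performs:
# (1) the candidate kernel's outputs match the reference within tolerance,
# (2) the candidate is faster than the measured baseline.
# All names and numbers here are illustrative, not KernelBench's code.

def is_correct(candidate_out, reference_out, atol=1e-4):
    """Element-wise closeness check, analogous to torch.allclose."""
    if len(candidate_out) != len(reference_out):
        return False
    return all(abs(c - r) <= atol for c, r in zip(candidate_out, reference_out))

def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate runs than the baseline."""
    return baseline_ms / candidate_ms

# Illustrative per-task results: (passed correctness?, baseline ms, candidate ms)
results = [
    (True, 2.0, 1.0),   # correct and 2.0x faster
    (True, 3.0, 4.0),   # correct but slower than baseline
    (False, 1.5, 0.5),  # fast but wrong -> does not count
]

# fast_1: fraction of tasks that are both correct and faster than baseline
fast_1 = sum(ok and speedup(b, c) > 1.0 for ok, b, c in results) / len(results)
print(f"fast_1 = {fast_1:.2f}")  # only the first task qualifies
```

A kernel that is fast but numerically wrong contributes nothing to the score, which is why the correctness gate runs before any timing comparison.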
You need to first obtain the baseline time on your hardware (please refer to KernelBench), then run the analysis:

```bash
python scripts/benchmark_eval_analysis.py run_name=DICE_8b_level_1 level=1 hardware=A100 baseline=baseline_time_torch
```

## Acknowledgements

We are grateful to SDAR, TraceRL, KernelBench, and cudaLLM for releasing their code publicly, which greatly facilitated our work.

## Citation
If you find DICE useful for your research or projects, please consider citing our work:
```bibtex
@article{bai2026dice,
  title={DICE: Diffusion Large Language Models Excel at Generating CUDA Kernels},
  author={Bai, Haolei and Kong, Lingcheng and Chen, Xueyi and Wang, Jiamian and Tao, Zhiqiang and Wang, Huan},
  journal={arXiv preprint arXiv:2602.11715},
  year={2026}
}
```