ByteDance-Seed/cudaLLM

CudaLLM: Training Language Models to Generate High-Performance CUDA Kernels


This project provides a complete pipeline for training LLMs to automatically generate efficient and correct CUDA kernels. Through a two-stage process of supervised fine-tuning (SFT) followed by reinforcement learning (RL), the framework fine-tunes a base model to write optimized CUDA code.

For demonstration purposes, this guide uses Qwen3-8B as the base model.

How It Works

The training methodology is composed of two main stages:

  1. SFT: The base LLM is first fine-tuned on a high-quality dataset of CUDA kernel examples. The data is generated by DeepSeek R1, DeepSeek Coder-7B, and Qwen2-32B.
  2. RL: After SFT, the model is further optimized through reinforcement learning. In this stage, the model generates CUDA kernels, which are then compiled and tested; this feedback serves as the reward signal that trains the model to produce valid kernels.
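The reward described in stage 2 can be sketched as a simple function of the compile-and-test outcome. The exact shape below (zero for compile failures, partial credit for compiling, a capped speedup bonus for correct kernels) is an illustrative assumption, not the reward actually used in this repository.

```python
def kernel_reward(compiled: bool, passed_tests: bool, speedup: float = 0.0) -> float:
    """Hypothetical RL reward for a generated CUDA kernel.

    0.0 if the kernel fails to compile, small partial credit if it
    compiles but produces wrong output, and full credit plus a capped
    speedup bonus if it is correct.
    """
    if not compiled:
        return 0.0
    if not passed_tests:
        return 0.1                      # compiles, but output is wrong
    return 1.0 + min(speedup, 1.0)      # correct; bonus capped at 1.0

# A correct kernel with a 0.5x speedup bonus scores 1.5.
```

Any monotone reward with this ordering (broken < compiling < correct < correct-and-fast) would drive the RL stage in the same direction.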

Getting Started

To set up and run the training pipeline, follow these steps:

Step 0: Prepare Datasets

First, process the raw datasets for SFT and RL and download the evaluation dataset; the cuda_dataset.py script handles the necessary preprocessing.

  • SFT Dataset: sft_cuda_llm_r1.parquet
  • RL Dataset: rl_cuda_llm_0424.parquet
  • Evaluation Dataset: KernelBench

Run the following command to begin:

# install verl, pinned to commit abb87bc147467589d1357dd80a1e3fefa188e11f
git clone https://github.com/volcengine/verl.git
cd verl
git checkout abb87bc147467589d1357dd80a1e3fefa188e11f
pip install --no-deps -e .
cd ..

python3 cuda_dataset.py
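Before moving on to SFT, it can help to confirm the processed parquet files exist. The filenames come from the dataset list above; the assumption that they land in the working directory is ours, so adjust the path if cuda_dataset.py writes elsewhere.

```python
from pathlib import Path

# Expected outputs, per the dataset list above.
expected = ["sft_cuda_llm_r1.parquet", "rl_cuda_llm_0424.parquet"]

def missing_files(data_dir: str) -> list:
    """Return which expected parquet files are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in expected if not (root / name).exists()]
```

Calling `missing_files(".")` after the preprocessing step should return an empty list; anything it returns still needs to be generated.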

Step 1: SFT

Next, fine-tune the base model on the prepared SFT dataset. This will adapt the model to the domain of CUDA code generation.

bash scripts/sft.sh

Step 2: Evaluate the SFT Model

After the SFT stage is complete, evaluate the model's code generation accuracy on the KernelBench benchmark. This step provides a baseline measurement of the model's capabilities before reinforcement learning.

bash scripts/eval.sh
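KernelBench-style evaluation grades a generated kernel by comparing its output against a reference implementation within a numerical tolerance. The helper below is a self-contained sketch of that check on plain Python lists; the function name and the tolerance values are illustrative assumptions, not KernelBench's actual API.

```python
def outputs_match(candidate, reference, rtol=1e-4, atol=1e-5):
    """Element-wise closeness check in the style of torch.allclose:
    each candidate value must be within atol + rtol * |ref| of the
    reference value (tolerances here are illustrative)."""
    if len(candidate) != len(reference):
        return False
    return all(abs(c - r) <= atol + rtol * abs(r)
               for c, r in zip(candidate, reference))
```

A kernel counts as correct only when this check passes for every test input; speed is measured separately, on correct kernels only.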

Step 3: RL

Finally, use reinforcement learning to further enhance the SFT model's ability to generate performant code. On each node, this stage uses the following hardware allocation:

  • 4x GPUs are dedicated to the RL training loop.
  • 4x GPUs are used to run the generated kernels, providing the reward signal needed for training.
bash scripts/rl.sh
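The 4/4 split above is typically expressed as two CUDA_VISIBLE_DEVICES masks, one for the trainer and one for the kernel-execution workers. How rl.sh actually wires this up internally is not shown here, so the helper below is just a sketch of the device partitioning.

```python
def split_gpus(n_gpus: int = 8, n_train: int = 4) -> tuple:
    """Return CUDA_VISIBLE_DEVICES strings for the RL training loop
    and for the kernel-execution (reward) workers, per the 4/4 split
    described above (8 GPUs per node is an assumption)."""
    ids = list(range(n_gpus))
    train = ",".join(map(str, ids[:n_train]))
    rollout = ",".join(map(str, ids[n_train:]))
    return train, rollout

# split_gpus() -> ("0,1,2,3", "4,5,6,7")
```

Each process then sees only its own half of the node, so kernel benchmarking never contends with training for the same devices.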

Upon completion, you will have a model specifically trained to generate high-quality CUDA kernels. You can re-run the evaluation script (eval.sh) to measure the performance uplift from the RL stage.

License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.
