ByteDance-Seed/cudaLLM

CudaLLM: Training Language Models to Generate High-Performance CUDA Kernels


This project provides a complete pipeline for training LLMs to automatically generate efficient and correct CUDA kernels. Through a two-stage process of supervised fine-tuning (SFT) followed by reinforcement learning (RL), the framework fine-tunes a base model to write optimized CUDA code.

For demonstration purposes, this guide uses Qwen3-8B as the base model.

How It Works

The training methodology is composed of two main stages:

  1. SFT: The base LLM is first fine-tuned on a high-quality dataset of CUDA kernel examples. The data is generated by DeepSeek R1, DeepSeek Coder-7B, and Qwen2-32B.
  2. RL: After SFT, the model is further optimized through reinforcement learning. In this stage, the model generates CUDA kernels, which are then compiled and tested; this feedback serves as the reward signal that trains the model to produce valid kernels.
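The reward described in stage 2 can be sketched as a simple function of the compile-and-test outcome. The exact shape below (zero for compile failures, partial credit for compiling, a capped speedup bonus for correct kernels) is an illustrative assumption, not the reward actually used in this repository.

```python
def kernel_reward(compiled: bool, passed_tests: bool, speedup: float = 0.0) -> float:
    """Hypothetical RL reward for a generated CUDA kernel.

    0.0 if the kernel fails to compile, small partial credit if it
    compiles but produces wrong output, and full credit plus a capped
    speedup bonus if it is correct.
    """
    if not compiled:
        return 0.0
    if not passed_tests:
        return 0.1                      # compiles, but output is wrong
    return 1.0 + min(speedup, 1.0)      # correct; bonus capped at 1.0

# A correct kernel with a 0.5x speedup bonus scores 1.5.
```

Any monotone reward with this ordering (broken < compiling < correct < correct-and-fast) would drive the RL stage in the same direction.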

Getting Started

To set up and run the training pipeline, follow these steps:

Step 0: Prepare Datasets

First, process the raw datasets for SFT and RL and download the evaluation dataset; the cuda_dataset.py script handles the necessary preprocessing.

  • SFT Dataset: sft_cuda_llm_r1.parquet
  • RL Dataset: rl_cuda_llm_0424.parquet
  • Evaluation Dataset: KernelBench

Run the following command to begin:

# install verl, pinned to commit abb87bc147467589d1357dd80a1e3fefa188e11f
git clone https://github.com/volcengine/verl.git
cd verl
git checkout abb87bc147467589d1357dd80a1e3fefa188e11f
pip install --no-deps -e .
cd ..

python3 cuda_dataset.py
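Before moving on to SFT, it can help to confirm the processed parquet files exist. The filenames come from the dataset list above; the assumption that they land in the working directory is ours, so adjust the path if cuda_dataset.py writes elsewhere.

```python
from pathlib import Path

# Expected outputs, per the dataset list above.
expected = ["sft_cuda_llm_r1.parquet", "rl_cuda_llm_0424.parquet"]

def missing_files(data_dir: str) -> list:
    """Return which expected parquet files are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in expected if not (root / name).exists()]
```

Calling `missing_files(".")` after the preprocessing step should return an empty list; anything it returns still needs to be generated.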

Step 1: SFT

Next, fine-tune the base model on the prepared SFT dataset. This will adapt the model to the domain of CUDA code generation.

bash scripts/sft.sh

Step 2: Evaluate the SFT Model

After the SFT stage is complete, evaluate the model's code generation accuracy on the KernelBench benchmark. This step provides a baseline measurement of the model's capabilities before reinforcement learning.

bash scripts/eval.sh
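KernelBench-style evaluation grades a generated kernel by comparing its output against a reference implementation within a numerical tolerance. The helper below is a self-contained sketch of that check on plain Python lists; the function name and the tolerance values are illustrative assumptions, not KernelBench's actual API.

```python
def outputs_match(candidate, reference, rtol=1e-4, atol=1e-5):
    """Element-wise closeness check in the style of torch.allclose:
    each candidate value must be within atol + rtol * |ref| of the
    reference value (tolerances here are illustrative)."""
    if len(candidate) != len(reference):
        return False
    return all(abs(c - r) <= atol + rtol * abs(r)
               for c, r in zip(candidate, reference))
```

A kernel counts as correct only when this check passes for every test input; speed is measured separately, on correct kernels only.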

Step 3: RL

Finally, use reinforcement learning to further enhance the SFT model's ability to generate performant code. On each node, this stage uses the following hardware allocation:

  • 4x GPUs are dedicated to the RL training loop.
  • 4x GPUs are used to run the generated kernels, providing the reward signal needed for training.
bash scripts/rl.sh
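The 4/4 split above is typically expressed as two CUDA_VISIBLE_DEVICES masks, one for the trainer and one for the kernel-execution workers. How rl.sh actually wires this up internally is not shown here, so the helper below is just a sketch of the device partitioning.

```python
def split_gpus(n_gpus: int = 8, n_train: int = 4) -> tuple:
    """Return CUDA_VISIBLE_DEVICES strings for the RL training loop
    and for the kernel-execution (reward) workers, per the 4/4 split
    described above (8 GPUs per node is an assumption)."""
    ids = list(range(n_gpus))
    train = ",".join(map(str, ids[:n_train]))
    rollout = ",".join(map(str, ids[n_train:]))
    return train, rollout

# split_gpus() -> ("0,1,2,3", "4,5,6,7")
```

Each process then sees only its own half of the node, so kernel benchmarking never contends with training for the same devices.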

Upon completion, you will have a model specifically trained to generate high-quality CUDA kernels. You can re-run the evaluation script (eval.sh) to measure the performance uplift from the RL stage.

License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.
