This project provides a complete pipeline for training large language models (LLMs) to automatically generate efficient, correct CUDA kernels. Using a two-stage process of supervised fine-tuning (SFT) followed by reinforcement learning (RL), the framework adapts a base model to write optimized CUDA code.
For demonstration purposes, this guide uses Qwen3-8B as the base model.
The training methodology is composed of two main stages:
- SFT: The base LLM is first fine-tuned on a high-quality dataset of CUDA kernel examples. The data is generated by DeepSeek R1, DeepSeek Coder-7B, and Qwen2-32B.
- RL: After SFT, the model is further optimized through reinforcement learning. In this stage, the model generates CUDA kernels which are then compiled and tested. This feedback signal is used as a reward to train the model to produce valid kernels.
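A minimal sketch of how the compile-and-test feedback described above could be shaped into a scalar reward. The function name, partial-credit values, and speedup cap are all illustrative assumptions, not the project's actual reward code:

```python
# Hypothetical reward shaping for the RL stage (values are illustrative):
# a kernel that fails to compile earns nothing, a compiling-but-incorrect
# kernel earns a small partial credit, and a correct kernel earns a reward
# that grows with its measured speedup over a reference implementation.
def kernel_reward(compiled: bool, correct: bool, speedup: float = 0.0) -> float:
    if not compiled:
        return 0.0
    if not correct:
        return 0.1  # partial credit: valid CUDA that at least builds
    # Cap the speedup bonus so a single outlier kernel does not
    # dominate the policy gradient.
    return 1.0 + min(speedup, 4.0) * 0.25
```

A shaped reward like this gives the model a learning signal at every stage (compiles → correct → fast) rather than an all-or-nothing outcome.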
To set up and run the training pipeline, follow these steps:
First, you need to process the raw datasets for SFT and RL, and download the evaluation dataset. This script handles the necessary preprocessing.
- SFT Dataset: `sft_cuda_llm_r1.parquet`
- RL Dataset: `rl_cuda_llm_0424.parquet`
- Evaluation Dataset: KernelBench
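For orientation, the sketch below shows what one SFT record might roughly look like, with a simple validity check. The field names (`prompt`, `response`) are assumptions for illustration; check the output of `cuda_dataset.py` for the actual schema.

```python
# Hypothetical shape of one SFT record (field names are assumptions;
# the real schema is produced by cuda_dataset.py and may differ).
def validate_record(rec: dict) -> bool:
    """Accept only records with a non-empty prompt and a CUDA-looking response."""
    return (
        isinstance(rec.get("prompt"), str)
        and rec["prompt"].strip() != ""
        and isinstance(rec.get("response"), str)
        and "__global__" in rec["response"]  # crude marker of a CUDA kernel
    )

example = {
    "prompt": "Write a CUDA kernel that adds two float vectors.",
    "response": "__global__ void add(const float* a, const float* b, float* c, int n) { /* ... */ }",
}
```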
Run the following command to begin:
```shell
# Install verl, pinned to commit abb87bc147467589d1357dd80a1e3fefa188e11f
git clone https://github.com/volcengine/verl.git
cd verl
git checkout abb87bc147467589d1357dd80a1e3fefa188e11f
pip install --no-deps -e .
cd ..

# Preprocess the datasets
python3 cuda_dataset.py
```

Next, fine-tune the base model on the prepared SFT dataset. This adapts the model to the domain of CUDA code generation.
```shell
bash scripts/sft.sh
```

After the SFT stage is complete, evaluate the model's code generation accuracy on the KernelBench benchmark. This step provides a baseline measurement of the model's capabilities before reinforcement learning.
```shell
bash scripts/eval.sh
```

Finally, use reinforcement learning to further enhance the SFT model's ability to generate performant code. On each node, this stage uses the following hardware allocation:
- 4x GPUs are dedicated to the RL training loop.
- 4x GPUs are used to run the generated kernels, providing the reward needed for training.
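The per-node split above can be sketched as a simple device partition. The 4/4 split comes from the text; the `CUDA_VISIBLE_DEVICES` wiring is an illustrative assumption, not taken from `scripts/rl.sh`:

```python
# Partition one node's 8 GPUs into a trainer group and a reward-rollout
# group (the env-var wiring below is illustrative, not from rl.sh).
def split_gpus(n_gpus: int = 8, n_train: int = 4):
    ids = list(range(n_gpus))
    return ids[:n_train], ids[n_train:]

train_ids, rollout_ids = split_gpus()
train_env = ",".join(map(str, train_ids))      # devices for the RL training loop
rollout_env = ",".join(map(str, rollout_ids))  # devices that run generated kernels
```

Keeping kernel execution on separate devices means a hanging or crashing generated kernel cannot stall the training loop itself.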
```shell
bash scripts/rl.sh
```

Upon completion, you will have a model specifically trained to generate high-quality CUDA kernels. You can re-run the evaluation script (`eval.sh`) to measure the performance uplift from the RL stage.
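To quantify the uplift, compare the baseline and post-RL eval runs. The snippet below is a toy comparison with invented numbers; substitute the counts actually reported by `eval.sh`:

```python
# Compare a pre-RL baseline against the post-RL run on KernelBench.
# (All numbers here are made up for illustration.)
baseline = {"correct": 31, "total": 100}  # SFT-only model
post_rl = {"correct": 45, "total": 100}   # after reinforcement learning

def accuracy(result: dict) -> float:
    return result["correct"] / result["total"]

uplift = accuracy(post_rl) - accuracy(baseline)
print(f"KernelBench accuracy uplift: {uplift:+.2%}")
```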
This project is licensed under the Apache License 2.0. See the LICENSE file for details.