
Conversation

@PaliC (Collaborator) commented Mar 18, 2025

This PR adds a Triton backend to KernelBench. To invoke it, simply add backend="triton" when calling the following four scripts (use them as normal otherwise):

  • scripts/generate_and_eval_single_sample_modal.py
  • scripts/generate_and_eval_single_sample.py
  • scripts/generate_samples.py
  • scripts/eval_from_generations.py
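For example, an invocation might look like this (illustrative; the scripts take pydra-style key=value config args, so check the exact keys against each script's config): `python scripts/generate_and_eval_single_sample.py dataset_src="huggingface" level=1 problem_id=1 backend="triton"`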

This PR also adds an {error_type}_name field to the eval JSON. This makes classifying errors (especially for Triton) much easier. From the error log alone, it often isn't obvious what an error is (e.g., you might get `at 37:15:\n h_start = pooled_row * stride - padding\n w_start = pooled_col * stride - padding\n\n # Initialize the max value\n max_val = tl.full((1,), float('-inf'), tl.float32)\n\n # Itera...`), but if the error name is triton.compiler.errors.UnsupportedLanguageConstruct, it's a lot more obvious.
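A minimal sketch of what recording the error class name next to the raw message can look like (illustrative only; the helper and JSON keys here are stand-ins, the PR's actual fields follow the {error_type}_name pattern):

```python
eval_result = {}

def run_candidate_kernel():
    # Hypothetical stand-in for compiling and executing a generated kernel.
    raise ValueError("stand-in failure")

try:
    run_candidate_kernel()
except Exception as e:
    eval_result["compilation_error"] = str(e)
    # Fully qualified class name, e.g. "triton.compiler.errors.UnsupportedLanguageConstruct"
    eval_result["compilation_error_name"] = f"{type(e).__module__}.{type(e).__qualname__}"
```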

Testing: I've tested the four scripts in both the Triton and CUDA variants, and they seem to work normally (except for scripts/generate_and_eval_single_sample_modal.py, which should be equivalent to scripts/generate_and_eval_single_sample.py).

Todo:

  • This PR is a little crude, as it just adds the logic (and linting; sorry, Meta's IDE does this automatically). There are other things left, like updating the README and a bit of refactoring (using kernel instead of cuda, for example), but I will leave those to a follow-up PR, as this one is already 1000 lines.
  • Add few-shot examples for Triton
  • Add CoT for Triton

Below is the GitHub Copilot-generated summary, which is honestly pretty useful for navigating large PRs.

==========================================================================================
This pull request includes several changes to improve code readability and add new functionality to the scripts/eval_from_generations.py and scripts/generate_and_eval_single_sample.py files. The most notable changes include reformatting code for better readability, adding a new backend configuration option, and enhancing error logging.


@george-mako

@simonguozirui Any update on when this will be merged into main?

@ai-nikolai

@PaliC is this still up to date, or have there been big changes in KernelBench since this PR was drafted?


@simonguozirui (Collaborator) commented Oct 3, 2025

@AffectionateCurry and I are back working to merge this.
The key consideration is to design code paths and a modular structure that allow for future support of other programming languages (tile-lang, ThunderKittens, HIP, NKI, CuTe, CUTLASS).

For JIT-compiled languages this is quite easy to do; for frameworks that require building and linking, it is much more complicated.
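One possible shape for that modular design (a sketch under assumptions, not the PR's actual code): JIT-style backends register a prompt constructor keyed by name, so the eval scripts never branch on specific languages; build-and-link frameworks would additionally need a compile/link step.

```python
from typing import Callable, Dict

PROMPT_CONSTRUCTORS: Dict[str, Callable[[str], str]] = {}

def register_backend(name: str):
    # Decorator that records a prompt constructor under a backend name.
    def decorator(fn: Callable[[str], str]) -> Callable[[str], str]:
        PROMPT_CONSTRUCTORS[name] = fn
        return fn
    return decorator

@register_backend("triton")
def triton_prompt(ref_arch_src: str) -> str:
    # Placeholder prompt text; the real templates live in the prompt constructors.
    return f"Rewrite the following PyTorch module using Triton kernels:\n{ref_arch_src}"

def get_prompt(ref_arch_src: str, backend: str) -> str:
    if backend not in PROMPT_CONSTRUCTORS:
        raise ValueError(f"Unsupported backend: {backend}")
    return PROMPT_CONSTRUCTORS[backend](ref_arch_src)
```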

@ai-nikolai

@simonguozirui and @AffectionateCurry - amazing work.

I am currently actively looking into KernelBench and extensions to it myself. Let me know if you are open to collaboration.


@simonguozirui (Collaborator) left a comment


Great work @AffectionateCurry @nathanjpaek!
Look through my comments and see what you all think. I like many of the abstractions you all put in (along with the changes @PaliC made).

Can we ensure the new changes don't break the existing CUDA pipeline? Test CUDA / Triton / CuTe on both the local L40S lab machine and Modal cloud execution.

    anthropic
    modal
    numpy
    openai

Why do we remove requirements.txt? We should keep it; we can think about using uv later, but let's not get rid of it here.

@nathanjpaek don't we also need to add tilelang here?


     with open(eval_file_path, "w") as f:
    -    json.dump(eval_results, f)
    +    json.dump(eval_results, f, indent=4)

great check

     elif config.dataset_src == "local":
    -    problem_idx_in_dataset = config.problem_id - 1  # due to dataset list being 0-indexed locally
    +    problem_idx_in_dataset = (
    +        config.problem_id - 1

@pythonomar22 this is something we will get rid of with your new benchmark data class, so we don't have to deal with all these nasty off-by-one issues.
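For illustration, a benchmark data class could own the indexing convention so the subtraction appears exactly once (hypothetical; the actual design is up to the new data class):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProblemRef:
    problem_id: int  # 1-indexed, as used in configs and file names

    @property
    def dataset_index(self) -> int:
        # 0-indexed position in the locally loaded dataset list.
        return self.problem_id - 1
```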


    # Use appropriate prompt constructor based on backend
    if config.backend == "cuda":
        custom_prompt = prompt_generate_custom_cuda_from_prompt_template(ref_arch_src)

@AffectionateCurry I see what you mean here now. We can refactor this later with a better prompt template!

        custom_prompt = get_prompt_for_backend(ref_arch_src, config.backend)
    else:
        raise ValueError(
            f"Unsupported backend: {config.backend}. Must be 'cuda', 'triton', or 'cute'."

Nice catch here. We shall update the README before the GPU MODE hackathon to list these as available options.

        deleted manually by the caller.
        This is a hack that is needed for triton code, as compile/exec do not play well
        with the @triton.jit decorator.

We did this for some of the multi-turn KernelBench experiments too, so we might need to support this for the CUDA code path as well.
Right now, @AffectionateCurry, you should state that this is only invoked for alternative backends.
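For context, a minimal sketch of the temp-file workaround (names are illustrative; the assumption is that @triton.jit needs to read real source off disk via inspect, which fails for strings run through plain exec()):

```python
import importlib.util
import tempfile

def load_model_from_source(src: str):
    # Write the generated code to a real .py file so decorators that inspect
    # source (like @triton.jit) can find it.
    tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".py", delete=False)
    tmp.write(src)
    tmp.close()
    spec = importlib.util.spec_from_file_location("candidate_kernel", tmp.name)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # executes the generated code
    # Assumes the generated source defines ModelNew; the temp file must be
    # deleted manually by the caller.
    return module.ModelNew, tmp.name
```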

        return ModelNew, temp_file


    # def load_tilelang_model(

we can do this in a later PR

    # Create a new module based on that spec
    temp_module = importlib.util.module_from_spec(spec)
    # Execute the code in the module's namespace
    spec.loader.exec_module(temp_module)

How safe is this, haha? We should really understand it [in case of any reward hacking].
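Not something this PR does, but one common mitigation is to execute the untrusted generated module in a separate process, so a hostile or crashing kernel cannot corrupt the evaluator itself. A rough sketch (the harness script name is hypothetical):

```python
import subprocess
import sys

def run_isolated(kernel_path: str, timeout_s: int = 300) -> subprocess.CompletedProcess:
    # Hypothetical harness that imports the kernel file and runs correctness checks.
    return subprocess.run(
        [sys.executable, "run_eval_harness.py", kernel_path],
        capture_output=True, text=True, timeout=timeout_s,
    )
```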

    import os
    from .utils import read_file

    """

great first step, we will replace this with something a bit more modular later on!

        return tensor.to(device=device)

    # Apply backend-specific dtype casting for float tensors
    # if backend.lower() == "tilelang":

Did we write this function just for tilelang?

In general, I actually quite like this abstraction; we can do some checks, etc.
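A sketch of what the casting hook under discussion might look like if the tilelang branch were enabled (assumption: tilelang kernels expect half-precision float inputs, which is why the cast exists):

```python
import torch

def cast_for_backend(tensor: torch.Tensor, backend: str) -> torch.Tensor:
    # Backend-specific dtype casting for float tensors; other backends pass through.
    if backend.lower() == "tilelang" and tensor.is_floating_point():
        return tensor.to(dtype=torch.float16)
    return tensor
```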


@simonguozirui (Collaborator) commented Oct 23, 2025

@nathanjpaek thanks for adding a one-shot example for CuTe; we can figure out how to do tilelang support in another PR.

@AffectionateCurry and I have checked the current state of the PR work for cuda, triton, and cute, on both local and Modal execution.

We will make prompt construction and eval logic more clean across backends in future PRs.

Great job @nathanjpaek @AffectionateCurry on your first PR! Thank you to @PaliC @msaroufim @Zacharias030 and the PyTorch team for your help!

siddagra pushed a commit to siddagra/KernelBench that referenced this pull request Nov 10, 2025
…ce#35)

* triton_backend_v2

* fix eval bugs

* fix issues

* revert eval

* remove traceback

* remove cot

* improve eval

* looked over pr and added future support for other languages

* updated requirements

* added back requirements.txt

* add cute one shot addition example

* remove unnecessary files and redo requirements

* let's see if that fixes it

* fix config in file suggested soksoerey

* move natalia's old file into change log

---------

Co-authored-by: AffectionateCurry <[email protected]>
Co-authored-by: nathanjpaek <[email protected]>
Co-authored-by: Simon Guo <[email protected]>
