
Conversation


@simonguozirui simonguozirui commented Nov 10, 2025

PR Change Overview

  • Move timing-related functions to a separate file outside of eval
  • Implement multiple timing evaluation methods
  • This enables the kernel timing blog post with @PaliC
  • This will also help inform best practices for kernel timing

We will do another PR to further modularize the eval function.

Concretely, we implemented:

  • cuda_event – Device-side timing with torch.cuda.Event; measures kernel runtime on a single stream under a cold L2 cache; used as the default kernel timing method, as in the original KernelBench (see the sketch after this list).
  • do_bench – Thin wrapper over Triton’s triton.testing.do_bench; uses Triton’s time-based warmup/rep logic and returns all runtimes, matching Triton’s native benchmarking semantics.
  • do_bench_impl – Re-implementation of Triton do_bench that preserves its device-side timing but exposes explicit num_warmup / num_trials and aggressively clears Triton’s cache each trial.
  • host_time – Host-side wall-clock timing via time.perf_counter; measures end-to-end latency including Python, launch overhead, CUDA runtime, all streams, and OS scheduling; not recommended when you want pure device kernel time.
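
For context, here is a minimal sketch of what the cuda_event path looks like conceptually. It is illustrative only: the function name time_with_cuda_events and the num_warmup / num_trials arguments are placeholders rather than the exact names in this PR, and the cold-cache clearing between trials is shown further below.

```python
import torch

def time_with_cuda_events(fn, num_warmup=5, num_trials=100):
    """Illustrative device-side timing with torch.cuda.Event (not the exact PR code)."""
    # Warmup so one-time compilation / autotuning cost does not pollute the measurements.
    for _ in range(num_warmup):
        fn()
    torch.cuda.synchronize()

    times_ms = []
    for _ in range(num_trials):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn()
        end.record()
        # elapsed_time is only valid once both events have completed on the GPU.
        torch.cuda.synchronize()
        times_ms.append(start.elapsed_time(end))
    return times_ms
```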

More on do_bench from @PaliC:
For do_bench, https://triton-lang.org/main/python-api/generated/triton.testing.do_bench.html gives us what we want, but the function takes (total) milliseconds rather than iteration counts when deciding how many warmup and measurement reps to run. The change simply copies the function over (it is fairly self-contained) and replaces the timing math with explicit iteration counts. Hence we implemented both: do_bench (kept as close to the original as possible, since folks expect do_bench to behave a certain way) and a "do_bench-like" method, do_bench_impl, which exposes explicit control over warmup and measurement iterations.
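
Roughly, the difference boils down to how the number of reps is chosen: Triton's do_bench times a short calibration run and converts its millisecond budgets into iteration counts, whereas do_bench_impl takes the counts directly. A simplified sketch of that conversion (illustrative, not the exact upstream code):

```python
import torch

def estimate_reps_from_ms(fn, warmup_ms=25, rep_ms=100):
    """Roughly how a time-budget API turns millisecond budgets into iteration counts."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(5):  # short calibration run to estimate per-call runtime
        fn()
    end.record()
    torch.cuda.synchronize()
    estimate_ms = start.elapsed_time(end) / 5
    n_warmup = max(1, int(warmup_ms / estimate_ms))
    n_repeat = max(1, int(rep_ms / estimate_ms))
    return n_warmup, n_repeat

# do_bench_impl skips this calibration entirely and takes explicit
# num_warmup / num_trials arguments instead.
```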

We also implement L2 cache clearing via thrashing (as shown in the GPU mode PR and the Triton runtime). We focus the eval timing on cold-cache performance.
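
The cache clearing itself is simple: allocate a buffer larger than the GPU's L2 cache and overwrite it before each timed call, so the kernel under test always starts from a cold L2. A minimal sketch follows; the 256 MB size is an illustrative over-approximation of H100/H200 L2 (on the order of 50 MB), not a value taken from this PR.

```python
import torch

# Buffer comfortably larger than the L2 cache of current data-center GPUs.
l2_thrash_buffer = torch.empty(256 * 1024 * 1024, dtype=torch.int8, device="cuda")

def clear_l2_cache():
    """Evict L2 by writing through a buffer larger than the cache."""
    l2_thrash_buffer.zero_()

# Inside a timing loop: clear, then time the kernel from a cold cache, e.g.
#   clear_l2_cache(); start.record(); fn(); end.record()
```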

They are all implemented behind a unified interface. We also added a top-level argument to the eval entry point to specify which timing_method to use.
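
As a rough picture of what such a unified interface can look like (hypothetical names; the registry, decorator, and measure_kernel_runtime below are illustrative, not this PR's exact API):

```python
from typing import Callable, Dict, List

# Hypothetical registry of timing backends keyed by the timing_method string.
TIMING_METHODS: Dict[str, Callable[..., List[float]]] = {}

def register_timing_method(name: str):
    """Decorator that registers a timing backend under a string key."""
    def wrap(fn: Callable[..., List[float]]) -> Callable[..., List[float]]:
        TIMING_METHODS[name] = fn
        return fn
    return wrap

def measure_kernel_runtime(fn, timing_method: str = "cuda_event", **kwargs) -> List[float]:
    """Dispatch to the selected backend; each backend returns per-trial runtimes in ms."""
    if timing_method not in TIMING_METHODS:
        raise ValueError(f"Unknown timing_method {timing_method!r}; choose from {sorted(TIMING_METHODS)}")
    return TIMING_METHODS[timing_method](fn, **kwargs)
```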

Co-authored-by: Simon Guo <[email protected]>
Co-authored-by: Pietro Marsella <[email protected]>
@simonguozirui simonguozirui added the enhancement New feature or request label Nov 10, 2025
@simonguozirui simonguozirui changed the title [WiP] Multiple Timing Eval Implementation and Reogranization [WiP] Multiple Timing Eval Implementation Dec 12, 2025
@simonguozirui simonguozirui changed the title [WiP] Multiple Timing Eval Implementation Multiple Timing Eval Implementation Dec 12, 2025

@PaliC PaliC left a comment


I'm not too sure what the point of discard_first is; otherwise this generally looks solid. I do think that if we are only supporting CUDA, we can actually get rid of the do_bench implementation, as at this point it's very similar to timing with cuda_event.

@simonguozirui
Collaborator Author

I timed with the new test_eval_timing.py; do_bench and cuda_event are similar for a (M, N, K = 2048) matmul on an NVIDIA H200 with 5 warmup iterations and 100 measurement reps (not specifiable for do_bench, since its rep count is adaptive):

| Timing Method | Mean (ms) | Std | Min | Max | Trials |
|---|---|---|---|---|---|
| cuda_event | 0.344 | 0.000636 | 0.343 | 0.346 | 100 |
| host_time | 0.404 | 0.000803 | 0.401 | 0.406 | 100 |
| do_bench | 0.343 | 0.00036 | 0.343 | 0.344 | 190 |
| do_bench_impl | 0.343 | 0.000272 | 0.343 | 0.344 | 100 |

All device-side timing methods (cuda_event, do_bench, do_bench_impl) agree at ~0.343-0.344 ms with low variance. host_time shows the expected ~0.06 ms of additional host-side overhead and synchronization cost.

Thanks @PaliC for the review. I added discard_first for host-side timing due to initialization cost. For cuda_event and Triton do_bench, relying on warmup should achieve the same effect. However, I do note that with cuda_event the first trial always has a slightly higher time (some overhead I haven't tracked down yet, still playing around with it), so for consistency I added discard_first.
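
For reference, a minimal sketch of what the host_time path with discard_first can look like (illustrative names and structure, assuming time.perf_counter plus a device sync per trial; not the exact PR code):

```python
import time
import torch

def time_with_host_clock(fn, num_warmup=5, num_trials=100, discard_first=True):
    """Illustrative host-side wall-clock timing; includes launch overhead and sync cost."""
    for _ in range(num_warmup):
        fn()
    torch.cuda.synchronize()

    times_ms = []
    for _ in range(num_trials + int(discard_first)):
        t0 = time.perf_counter()
        fn()
        torch.cuda.synchronize()  # wait for the GPU so the wall clock covers the full call
        times_ms.append((time.perf_counter() - t0) * 1e3)

    # The first measured trial often carries one-time host-side initialization cost.
    return times_ms[1:] if discard_first else times_ms
```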

@simonguozirui
Collaborator Author

Thank you for the review @alexzhang13 @bkal01, and also for the feedback from @ngc92.
Shoutout to @Marsella8 for helping with the refactor (great job doing research this quarter)!

Now @PaliC and I will keep writing the kernel profiling blog post with these functions and will continue updating the eval / timing code with new timing-related utilities.

For the CUDA event profiling method, this version assumes no adversarial CUDA stream (i.e., computation is done on the main stream). We will have a separate PR (with unit tests based on suggestions from @bkal01, @alexzhang13, and @ngc92) for a version that can hopefully guard against adversarial CUDA streams.
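
To illustrate the caveat: if a submission launches its work on a side stream, events recorded on the main stream never bracket that work, and the measured time collapses toward zero. A contrived example (illustrative only, not taken from this PR's tests):

```python
import torch

side_stream = torch.cuda.Stream()
x = torch.randn(4096, 4096, device="cuda")

def adversarial_forward(x):
    # The matmul escapes to a side stream, so main-stream events do not bracket it.
    with torch.cuda.stream(side_stream):
        return x @ x

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()          # recorded on the main (current) stream
adversarial_forward(x)
end.record()            # the main stream had no work enqueued in between
torch.cuda.synchronize()
print(start.elapsed_time(end))  # near zero, even though a large matmul ran
```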

Merging this for now and we will have new PRs building on top of this!

@simonguozirui simonguozirui merged commit 737c1eb into main Dec 16, 2025
@simonguozirui
Collaborator Author

More on timing: a good reference is GPU mode's timing eval using CUDA events from their latest competitions.
See their implementation link.

Another helpful resource is this lecture from GPU mode on how to do kernel benchmarking!

@LeoXinhaoLee

> I timed with the new test_eval_timing.py; do_bench and cuda_event are similar for a (M, N, K = 2048) matmul on an NVIDIA H200 [...] (timing table and discussion quoted from the comment above)

Here's a third-party verification of the timing numbers in the quoted table, but timed on Modal's H200.

In my test, I adjusted _run_timing_smoke_test_matmul in src/unit_tests/test_eval_timing.py to return the timing stats, and wrapped it in a Modal App function like the one below:

# Modal infra for running the timing smoke test remotely
import os
import modal

# Assumed import path, based on the adjusted src/unit_tests/test_eval_timing.py mentioned above.
from src.unit_tests.test_eval_timing import _run_timing_smoke_test_matmul

app = modal.App("timing_unit_test_modal")
timeout = 1800
cuda_version = "12.8.0"  # should be no greater than host CUDA version
flavor = "devel"  # includes full CUDA toolkit
operating_sys = "ubuntu22.04"
tag = f"{cuda_version}-{flavor}-{operating_sys}"

# REPO_TOP_PATH and KERNEL_BENCH_PATH are defined elsewhere in the script.
image = (
    modal.Image.from_registry(f"nvidia/cuda:{tag}", add_python="3.10")
    .apt_install("git",
                 "gcc-10",
                 "g++-10",
                 "clang"  # note i skip a step
                 )
    .pip_install_from_requirements(os.path.join(REPO_TOP_PATH, "requirements.txt"))
    .add_local_dir(
        KERNEL_BENCH_PATH,
        remote_path="/root/KernelBench"
    )
    .add_local_python_source("src")
)

@app.cls(image=image, gpu="H200", scaledown_window=5)  # gpu request is assumed here; the original snippet omitted it
class EvalFunc:

    @modal.method()
    def measure_program_time(self, *args, **kwargs):
        """
        Measure the time of a KernelBench reference architecture
        with each of the supported timing methods.
        """
        try:
            stats_list = []
            timing_methods = ["cuda_event", "host_time", "do_bench", "do_bench_impl"]

            for timing_method in timing_methods:
                stats = _run_timing_smoke_test_matmul(timing_method)
                stats_list.append({"method": timing_method, "stats": stats})

            return stats_list

        except Exception as e:
            print(f"[Eval] Error in Measuring Performance: {e}")
I did 5 rounds of tests on a Modal H200 and reproduced the numbers in @simonguozirui's table reasonably well, except that cuda_event sometimes has a slightly higher std and max, and its mean sometimes deviates a bit more from the reported value. In general, though, the numbers are reasonably close to the local ones.

[Image: reproduced timing results across 5 rounds on Modal H200]

@simonguozirui
Collaborator Author

That's awesome @LeoXinhaoLee, thanks for trying this so fast!

I will add that Modal script to our next PR for the eval unit tests! Thank you for your contribution and testing.

Note on Modal: sometimes when you request an H100 it might give you a variant like an H200 or an H100 with a different config (@willhu-jpg @charlesfrye, unless there are ways we can explicitly control for that?). In a future PR, we will try to collect that metadata as part of the profiling results.

@charlesfrye

"h100!" should always get you an H100 SXM and "h200" should always get you an H200

@LeoXinhaoLee

"h100!" should always get you an H100 SXM and "h200" should always get you an H200

Hi @charlesfrye, thank you for sharing. I looked up this doc, and it seems like "h100!" is indeed used to prevent auto-upgrading to an H200.

However, although the doc says all H100s on Modal are SXM, I have seen cases where Modal gives me an H100 NVL when I request "h100". Would you know if "h100!" strictly gets an H100 SXM rather than an NVL as well?

@simonguozirui
Collaborator Author

I will add the "h100!" request in a future PR to enforce that (and prevent the auto-upgrade-to-H200 issue).

