
Conversation


@simonguozirui simonguozirui commented Nov 10, 2025

PR Change Overview

  • Move timing-related functions to a separate file outside of eval
  • Implement multiple timing evaluation methods
  • This enables the kernel timing blog post with @PaliC
  • This will also help inform best practices for kernel timing

We will do another PR to further modularize the eval function.

Concretely, we implemented:

  • cuda_event – Device-side timing with torch.cuda.Event; measures kernel runtime on a single stream under a cold L2 cache; used as the default kernel timing method, as in the original KernelBench (see the sketch after this list).
  • do_bench – Thin wrapper over Triton’s triton.testing.do_bench; uses Triton’s time-based warmup/rep logic and returns all runtimes, matching Triton’s native benchmarking semantics.
  • do_bench_impl – Re-implementation of Triton do_bench that preserves its device-side timing but exposes explicit num_warmup / num_trials and aggressively clears Triton’s cache each trial.
  • host_time – Host-side wall-clock timing via time.perf_counter; measures end-to-end latency including Python, launch overhead, CUDA runtime, all streams, and OS scheduling; not recommended when you want pure device kernel time.
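
For context, here is a minimal sketch of what the cuda_event path looks like conceptually. It is illustrative only: the function name time_with_cuda_events and the num_warmup / num_trials arguments are placeholders rather than the exact names in this PR, and the cold-cache clearing between trials is shown further below.

```python
import torch

def time_with_cuda_events(fn, num_warmup=5, num_trials=100):
    """Illustrative device-side timing with torch.cuda.Event (not the exact PR code)."""
    # Warmup so one-time compilation / autotuning cost does not pollute the measurements.
    for _ in range(num_warmup):
        fn()
    torch.cuda.synchronize()

    times_ms = []
    for _ in range(num_trials):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn()
        end.record()
        # elapsed_time is only valid once both events have completed on the GPU.
        torch.cuda.synchronize()
        times_ms.append(start.elapsed_time(end))
    return times_ms
```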

More on do_bench from @PaliC:
For do_bench, https://triton-lang.org/main/python-api/generated/triton.testing.do_bench.html gives us what we want, but the function takes (total) milliseconds rather than iteration counts when deciding how many warmup and measurement reps to run. The change simply copies the function over (it is fairly self-contained) and replaces the timing math with explicit iteration counts. Hence we implemented both: do_bench (kept as close to the original as possible, since folks expect do_bench to behave a certain way) and a "do_bench-like" method, do_bench_impl, which exposes explicit control over warmup and measurement iterations.
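
Roughly, the difference boils down to how the number of reps is chosen: Triton's do_bench times a short calibration run and converts its millisecond budgets into iteration counts, whereas do_bench_impl takes the counts directly. A simplified sketch of that conversion (illustrative, not the exact upstream code):

```python
import torch

def estimate_reps_from_ms(fn, warmup_ms=25, rep_ms=100):
    """Roughly how a time-budget API turns millisecond budgets into iteration counts."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(5):  # short calibration run to estimate per-call runtime
        fn()
    end.record()
    torch.cuda.synchronize()
    estimate_ms = start.elapsed_time(end) / 5
    n_warmup = max(1, int(warmup_ms / estimate_ms))
    n_repeat = max(1, int(rep_ms / estimate_ms))
    return n_warmup, n_repeat

# do_bench_impl skips this calibration entirely and takes explicit
# num_warmup / num_trials arguments instead.
```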

We also implement L2 cache clearing via thrashing (as shown in the GPU mode PR and the Triton runtime). We focus the eval timing on cold-cache performance.
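
The cache clearing itself is simple: allocate a buffer larger than the GPU's L2 cache and overwrite it before each timed call, so the kernel under test always starts from a cold L2. A minimal sketch follows; the 256 MB size is an illustrative over-approximation of H100/H200 L2 (on the order of 50 MB), not a value taken from this PR.

```python
import torch

# Buffer comfortably larger than the L2 cache of current data-center GPUs.
l2_thrash_buffer = torch.empty(256 * 1024 * 1024, dtype=torch.int8, device="cuda")

def clear_l2_cache():
    """Evict L2 by writing through a buffer larger than the cache."""
    l2_thrash_buffer.zero_()

# Inside a timing loop: clear, then time the kernel from a cold cache, e.g.
#   clear_l2_cache(); start.record(); fn(); end.record()
```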

They are all implemented behind a unified interface. We also added a top-level argument to the eval entry point to specify which timing_method to use.
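
As a rough picture of what such a unified interface can look like (hypothetical names; the registry, decorator, and measure_kernel_runtime below are illustrative, not this PR's exact API):

```python
from typing import Callable, Dict, List

# Hypothetical registry of timing backends keyed by the timing_method string.
TIMING_METHODS: Dict[str, Callable[..., List[float]]] = {}

def register_timing_method(name: str):
    """Decorator that registers a timing backend under a string key."""
    def wrap(fn: Callable[..., List[float]]) -> Callable[..., List[float]]:
        TIMING_METHODS[name] = fn
        return fn
    return wrap

def measure_kernel_runtime(fn, timing_method: str = "cuda_event", **kwargs) -> List[float]:
    """Dispatch to the selected backend; each backend returns per-trial runtimes in ms."""
    if timing_method not in TIMING_METHODS:
        raise ValueError(f"Unknown timing_method {timing_method!r}; choose from {sorted(TIMING_METHODS)}")
    return TIMING_METHODS[timing_method](fn, **kwargs)
```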

Co-authored-by: Simon Guo <[email protected]>
Co-authored-by: Pietro Marsella <[email protected]>
@simonguozirui simonguozirui added the enhancement New feature or request label Nov 10, 2025
@simonguozirui simonguozirui changed the title [WiP] Multiple Timing Eval Implementation and Reogranization [WiP] Multiple Timing Eval Implementation Dec 12, 2025
@simonguozirui simonguozirui changed the title [WiP] Multiple Timing Eval Implementation Multiple Timing Eval Implementation Dec 12, 2025

@PaliC PaliC left a comment


I'm not too sure what the point of discard_first is; otherwise this generally looks solid. I do think that if we are only supporting CUDA, we can actually get rid of the do_bench implementation, as at this point it's very similar to timing with cuda_event.

@simonguozirui
Collaborator Author

I timed with the new test_eval_timing.py; do_bench and cuda_event are similar for a (M, N, K = 2048) matmul on an NVIDIA H200 with 5 warmup iterations and 100 measurement reps (not specifiable for do_bench, since its rep count is adaptive):

| Timing Method | Mean (ms) | Std | Min | Max | Trials |
|---|---|---|---|---|---|
| cuda_event | 0.344 | 0.000636 | 0.343 | 0.346 | 100 |
| host_time | 0.404 | 0.000803 | 0.401 | 0.406 | 100 |
| do_bench | 0.343 | 0.00036 | 0.343 | 0.344 | 190 |
| do_bench_impl | 0.343 | 0.000272 | 0.343 | 0.344 | 100 |

All device-side timing methods (cuda_event, do_bench, do_bench_impl) agree at ~0.343-0.344 ms with low variance. host_time shows the expected ~0.06 ms of additional host-side overhead and synchronization cost.

Thanks @PaliC for the review. I added discard_first for host-side timing due to initialization cost. For cuda_event and Triton do_bench, relying on warmup should achieve the same effect. However, I do note that with cuda_event the first trial always has a slightly higher time (some overhead I haven't tracked down yet, still playing around with it), so for consistency I added discard_first.
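
For reference, a minimal sketch of what the host_time path with discard_first can look like (illustrative names and structure, assuming time.perf_counter plus a device sync per trial; not the exact PR code):

```python
import time
import torch

def time_with_host_clock(fn, num_warmup=5, num_trials=100, discard_first=True):
    """Illustrative host-side wall-clock timing; includes launch overhead and sync cost."""
    for _ in range(num_warmup):
        fn()
    torch.cuda.synchronize()

    times_ms = []
    for _ in range(num_trials + int(discard_first)):
        t0 = time.perf_counter()
        fn()
        torch.cuda.synchronize()  # wait for the GPU so the wall clock covers the full call
        times_ms.append((time.perf_counter() - t0) * 1e3)

    # The first measured trial often carries one-time host-side initialization cost.
    return times_ms[1:] if discard_first else times_ms
```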

@simonguozirui
Collaborator Author

Thank you for the review @alexzhang13 @bkal01, and also for the feedback from @ngc92.
Shoutout to @Marsella8 for helping with the refactor (great job doing research this quarter)!

Now @PaliC and I will keep writing the kernel profiling blog post with these functions and will continue updating the eval / timing code with new timing-related utilities.

For the CUDA event profiling method, this version assumes no adversarial CUDA stream (i.e., computation is done on the main stream). We will have a separate PR (with unit tests based on suggestions from @bkal01, @alexzhang13, and @ngc92) for a version that can hopefully guard against adversarial CUDA streams.
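
To illustrate the caveat: if a submission launches its work on a side stream, events recorded on the main stream never bracket that work, and the measured time collapses toward zero. A contrived example (illustrative only, not taken from this PR's tests):

```python
import torch

side_stream = torch.cuda.Stream()
x = torch.randn(4096, 4096, device="cuda")

def adversarial_forward(x):
    # The matmul escapes to a side stream, so main-stream events do not bracket it.
    with torch.cuda.stream(side_stream):
        return x @ x

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()          # recorded on the main (current) stream
adversarial_forward(x)
end.record()            # the main stream had no work enqueued in between
torch.cuda.synchronize()
print(start.elapsed_time(end))  # near zero, even though a large matmul ran
```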

Merging this for now and we will have new PRs building on top of this!

@simonguozirui simonguozirui merged commit 737c1eb into main Dec 16, 2025
@simonguozirui
Collaborator Author

More on timing: a good reference is GPU mode's timing eval using CUDA events from their latest competitions.
See their implementation link.

Another helpful resource is this lecture from GPU mode on how to do kernel benchmarking!

@LeoXinhaoLee

> I timed with the new test_eval_timing.py; do_bench and cuda_event are similar for a (M, N, K = 2048) matmul on an NVIDIA H200 [...] (timing table and discussion quoted from the comment above)

Here's a third-party verification of the timing numbers in the quoted table, but timed on Modal's H200.

In my test, I adjusted _run_timing_smoke_test_matmul in src/unit_tests/test_eval_timing.py to return the timing stats, and wrapped it in a Modal App function like the one below:

# Modal infra for running the timing smoke test remotely
import os
import modal

# Assumed import path, based on the adjusted src/unit_tests/test_eval_timing.py mentioned above.
from src.unit_tests.test_eval_timing import _run_timing_smoke_test_matmul

app = modal.App("timing_unit_test_modal")
timeout = 1800
cuda_version = "12.8.0"  # should be no greater than host CUDA version
flavor = "devel"  # includes full CUDA toolkit
operating_sys = "ubuntu22.04"
tag = f"{cuda_version}-{flavor}-{operating_sys}"

# REPO_TOP_PATH and KERNEL_BENCH_PATH are defined elsewhere in the script.
image = (
    modal.Image.from_registry(f"nvidia/cuda:{tag}", add_python="3.10")
    .apt_install("git",
                 "gcc-10",
                 "g++-10",
                 "clang"  # note i skip a step
                 )
    .pip_install_from_requirements(os.path.join(REPO_TOP_PATH, "requirements.txt"))
    .add_local_dir(
        KERNEL_BENCH_PATH,
        remote_path="/root/KernelBench"
    )
    .add_local_python_source("src")
)

@app.cls(image=image, gpu="H200", scaledown_window=5)  # gpu request is assumed here; the original snippet omitted it
class EvalFunc:

    @modal.method()
    def measure_program_time(self, *args, **kwargs):
        """
        Measure the time of a KernelBench reference architecture
        with each of the supported timing methods.
        """
        try:
            stats_list = []
            timing_methods = ["cuda_event", "host_time", "do_bench", "do_bench_impl"]

            for timing_method in timing_methods:
                stats = _run_timing_smoke_test_matmul(timing_method)
                stats_list.append({"method": timing_method, "stats": stats})

            return stats_list

        except Exception as e:
            print(f"[Eval] Error in Measuring Performance: {e}")
I did 5 rounds of tests on a Modal H200 and reproduced the numbers in @simonguozirui's table reasonably well, except that cuda_event sometimes has a slightly higher std and max, and its mean sometimes deviates a bit more from the reported value. In general, though, the numbers are reasonably close to the local ones.

[Image: reproduced timing results across 5 rounds on Modal H200]

@simonguozirui
Collaborator Author

That's awesome @LeoXinhaoLee, thanks for trying this so fast!

I will add that Modal script to our next PR for the eval unit tests! Thank you for your contribution and testing.

Note on Modal: sometimes when you request an H100 it might give you a variant like an H200 or an H100 with a different config (@willhu-jpg @charlesfrye, unless there are ways we can explicitly control for that?). In a future PR, we will try to collect that metadata as part of the profiling results.

@charlesfrye

"h100!" should always get you an H100 SXM and "h200" should always get you an H200

@LeoXinhaoLee

"h100!" should always get you an H100 SXM and "h200" should always get you an H200

Hi @charlesfrye, thank you for sharing. I looked up this doc, and it seems like "h100!" is indeed used to prevent auto-upgrading to an H200.

However, although the doc says all H100s on Modal are SXM, I have seen cases where Modal gives me an H100 NVL when I request "h100". Would you know if "h100!" strictly gets an H100 SXM rather than an NVL as well?

@simonguozirui
Collaborator Author

I will add the "h100!" request in a future PR to enforce that (and prevent the auto-upgrade-to-H200 issue).

