Multiple Timing Eval Implementation #89
Conversation
Co-authored-by: Simon Guo <[email protected]> Co-authored-by: Pietro Marsella <[email protected]>
…to here; a few others to implement
PaliC left a comment:
I'm not too sure what the point of discard_first is; otherwise this generally looks solid. I do think that if we are only supporting CUDA we can actually get rid of the do_bench implementation, as at this point it's very similar to timing with CUDA events.
I timed with the new …

All device-side timing methods (…

Thanks @PaliC for review. I added …
Thank you for the review @alexzhang13 @bkal01, and also for the feedback from @ngc92. Now @PaliC and I will keep writing the kernel profiling blog post with these functions and continually update the eval / timing functions with new timing-related utilities. For the CUDA Events profiling method, this version assumes no adversarial CUDA stream (i.e., computation is done on the main stream). We will have a separate PR (with unit tests based on suggestions from @bkal01, @alexzhang13, and @ngc92) for a version that can hopefully guard against adversarial CUDA streams. Merging this for now, and we will have new PRs building on top of this!
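To illustrate why the main-stream assumption matters, here is a minimal sketch (not code from this PR; the helper names are made up) of how CUDA-event timing on the current stream can miss work that an adversarial submission launches on a side stream:

```python
import torch

def time_with_cuda_events(fn):
    """Sketch: time fn with CUDA events recorded on the current (main) stream."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end)  # milliseconds between the two events

# An "adversarial" submission could launch its real work on a side stream and
# return immediately; the events only bracket the main stream, so the measured
# time can badly underestimate the actual device work.
side_stream = torch.cuda.Stream()

def sneaky_kernel():
    with torch.cuda.stream(side_stream):
        a = torch.randn(4096, 4096, device="cuda")
        _ = a @ a  # expensive matmul runs on side_stream, unseen by the timer
```

Since the side-stream work is not ordered against the events recorded on the timing stream, the end event can complete long before the matmul finishes, which is why this version assumes all computation stays on the main stream.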
More on timing: a good reference is GPU MODE's timing eval using CUDA Events from their latest competitions. Another helpful resource is this lecture from GPU MODE on how to do kernel benchmarking!
Here's a third-party verification of the timing numbers in this table, but timed on Modal's H200. In my test, I adjusted … I did 5 rounds of tests on a Modal H200 and reproduced the numbers in @simonguozirui's table reasonably well, except …
That's awesome @LeoXinhaoLee, thanks for trying this so fast! I will add that Modal script to our next PR for eval unit tests. Thank you for your contribution and testing. A note on Modal: sometimes when you request an H100 it might get you a variant like an H200 or an H100 with a different config (@willhu-jpg @charlesfrye unless there are ways we can explicitly control for that?). In a future PR, we will try to collect that metadata as part of the profiling result.
"h100!" should always get you an H100 SXM and "h200" should always get you an H200 |
Hi @charlesfrye, thank you for sharing. I looked up this doc and it seems like "h100!" is indeed used to prevent auto-upgrading to an H200. However, although the doc says all H100s on Modal are SXM, I have seen cases in which Modal gives me an H100 NVL when I request "h100". Would you know if "h100!" will strictly get an H100 SXM instead of an NVL as well?
I will add the "h100!" request in a future PR to enforce that (and prevent the auto-upgrade-to-H200 issue).

PR Change Overview
- Moved timing-related functions to a separate file outside of eval
- Added multiple timing / evaluation methods
- We will do another PR to further modularize the eval function.

Concretely, we implemented:
- `cuda_event` – Device-side timing with `torch.cuda.Event`; measures kernel runtime on a single stream under a cold L2 cache; used as the default kernel timing method, as in the original KernelBench (see the sketch after this list).
- `do_bench` – Thin wrapper over Triton's `triton.testing.do_bench`; uses Triton's time-based warmup/rep logic and returns all runtimes, matching Triton's native benchmarking semantics.
- `do_bench_impl` – Re-implementation of Triton's `do_bench` that preserves its device-side timing but exposes explicit `num_warmup` / `num_trials` and aggressively clears Triton's cache each trial.
- `host_time` – Host-side wall-clock timing via `time.perf_counter`; measures end-to-end latency including Python, launch overhead, CUDA runtime, all streams, and OS scheduling; not recommended when you want pure device kernel time.
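As a concrete illustration of the `cuda_event` style described above, here is a minimal sketch of device-side event timing; the function name, defaults, and loop structure are assumptions for illustration, not the PR's exact code (L2-cache clearing is sketched separately below):

```python
import torch

def time_kernel_cuda_event(kernel_fn, num_warmup=3, num_trials=10):
    """Sketch: device-side timing of kernel_fn with torch.cuda.Event."""
    for _ in range(num_warmup):      # warmup runs, not timed
        kernel_fn()
    torch.cuda.synchronize()

    times_ms = []
    for _ in range(num_trials):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()               # recorded on the current stream
        kernel_fn()
        end.record()
        torch.cuda.synchronize()     # wait for the device before reading the events
        times_ms.append(start.elapsed_time(end))  # milliseconds
    return times_ms
```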
More on do_bench from @PaliC:

For do_bench, https://triton-lang.org/main/python-api/generated/triton.testing.do_bench.html gives us what we want, but the function takes in (total) milliseconds instead of iteration counts for calculating warmup reps and measurement reps. The changes simply copy over the function (which is fairly self-contained) and overwrite the timing math with explicit iteration counts. Hence we implemented both: `do_bench` (as close to the original as possible, since folks expect do_bench to act a certain way) and a "do_bench-like timing", `do_bench_impl`, which has explicit control over warmup reps and measurement reps.

We also implement L2 cache clearing via thrashing (as shown in the GPU MODE PR and the Triton runtime). We focus the eval timing on cold-cache performance; a sketch of the thrashing idea follows.
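A rough sketch of the thrashing approach, assuming a buffer comfortably larger than the GPU's L2 cache (the buffer size and helper name are illustrative; Triton's do_bench zeroes a similar buffer between reps):

```python
import torch

# Allocated once; 256 MB is comfortably larger than H100's ~50 MB L2.
_l2_buffer = torch.empty(256 * 1024 * 1024, dtype=torch.int8, device="cuda")

def flush_l2_cache():
    """Evict previously cached lines by writing a large buffer (cache thrashing)."""
    _l2_buffer.zero_()
```

In the measurement loop, flush_l2_cache() would be called right before each timed trial so every run starts from a cold L2.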
They are all implemented with a unified interface. We also added a top-level eval entry-point argument to specify which `timing_method` to use.
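For example, the host_time path and the timing_method dispatch could look roughly like this; every name below is a hypothetical illustration, not the PR's actual API:

```python
import time
import torch

def host_time(kernel_fn, num_warmup=3, num_trials=10):
    """Sketch: host-side wall-clock timing via time.perf_counter."""
    for _ in range(num_warmup):
        kernel_fn()
    torch.cuda.synchronize()
    times_ms = []
    for _ in range(num_trials):
        t0 = time.perf_counter()
        kernel_fn()
        torch.cuda.synchronize()     # include device completion in the wall-clock span
        times_ms.append((time.perf_counter() - t0) * 1e3)
    return times_ms

# Hypothetical dispatch table keyed by timing_method; the other methods from the
# list above would be registered the same way.
TIMING_METHODS = {
    "host_time": host_time,
    # "cuda_event": ..., "do_bench": ..., "do_bench_impl": ...
}

def eval_kernel(kernel_fn, timing_method="host_time", **timing_kwargs):
    return TIMING_METHODS[timing_method](kernel_fn, **timing_kwargs)
```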