Eval Unit Tests for Adversarial Eval Testing #82

bkal01 · 2025-11-04T23:33:15Z

adds unit tests for eval scripts

eval scripts should:

flag kernels that try to hack by modifying the input as incorrect
- tested by having a custom kernel zero out the inputs and return a zero matrix
- if the eval script runs the custom kernel first and then PyTorch (as done in here), then the custom kernel will incorrectly pass
- so, we should at least run the tests in both directions (as mentioned in the benchmarking doc) OR clone the inputs so any modifications the kernel makes doesn't affect the PyTorch run
flag kernels that try to hack by reusing PyTorch computations as incorrect
- tested by using empty which can get allocated the same physical memory as the PyTorch reference outputs
- if the eval script deletes/frees the PyTorch output object at some point before the custom kernel is run, the CUDA cache allocator might give that un-erased physical memory to the custom kernel and it will incorrectly pass
- so, we should ensure we zero out physical memory to prevent reuse
flag kernels that achieve excessive speedup as potentially reward hacked
- tested by having a custom kernel allocate a matmul to a non-default stream, achieving extremely unrealistic speedups when timed via CUDA events.
- eval script should time the reference kernel and flag this speedup as something the user should double check via the KernelExecResult metadata.

Adds a unit test to check that a generated kernel which modifies the original inputs fails the correctness check. For the square matmul problem, the kernel zeros out the inputs and returns a matrix of 0s. This will fail correctness/pass the test as long as the reference implementation is ran first. If we swap the order, the test will fail as the reference implementation will operate on tensors of 0s and it will look like the generated kernel computed the correct output.

Adds a unit test to check that a generated kernel which attempts to access the result from the PyTorch reference model in memory fails the correctness check. If a generated kernel uses empty_like, the CUDA caching allocator can re-use the physical memory of the previously computed result. All the kernel needs to do is return immediately and it will pass the correctness check. Note that in order to reproduce this, we need to copy the PyTorch output to the CPU and delete the output object. Then empty_like will fetch the physical memory for the output object.

…nto eval-unit-tests

use generic matmul shape for cache reuse adversarial kernel rather than requiring a square matmul.

…nto eval-unit-tests

make a non-blocking non-default stream, and use cublasGemmEx rather than at::matmul:

eval script now flags excessive speedups by timing pytorch reference.

simonguozirui · 2025-12-19T03:09:59Z

Thanks @bkal01 to create the adversarial kernel with additional cuda stream.. Now we have unit test and eval timing functions that only time the main cuda_stream might suffer from such attack, but we have added a heuristics way to check it (see if speedup is bigger than some threshold like 10x or 5x).

Here is an example using naive do_bench with the hacky stream kernel.

CUDA_VISIBLE_DEVICES=7 python src/unit_tests/test_eval_adversarial.py 
Running test adversarial kernel non_default_stream_kernel.py against problem level1/1_Square_matrix_multiplication_.py
[Profiling] Using timing method: do_bench
[WARNING] Excessive speedup 1947.37x over 10x threshold detected
[WARNING] Double check your kernel carefully to ensure it is not reward hacking.
compiled=True correctness=True metadata={'hardware': 'NVIDIA H200', 'device': '0', 'correctness_trials': '(5 / 5)', 'excessive_speedup': True} runtime=0.00304 runtime_stats={'mean': 0.00304, 'std': 0.000107, 'min': 0.00298, 'max': 0.00326, 'num_trials': 13} ref_runtime=5.92 ref_runtime_stats={'mean': 5.92, 'std': 1.14, 'min': 5.18, 'max': 7.69, 'num_trials': 17}
Traceback (most recent call last):
  File "/home/simon/kb-maintain/src/unit_tests/test_eval_adversarial.py", line 108, in <module>
    main()
  File "/home/simon/kb-maintain/src/unit_tests/test_eval_adversarial.py", line 105, in main
    test_non_default_stream()
  File "/home/simon/kb-maintain/src/unit_tests/test_eval_adversarial.py", line 96, in test_non_default_stream
    raise AssertionError(
AssertionError: Excessive speedup detected, Eval Function did not handle hacky stream

simonguozirui · 2025-12-19T07:45:03Z

We added an optional and gated logic in the eval functioneval_kernel_against_ref adds new param: check_for_excessive_speedup (bool), excessive_speedup_threshold (float), and now populates ref_runtime and ref_runtime_stats on the KernelExecResult. If the kernel's speedup exceeds the threshold, it sets result.metadata["excessive_speedup"] = True and prints a warning like this

[WARNING] Excessive speedup 1906.75x over 10x threshold detected
[WARNING] Double check your kernel carefully to ensure it is not reward hacking.

… other PRs)

simonguozirui · 2025-12-19T07:48:53Z

Tysm @bkal01 for the great work and being super careful. These unit tests would be super helpful for us to test the eval function with adversarial examples. Merging these for now but we can add more later.

Right now we added a simple excessive speedup check (heuristics like >5x, 10x) mark it as suspicious. A better approach is to create a SoL modeling (ongoing effort) based on program ops and hardware specs.

Also started to add the draft of eval / benchmarking guide here. @PaliC and team will pick up in other PRs.

bkal01 added 2 commits November 4, 2025 20:00

bkal01 force-pushed the eval-unit-tests branch from f24640f to 9097e65 Compare November 5, 2025 04:49

bkal01 changed the title ~~[WIP] add unit tests for input mod~~ add unit tests for input mod Nov 5, 2025

bkal01 requested a review from simonguozirui November 5, 2025 05:05

bkal01 changed the title ~~add unit tests for input mod~~ add eval unit tests Nov 7, 2025

simonguozirui mentioned this pull request Nov 22, 2025

Revamped Eval Function #38

Open

simonguozirui added 2 commits November 29, 2025 07:52

Merge branch 'main' of github-simon:ScalingIntelligence/KernelBench i…

2ed8033

…nto eval-unit-tests

test bhavesh's unit test

357957e

simonguozirui changed the title ~~add eval unit tests~~ Eval Unit Tests for Adversarial Correctness Testing Nov 29, 2025

use generic shape

5bb2679

use generic matmul shape for cache reuse adversarial kernel rather than requiring a square matmul.

bkal01 force-pushed the eval-unit-tests branch from c734bbc to 5bb2679 Compare December 1, 2025 21:23

simonguozirui mentioned this pull request Dec 10, 2025

[Roadmap] Fall 2025 KernelBench Maintenance + Improvement Plan #74

Open

28 tasks

excessive speedup unit test via non default stream

3d7ff72

bkal01 force-pushed the eval-unit-tests branch from dc02d5c to 3d7ff72 Compare December 13, 2025 00:57

simonguozirui mentioned this pull request Dec 17, 2025

Benchmarking guide #106

Open

simonguozirui and others added 6 commits December 18, 2025 02:42

Merge branch 'main' of github-simon:ScalingIntelligence/KernelBench i…

7e98616

…nto eval-unit-tests

update timing signature

eab8a88

update non default stream kernel

55eca8a

make a non-blocking non-default stream, and use cublasGemmEx rather than at::matmul:

reduce trial for adverseiral stream hack

ab9e61f

show unrealistic speedup

251a03e

flag excessive speedups in eval script

4fc9751

eval script now flags excessive speedups by timing pytorch reference.

simonguozirui and others added 2 commits December 19, 2025 03:13

reogranize a bit to flag timing on mian stream fails

99b9faa

update EVAL.md with unit test summary

4f51089

simonguozirui assigned bkal01 Dec 19, 2025

simonguozirui added the enhancement New feature or request label Dec 19, 2025

ready for merge, update guide a bit (more for sahan to keep adding in…

61085c5

… other PRs)

simonguozirui changed the title ~~Eval Unit Tests for Adversarial Correctness Testing~~ Eval Unit Tests for Adversarial Eval Testing Dec 19, 2025

simonguozirui merged commit fd57302 into main Dec 19, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Eval Unit Tests for Adversarial Eval Testing #82

Eval Unit Tests for Adversarial Eval Testing #82

Uh oh!

bkal01 commented Nov 4, 2025 •

edited

Loading

Uh oh!

simonguozirui commented Dec 19, 2025

Uh oh!

simonguozirui commented Dec 19, 2025

Uh oh!

simonguozirui commented Dec 19, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Eval Unit Tests for Adversarial Eval Testing #82

Eval Unit Tests for Adversarial Eval Testing #82

Uh oh!

Conversation

bkal01 commented Nov 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

simonguozirui commented Dec 19, 2025

Uh oh!

simonguozirui commented Dec 19, 2025

Uh oh!

simonguozirui commented Dec 19, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

bkal01 commented Nov 4, 2025 •

edited

Loading