
Conversation

@pythonomar22
Collaborator

@pythonomar22 pythonomar22 commented Dec 1, 2025

No description provided.

Add optional path parameters and JSON output to benchmark_eval_analysis.py

- Add baseline_file param to override default baseline path
- Add eval_results_dir param to override default runs directory
- Add output_file param to write results as JSON
- Return results dict from analyze_greedy_eval()
- All changes backward compatible (existing usage unchanged)
@pythonomar22 pythonomar22 force-pushed the leaderboard-analysis-json branch from 153e0cb to 87ba3c2 on December 29, 2025 at 02:45
@pythonomar22 pythonomar22 changed the title from "Add optional path parameters and JSON output to benchmark_eval_analys…" to "updates for leaderboard!" on Dec 31, 2025
@pythonomar22
Collaborator Author

this has just sort of become a PR that contains all the changes needed to work with the external leaderboard repo. a couple of things here:

  1. i thought it'd be good to add H100 baseline timings produced on Modal rather than on the Lambda Labs machines or the cluster we have now. we evaluate leaderboard kernels on Modal, so to keep things consistent we want to compare against a baseline produced through Modal as well.
  2. the leaderboard repo has KB as a submodule, and benchmark_eval_analysis currently has no path inputs/outputs: it can't take an arbitrary path for the baseline JSON file or write its results to a provided path/file. the cleanest approach here is to modify benchmark_eval_analysis to allow this (a rough sketch of what the new signature could look like is below this list).
  3. when running eval_from_generations, kernels that hit import errors were never logged to eval_results.json (the output of eval_from_generations). this was sort of overlooked/allowed before, since all of the kernels that didn't make it into the results file were failures anyway, but it's obviously not intended behavior, so i fixed it and made sure every kernel and its failure gets logged. i also added the ability to specify which problem ids to run. this is not any different from the subset option from before, but the way you specified a subset (which i did in a previous PR) was kind of convoluted; this is cleaner, and you just pass problem_ids=['2', '5', ...], for example.
  4. the generate_baseline_time_modal script was not updated for the new uv and package restructuring changes (can reference the PR here), so i fixed a broken import. it was also running everything sequentially on Modal, which was very slow, so i added evaluator.spawn calls to parallelize things, the same approach we take in all of our other modal scripts (see the sketches after this list).
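
For reference, a minimal sketch of what the backward-compatible signature could look like. The parameter names (baseline_file, eval_results_dir, output_file) come from the PR description, but the default paths, the metric computation, and the results schema are hypothetical placeholders, not the actual benchmark_eval_analysis.py code.

```python
# Hypothetical sketch only: default paths, helper logic, and the results schema
# are placeholders, not the actual benchmark_eval_analysis.py implementation.
import json
from pathlib import Path


def analyze_greedy_eval(
    run_name: str,
    hardware: str,
    level: int,
    baseline_file: str | None = None,      # new: override the default baseline timing JSON
    eval_results_dir: str | None = None,   # new: override the default runs directory
    output_file: str | None = None,        # new: optionally write the results dict as JSON
) -> dict:
    # Fall back to the historical defaults when no override is given,
    # so existing call sites keep working unchanged.
    baseline_path = Path(baseline_file) if baseline_file else Path("results/timing") / hardware / "baseline_time.json"
    runs_dir = Path(eval_results_dir) if eval_results_dir else Path("runs") / run_name

    baseline = json.loads(baseline_path.read_text())
    eval_results = json.loads((runs_dir / "eval_results.json").read_text())

    # ... compute correctness and speedup metrics against the baseline here ...
    results = {"level": level, "num_evaluated": len(eval_results), "num_baseline": len(baseline)}

    if output_file is not None:
        out = Path(output_file)
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_text(json.dumps(results, indent=2))

    return results
```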
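
And a rough sketch of the .spawn() fan-out used for the baseline timing script. The function name, arguments, and GPU/timeout settings are illustrative; only the spawn-then-get pattern itself is the point.

```python
# Illustrative only: evaluate_baseline and its arguments are placeholders,
# but the spawn/get fan-out is the same pattern used in the other Modal scripts.
import modal

app = modal.App("generate-baseline-time-sketch")


@app.function(gpu="H100", timeout=600)
def evaluate_baseline(problem_id: int) -> dict:
    # Placeholder body: time the reference kernel for one problem on an H100.
    return {"problem_id": problem_id, "runtime_ms": 0.0}


@app.local_entrypoint()
def main():
    problem_ids = list(range(1, 101))  # illustrative subset of problems

    # Fan out: one remote call per problem instead of a sequential loop.
    handles = [evaluate_baseline.spawn(pid) for pid in problem_ids]

    # Collect results as each remote call finishes.
    results = [handle.get() for handle in handles]
    print(f"collected {len(results)} baseline timings")
```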

@simonguozirui
Collaborator

LGTM, tysm for the thoughtful PR @pythonomar22
For context, these are the steps that enable us to support evaluating frontier models on KernelBench. We don't plan to offer this ourselves, but we hope we can share SoTA model performance on KernelBench more reliably and reproducibly.

  1. H100 Modal baseline timings - Makes sense to have Modal-produced baselines since that's where leaderboard evaluations run. Keeps things consistent. I checked the JSON and it is all H100 (though Modal sometimes bumps an H100 request to an H200, the results in this PR are all H100).

  2. benchmark_eval_analysis path overrides - I like the path overrides! The baseline_file, eval_results_dir, and output_file params let the leaderboard repo specify arbitrary paths without modifying KB. The old hard-coded path dates back to when @willhu-jpg and I were experimenting for ICML last year.

  3. eval_from_generations fixes

  • I like the problem_id list, though we keep the subset tuple approach as well.
  • The import error logging fix is important - now all kernels get logged to eval_results.json regardless of failure type. I added some annotations to document the inner vs. outer catch levels (a sketch of that pattern follows below).
  • The reason we didn't catch some of those outer-level failures when running locally (namely, the parallel-sample monkey experiments) was an even more outer-level eval retry logic. But it's super valid to log these as failures in this script (esp. for Modal infra issues: GPU attachment, network, multi-processing issues).

  4. generate_baseline_time_modal parallelization - The .spawn() pattern matches our other Modal scripts. Much faster than sequential.
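
On the inner vs. outer catch levels, here is a minimal sketch of the idea under assumed names: the per-kernel evaluator is passed in as a callable and the result schema is hypothetical; the point is only that a kernel which fails before evaluation (e.g. an import error) still gets an entry in eval_results.json instead of being silently dropped.

```python
# Hypothetical sketch: the evaluator callable and the result schema are
# placeholders; the "log every kernel, even outer-level failures" idea is
# what the PR changes.
import json
import traceback
from typing import Callable


def eval_all_kernels(
    kernels: dict[str, str],
    results_path: str,
    eval_single_kernel: Callable[[str], dict],
) -> None:
    eval_results: dict[str, dict] = {}
    for problem_id, kernel_src in kernels.items():
        try:
            # Inner level: compile/run the kernel and check it against the reference.
            eval_results[problem_id] = eval_single_kernel(kernel_src)
        except Exception as err:
            # Outer level: these kernels used to be missing from the results
            # file entirely; now they are recorded explicitly as failures.
            eval_results[problem_id] = {
                "compiled": False,
                "correctness": False,
                "error": f"{type(err).__name__}: {err}",
                "traceback": traceback.format_exc(),
            }

    with open(results_path, "w") as f:
        json.dump(eval_results, f, indent=2)
```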

@simonguozirui simonguozirui merged commit 1f8d20f into main Dec 31, 2025
ethanboneh pushed a commit that referenced this pull request Jan 6, 2026
* Add optional path parameters and JSON output to benchmark_eval_analysis.py

- Add baseline_file param to override default baseline path
- Add eval_results_dir param to override default runs directory
- Add output_file param to write results as JSON
- Return results dict from analyze_greedy_eval()
- All changes backward compatible (existing usage unchanged)

* h100 modal timing, and some changes

* lgtm; nit small annotation

---------

Co-authored-by: Simon Guo <[email protected]>