
Conversation

@pythonomar22
Collaborator

@pythonomar22 pythonomar22 commented Dec 1, 2025

No description provided.

Add optional path parameters and JSON output to benchmark_eval_analysis.py

- Add baseline_file param to override default baseline path
- Add eval_results_dir param to override default runs directory
- Add output_file param to write results as JSON
- Return results dict from analyze_greedy_eval()
- All changes backward compatible (existing usage unchanged)
@pythonomar22 pythonomar22 force-pushed the leaderboard-analysis-json branch from 153e0cb to 87ba3c2 on December 29, 2025 at 02:45
@pythonomar22 pythonomar22 changed the title from "Add optional path parameters and JSON output to benchmark_eval_analys…" to "updates for leaderboard!" on Dec 31, 2025
@pythonomar22
Collaborator Author

this has just sort of become a PR that contains all the changes needed to work with the external leaderboard repo. a couple of things here:

  1. i thought it'd be good to add H100 baseline timings produced on Modal rather than on the Lambda Labs machines or the cluster we have now. we evaluate leaderboard kernels on Modal, so to keep things consistent we want to compare against a baseline produced through Modal as well.
  2. the leaderboard repo has KB as a submodule, and benchmark_eval_analysis currently has no path inputs/outputs: it can't take an arbitrary path for the baseline JSON file or write its results to a provided path/file. the cleanest approach here is to modify benchmark_eval_analysis to allow this (a rough sketch of what the new signature could look like is below this list).
  3. when running eval_from_generations, kernels that hit import errors were never logged to eval_results.json (the output of eval_from_generations). this was sort of overlooked/allowed before, since all of the kernels that didn't make it into the results file were failures anyway, but it's obviously not intended behavior, so i fixed it and made sure every kernel and its failure gets logged. i also added the ability to specify which problem ids to run. this is not any different from the subset option from before, but the way you specified a subset (which i did in a previous PR) was kind of convoluted; this is cleaner, and you just pass problem_ids=['2', '5', ...], for example.
  4. the generate_baseline_time_modal script was not updated for the new uv and package restructuring changes (can reference the PR here), so i fixed a broken import. it was also running everything sequentially on Modal, which was very slow, so i added evaluator.spawn calls to parallelize things, the same approach we take in all of our other modal scripts (see the sketches after this list).
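
For reference, a minimal sketch of what the backward-compatible signature could look like. The parameter names (baseline_file, eval_results_dir, output_file) come from the PR description, but the default paths, the metric computation, and the results schema are hypothetical placeholders, not the actual benchmark_eval_analysis.py code.

```python
# Hypothetical sketch only: default paths, helper logic, and the results schema
# are placeholders, not the actual benchmark_eval_analysis.py implementation.
import json
from pathlib import Path


def analyze_greedy_eval(
    run_name: str,
    hardware: str,
    level: int,
    baseline_file: str | None = None,      # new: override the default baseline timing JSON
    eval_results_dir: str | None = None,   # new: override the default runs directory
    output_file: str | None = None,        # new: optionally write the results dict as JSON
) -> dict:
    # Fall back to the historical defaults when no override is given,
    # so existing call sites keep working unchanged.
    baseline_path = Path(baseline_file) if baseline_file else Path("results/timing") / hardware / "baseline_time.json"
    runs_dir = Path(eval_results_dir) if eval_results_dir else Path("runs") / run_name

    baseline = json.loads(baseline_path.read_text())
    eval_results = json.loads((runs_dir / "eval_results.json").read_text())

    # ... compute correctness and speedup metrics against the baseline here ...
    results = {"level": level, "num_evaluated": len(eval_results), "num_baseline": len(baseline)}

    if output_file is not None:
        out = Path(output_file)
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_text(json.dumps(results, indent=2))

    return results
```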
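
And a rough sketch of the .spawn() fan-out used for the baseline timing script. The function name, arguments, and GPU/timeout settings are illustrative; only the spawn-then-get pattern itself is the point.

```python
# Illustrative only: evaluate_baseline and its arguments are placeholders,
# but the spawn/get fan-out is the same pattern used in the other Modal scripts.
import modal

app = modal.App("generate-baseline-time-sketch")


@app.function(gpu="H100", timeout=600)
def evaluate_baseline(problem_id: int) -> dict:
    # Placeholder body: time the reference kernel for one problem on an H100.
    return {"problem_id": problem_id, "runtime_ms": 0.0}


@app.local_entrypoint()
def main():
    problem_ids = list(range(1, 101))  # illustrative subset of problems

    # Fan out: one remote call per problem instead of a sequential loop.
    handles = [evaluate_baseline.spawn(pid) for pid in problem_ids]

    # Collect results as each remote call finishes.
    results = [handle.get() for handle in handles]
    print(f"collected {len(results)} baseline timings")
```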

@simonguozirui
Collaborator

LGTM, tysm for the thoughtful PR @pythonomar22
For context, these are the steps that enable us to support evaluating frontier models on KernelBench. We don't plan to offer this ourselves, but we hope we can share SoTA model performance on KernelBench more reliably and reproducibly.

  1. H100 Modal baseline timings - Makes sense to have Modal-produced baselines since that's where leaderboard evaluations run. Keeps things consistent. I checked the JSON and it is all H100 (though Modal sometimes bumps an H100 request to an H200, the results in this PR are all H100).

  2. benchmark_eval_analysis path overrides - I like the path overrides! The baseline_file, eval_results_dir, and output_file params let the leaderboard repo specify arbitrary paths without modifying KB. The old hard-coded path dates back to when @willhu-jpg and I were experimenting for ICML last year.

  3. eval_from_generations fixes

  • I like the problem_id list, though we keep the subset tuple approach as well.
  • The import error logging fix is important - now all kernels get logged to eval_results.json regardless of failure type. I added some annotations to document the inner vs. outer catch levels (a sketch of that pattern follows below).
  • The reason we didn't catch some of those outer-level failures when running locally (namely, the parallel-sample monkey experiments) was an even more outer-level eval retry logic. But it's super valid to log these as failures in this script (esp. for Modal infra issues: GPU attachment, network, multi-processing issues).

  4. generate_baseline_time_modal parallelization - The .spawn() pattern matches our other Modal scripts. Much faster than sequential.
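
On the inner vs. outer catch levels, here is a minimal sketch of the idea under assumed names: the per-kernel evaluator is passed in as a callable and the result schema is hypothetical; the point is only that a kernel which fails before evaluation (e.g. an import error) still gets an entry in eval_results.json instead of being silently dropped.

```python
# Hypothetical sketch: the evaluator callable and the result schema are
# placeholders; the "log every kernel, even outer-level failures" idea is
# what the PR changes.
import json
import traceback
from typing import Callable


def eval_all_kernels(
    kernels: dict[str, str],
    results_path: str,
    eval_single_kernel: Callable[[str], dict],
) -> None:
    eval_results: dict[str, dict] = {}
    for problem_id, kernel_src in kernels.items():
        try:
            # Inner level: compile/run the kernel and check it against the reference.
            eval_results[problem_id] = eval_single_kernel(kernel_src)
        except Exception as err:
            # Outer level: these kernels used to be missing from the results
            # file entirely; now they are recorded explicitly as failures.
            eval_results[problem_id] = {
                "compiled": False,
                "correctness": False,
                "error": f"{type(err).__name__}: {err}",
                "traceback": traceback.format_exc(),
            }

    with open(results_path, "w") as f:
        json.dump(eval_results, f, indent=2)
```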

@simonguozirui simonguozirui merged commit 1f8d20f into main Dec 31, 2025
ethanboneh pushed a commit that referenced this pull request Jan 6, 2026
* Add optional path parameters and JSON output to benchmark_eval_analysis.py

- Add baseline_file param to override default baseline path
- Add eval_results_dir param to override default runs directory
- Add output_file param to write results as JSON
- Return results dict from analyze_greedy_eval()
- All changes backward compatible (existing usage unchanged)

* h100 modal timing, and some changes

* lgtm; nit small annotation

---------

Co-authored-by: Simon Guo <[email protected]>