Yuki-Imajuku
commented
Sep 21, 2025
- Add evaluation scripts
- Update version setting in pyproject.toml
@codex Please review this PR carefully.
Pull Request Overview
This PR adds a comprehensive evaluation framework for Large Language Models (LLMs) on ALE-Bench, including supporting scripts and configuration files.
- Introduces the ale_bench_eval package with modules for repeated sampling, self-refinement, and automated evaluation
- Updates version configuration to use dynamic versioning from uv
- Fixes Docker container management issues in input generation and session closing
Reviewed Changes
Copilot reviewed 63 out of 64 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| src/ale_bench_eval/ | Complete evaluation framework with 12 new modules for LLM benchmarking |
| scripts/run_eval.sh | Bash script for automated evaluation execution |
| llm_configs/ | 26 JSON configuration files for various LLM models |
| docs/evaluation.md | Comprehensive documentation for the evaluation tool |
| pyproject.toml | Version configuration updates and new eval dependencies |
| src/ale_bench/ | Docker container management fixes |
| results/.gitignore | Results directory setup |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
Codex Review: Here are some suggestions.
Reply with @codex fix comments to fix any unresolved comments.
About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you open a pull request for review, mark a draft as ready, or comment "@codex review". If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex fix this CI failure" or "@codex address that feedback".
@codex Review again.
Pull Request Overview
Copilot reviewed 64 out of 65 changed files in this pull request and generated 10 comments.
Codex Review: Here are some suggestions.
Reply with @codex fix comments to fix any unresolved comments.