
examples : add llama-eval#21152

Draft
ggerganov wants to merge 51 commits into master from gg/scripts-eval


@ggerganov
Member

@ggerganov ggerganov commented Mar 29, 2026

Overview

ref #18195
cont #18892

Adds a lean and mean evaluation tool:

  • Single Python script
  • Datasets: AIME, AIME2025, GSM8K, GPQA
  • Graders: regex, llm, custom
  • Store evaluation state in a json file
  • Realtime results
  • Output to stdout and HTML (with reasoning traces)
  • Supports stop/resume
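
A minimal sketch of how a JSON state file can support stop/resume. This is not taken from llama-eval.py: the file layout and the names `load_state`, `save_state`, and `run` are hypothetical, and `solve` stands in for the request to the server.

```python
import json
import os
import tempfile


def load_state(path):
    """Return the saved state, or a fresh one if no state file exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"tasks": {}}


def save_state(path, state):
    """Write to a temp file and rename, so an interrupted run never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f, indent=2)
    os.replace(tmp, path)


def run(path, task_ids, solve):
    """Process only the tasks that are not already marked done in the state file."""
    state = load_state(path)
    for task_id in task_ids:
        if state["tasks"].get(task_id, {}).get("status") == "done":
            continue  # finished in an earlier session
        state["tasks"][task_id] = {"status": "done", "answer": solve(task_id)}
        save_state(path, state)  # persist after every task
    return state
```

Saving after every task means a stopped run loses at most the task that was in flight; rerunning with the same state file picks up where it left off.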

Sample usage:

# start a new AIME25 evaluation of gpt-oss-20b (low) using gpt-oss-20b (medium) as grader
python3 llama-eval.py \
  --server http://127.0.0.1:8013 --model gpt-oss-20b-hf-low  \
  --judge-server http://127.0.0.1:9013 --judge-model gpt-oss-20b-hf-medium \
  --grader-type llm --dataset aime2025 --n_cases 240 \
  --temperature 1.0 --top-k 0 --top-p 1.0 --min-p 0.01 --threads 240 \
  --output aime2025-gpt-oss-20b-low-x8.json --seed 1234

# corresponding llama-server that will perform the computation
# note: no need for checkpoints and prompt caching
# note: for most evals you need at least -np 8 for reasonable eval time
./bin/llama-server \
  -hf ggml-org/gpt-oss-20b-GGUF -c 4194304 -np 256 \
  --port 8013 --host 0.0.0.0 \
  -cram 0 --ctx-checkpoints 0 \
  --chat-template-kwargs '{"reasoning_effort": "low"}'

# grader on port 9013
./bin/llama-server \
  -hf ggml-org/gpt-oss-20b-GGUF -c 32768 -np 1 \
  --port 9013 --host 0.0.0.0 \
  --chat-template-kwargs '{"reasoning_effort": "medium"}'

Sample results:

CLI
Loading AIME2025 dataset...
AIME2025 dataset loaded: 15 questions
Loading AIME2025 dataset (part 2)...
AIME2025 dataset loaded: 30 questions (total)

Tasks:
  Task ID             Dataset  Prompt (first 40 chars)                        Expected    Answer       Tokens  Status
  aime2025_000_020     AIME2025   Circle $\omega_1$ with radius 6 centered at...    293        N/A        N/A    pending
  aime2025_000_006     AIME2025   The twelve letters $A,B,C,D,E,F,G,H,I,J,K$,...    821        N/A        N/A    pending
  aime2025_000_008     AIME2025   The parabola with equation $y=x^{2}-4$ is r...    62         N/A        N/A    pending
  aime2025_000_004     AIME2025   There are $8!=40320$ eight-digit positive i...    279        N/A        N/A    pending
  aime2025_000_015     AIME2025   Six points $ A, B, C, D, E, $ and $ F $ lie...    468        N/A        N/A    pending
  aime2025_000_028     AIME2025   Let $ \triangle ABC $ be a right triangle w...    104        N/A        N/A    pending
  aime2025_000_013     AIME2025   Let $ABCDE$ be a convex pentagon with $AB=1...    60         N/A        N/A    pending
  aime2025_000_023     AIME2025   There are $ n $ values of $ x $ in the inte...    149        N/A        N/A    pending
  aime2025_000_022     AIME2025   From an unlimited supply of 1-cent coins, 1...    610        N/A        N/A    pending
  aime2025_000_019     AIME2025   Suppose $ \triangle ABC $ has angles $ \ang...    336        N/A        N/A    pending
  aime2025_000_012     AIME2025   Alex divides a disk into four quadrants wit...    204        N/A        N/A    pending
  aime2025_000_029     AIME2025   There are exactly three positive real numbe...    240        N/A        N/A    pending
  aime2025_000_009     AIME2025   The 27 cells of a $3\times9$ grid are fille...    81         N/A        N/A    pending
  aime2025_000_010     AIME2025   A piecewise linear periodic function is def...    259        N/A        N/A    pending
  aime2025_000_005     AIME2025   An isosceles trapezoid has an inscribed cir...    504        N/A        N/A    pending
  aime2025_000_016     AIME2025   Find the sum of all positive integers $ n $...    49         N/A        N/A    pending

Processing 240 AIME2025 tasks ...
Server: http://192.168.1.62:8014 (model: gpt-oss-20b-hf-low)
Grader: llm
Threads: 240
Sampling: temp=1.0, top-k=0, top-p=1.0, min-p=0.01
  1/240  aime2025_007_011     AIME2025   The set of points in 3-dimensional coordina...    510        78         157    ✗  [  0/  1, 0.000]
  2/240  aime2025_004_013     AIME2025   Let $ABCDE$ be a convex pentagon with $AB=1...    60         96         250    ✗  [  0/  2, 0.000]
  3/240  aime2025_005_023     AIME2025   There are $ n $ values of $ x $ in the inte...    149        N/A        N/A    ✗  [  0/  3, 0.000]
  4/240  aime2025_001_013     AIME2025   Let $ABCDE$ be a convex pentagon with $AB=1...    60         33         385    ✗  [  0/  4, 0.000]
  5/240  aime2025_002_011     AIME2025   The set of points in 3-dimensional coordina...    510        28         428    ✗  [  0/  5, 0.000]
  6/240  aime2025_006_013     AIME2025   Let $ABCDE$ be a convex pentagon with $AB=1...    60         42         515    ✗  [  0/  6, 0.000]
  7/240  aime2025_006_019     AIME2025   Suppose $ \triangle ABC $ has angles $ \ang...    336        336        530    ✓  [  1/  7, 0.143]
  8/240  aime2025_006_002     AIME2025   The 9 members of a baseball team went to an...    16         16         541    ✓  [  2/  8, 0.250]
  9/240  aime2025_007_029     AIME2025   There are exactly three positive real numbe...    240        0          573    ✗  [  2/  9, 0.222]
 10/240  aime2025_007_000     AIME2025   Find the sum of all integer bases $b>9$ for...    70         70         577    ✓  [  3/ 10, 0.300]
 11/240  aime2025_000_000     AIME2025   Find the sum of all integer bases $b>9$ for...    70         70         558    ✓  [  4/ 11, 0.364]
 12/240  aime2025_003_000     AIME2025   Find the sum of all integer bases $b>9$ for...    70         70         590    ✓  [  5/ 12, 0.417]
 ...

Session time: 3454.6s | Total accumulated time: 3454.6s
============================================================
Results: 91/240 correct (37.9%)
============================================================

Eval state dumped to aime2025-gpt-oss-20b-low-x8.json

HTML: llama-eval-state-aime2025-gpt-oss-120b-high-x4.json.html


Additional information

I've been vibe coding this from time to time using local models and OpenCode. Given that I don't write Python, I would guess the quality of the implementation is quite poor, though I've tried to keep it minimalistic. The current implementation is almost feature complete relative to what I initially imagined, but I haven't yet found the time to wrap it up completely. If anyone is interested in helping, feel free to PR to this branch.

TODOs:

  • Speed tracking (tok/s)
  • Support passing multiple evaluation servers in order to distribute the eval tasks to more machines
  • Better (i.e. simpler) HTML layout. Easier to read results
  • Result uncertainty estimate
  • Unslop
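
For the "Result uncertainty estimate" item, one possible approach (an assumption, not something the script does today) is a Wilson score interval on the pass rate. The helper name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# The sample run above: 91 correct out of 240 cases.
low, high = wilson_interval(91, 240)
print(f"{91 / 240:.3f}  [{low:.3f}, {high:.3f}]")
```

For the 91/240 sample run this gives roughly 32% to 44%. The 240 cases there are 30 questions sampled repeatedly, so they are not independent draws and the interval should be read as a rough guide only.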

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES, OpenCode + Qwen3 30B Coder, GLM 4.7 Flash, MiniMax M2.5

gatbontonpc and others added 30 commits February 15, 2026 21:08
Add a standalone Python script that simulates a llama-server HTTP endpoint
for testing the eval script. The simulator:

- Implements /v1/chat/completions endpoint with OpenAI-compatible format
- Loads AIME dataset from HuggingFace with local caching
- Uses Levenshtein distance for intelligent question matching
- Supports configurable success rate for correct/wrong answer generation
- Provides debug logging for troubleshooting

Also includes test scripts and documentation for testing and understanding
the simulator functionality.

Extract repeating question string into TEST_QUESTION variable and
create make_request() helper function to reduce code duplication.
Add proper error handling for error responses.

Add summary of llama-server-simulator implementation work including
features, testing results, technical decisions, and refactoring.

- Create new simplified evaluation script focused only on AIME
- Implement EvalState and Processor dataclasses for structured state management
- Add real-time feedback showing correct/incorrect status per case
- Abstract grading interface for external grader support
- Use structured JSON output for eval state
- Apply HuggingFace dataset caching to avoid repeated downloads
- Remove Levenshtein matching - eval script only sends requests and validates answers

- Add Grader class supporting regex and CLI-based grading
- Implement built-in regex patterns for AIME, GSM8K, MMLU, HellaSwag, ARC, WinoGrande
- Add CLI grader interface: python script.py --answer <pred> --expected <gold>
- Add HF telemetry disable to avoid warnings
- Support exact match requirement for regex patterns
- Add 30-second timeout for CLI grader
- Handle both boxed and plain text formats for AIME answers

- Add ThreadPoolExecutor for parallel request processing controlled by --threads
- Add --model argument to specify model name in request data
- Refactor process() to use thread-safe _process_single_case() method
- Update progress tracking to work with concurrent execution

…ter updates

- Add threading support implementation details
- Document ThreadPoolExecutor usage and thread safety
- Add model parameter implementation details
- Include testing results for both features
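
The threading commits above describe a ThreadPoolExecutor with thread-safe per-case processing. A rough sketch of that pattern, assuming nothing about the actual code: `process_all` and its arguments are hypothetical, and `solve` stands in for the HTTP request to llama-server.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_all(cases, solve, n_threads):
    """Run solve(task_id) for every (task_id, expected) pair on a thread pool."""
    lock = threading.Lock()
    stats = {"done": 0, "correct": 0}
    results = {}

    def process_single_case(case):
        task_id, expected = case
        answer = solve(task_id)  # slow part, runs outside the lock
        ok = answer == expected
        with lock:  # shared counters and output are updated by one thread at a time
            stats["done"] += 1
            stats["correct"] += int(ok)
            results[task_id] = ok
            mark = "ok" if ok else "FAIL"
            print(f"{stats['done']:3d}/{len(cases)}  {task_id}  {mark}  "
                  f"[{stats['correct']}/{stats['done']}]")
        return ok

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [pool.submit(process_single_case, c) for c in cases]
        for fut in as_completed(futures):
            fut.result()  # re-raise any exception from a worker
    return stats, results
```

Keeping the request outside the lock lets all threads wait on the server concurrently; only the bookkeeping and the progress line are serialized.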
@strawberrymelonpanda
Contributor

strawberrymelonpanda commented Mar 29, 2026

evaluation of gpt-oss-20b (low) using gpt-oss-20b (medium) as grader

Love the idea, but can you really trust the same 20B model to grade itself?
It's been a while, but my own experiments with LLM grading have never been satisfactory.

I liked that #18892 seemed to be simple pass/fail, unless I've overlooked something.

@ggerganov
Member Author

The script also supports a regex-based grader, as well as a custom grader that runs your own script.

Generally, when using regex grading, I've seen quite a few false negatives, even when using the original gpt-oss sophisticated regexes.

With the current gpt-oss grader I haven't observed false positives yet. Ideally, you would want to use gpt-oss-120b just to make sure, though I think that the task of extracting a number from a paragraph of text should be solvable with gpt-oss-20b quite robustly.

Still, if you spot a failure, please do report it.
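
To illustrate how regex grading produces false negatives, here is a small sketch. The patterns are made up for this example and are neither the script's built-in patterns nor the gpt-oss ones; `extract_answer` and `regex_grade` are hypothetical names.

```python
import re

# Illustrative patterns only: a \boxed{N} answer, or "answer is N" / "answer: N".
BOXED = re.compile(r"\\boxed\{\s*(\d+)\s*\}")
FINAL = re.compile(r"answer\s*(?:is|:)?\s*\$?(\d+)", re.IGNORECASE)


def extract_answer(text):
    """Return the last boxed integer, else the last 'answer is N' integer, else None."""
    for pattern in (BOXED, FINAL):
        matches = pattern.findall(text)
        if matches:
            return matches[-1]
    return None


def regex_grade(text, expected):
    """True only if an integer was extracted and it equals the expected one."""
    got = extract_answer(text)
    return got is not None and int(got) == int(expected)


print(regex_grade("So the total is \\boxed{70}.", "70"))
print(regex_grade("The answer is 070", "70"))
# Correct response, but phrased in a way neither pattern anticipates:
print(regex_grade("Thus the sum equals 70.", "70"))
```

The third response contains the right number but matches neither pattern, so it is graded as wrong. Every new phrasing needs another pattern, which is where an LLM grader can help.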

@strawberrymelonpanda
Contributor

strawberrymelonpanda commented Mar 29, 2026

Though I think that the task of extracting a number from a paragraph of text should be solvable with gpt-oss-20b quite robustly.

Fair. I'd also be curious what the minimum viable model for the task is. i.e., can Qwen 3.5 4B solve it reliably?
Something to tinker with.

I'll certainly pull the branch, but hoping this one makes it to a mainline tool. 😄

@strawberrymelonpanda
Contributor

strawberrymelonpanda commented Mar 29, 2026

The script also supports regex-based grader. Also a custom grader with your own script.
Generally, when using regex grading, I've seen quite a few false-negatives

I wonder if a "hybrid" option could cut down the eval time by sending only the answers that the regex grades as wrong to the LLM grader, as a double-check. Seems like false passes would be rarer than false negatives.

Depending on the task, that might not make a huge difference when pass rates are well below 50%, but just musing.
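
A quick sketch of what such a hybrid grader might look like. Nothing here exists in the script: `regex_pass`, `hybrid_grade`, and the `llm_judge` callable are hypothetical, and the regex only handles `\boxed{N}` answers.

```python
import re


def regex_pass(text, expected):
    """Cheap check: does the last \\boxed{N} in the response equal the expected integer?"""
    matches = re.findall(r"\\boxed\{\s*(\d+)\s*\}", text)
    return bool(matches) and int(matches[-1]) == int(expected)


def hybrid_grade(text, expected, llm_judge):
    """Return (is_correct, used_llm); the judge runs only when the regex does not pass."""
    if regex_pass(text, expected):
        return True, False  # trust regex passes, skip the judge
    return bool(llm_judge(text, expected)), True


# Stand-in judge for the demo; a real one would query the judge server.
def fake_judge(text, expected):
    return str(expected) in text


print(hybrid_grade("So we get \\boxed{70}.", "70", fake_judge))
print(hybrid_grade("Thus the sum equals 70.", "70", fake_judge))
print(hybrid_grade("I could not solve it.", "70", fake_judge))
```

The judge is called only for responses the regex rejects, so the saving is proportional to the regex pass rate. If the regex accepted about 38% of responses, the judge would still see the remaining 62% or so, which matches the caveat about low pass rates.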

Labels

examples python python script changes
