Conversation
Add a standalone Python script that simulates a llama-server HTTP endpoint for testing the eval script. The simulator:
- Implements a /v1/chat/completions endpoint with OpenAI-compatible format
- Loads the AIME dataset from HuggingFace with local caching
- Uses Levenshtein distance for intelligent question matching
- Supports a configurable success rate for correct/wrong answer generation
- Provides debug logging for troubleshooting

Also includes test scripts and documentation for testing and understanding the simulator functionality.
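The question-matching and answer-generation pieces described above can be sketched roughly as follows. This is a minimal illustration, not the simulator's actual code: the function names (`levenshtein`, `match_question`, `make_answer`) and the fallback wrong answer `"000"` are assumptions for the example.

```python
import random

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, O(len(a) * len(b)).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def match_question(incoming: str, dataset: dict) -> str:
    # Pick the dataset question with the smallest edit distance to the
    # incoming prompt, so slightly-reformatted prompts still match.
    return min(dataset, key=lambda q: levenshtein(incoming, q))

def make_answer(gold: str, success_rate: float = 0.8) -> str:
    # Return the correct answer with probability `success_rate`,
    # otherwise a deliberately wrong placeholder (hypothetical choice).
    return gold if random.random() < success_rate else "000"
```

Fuzzy matching avoids having to reproduce the exact prompt template the eval script sends.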
Extract repeating question string into TEST_QUESTION variable and create make_request() helper function to reduce code duplication. Add proper error handling for error responses.
Add summary of llama-server-simulator implementation work including features, testing results, technical decisions, and refactoring.
- Create new simplified evaluation script focused only on AIME
- Implement EvalState and Processor dataclasses for structured state management
- Add real-time feedback showing correct/incorrect status per case
- Abstract grading interface for external grader support
- Use structured JSON output for eval state
- Apply HuggingFace dataset caching to avoid repeated downloads
- Remove Levenshtein matching — eval script only sends requests and validates answers
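A dataclass-based state layout like the one this commit describes might look roughly like the sketch below. The `EvalState` name comes from the commit message; the individual fields and the `EvalCase` helper are assumptions for illustration.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class EvalCase:
    # One question/answer pair plus the model's graded result.
    question: str
    expected: str
    predicted: str = ""
    correct: bool = False

@dataclass
class EvalState:
    # Whole-run state, serializable straight to JSON for resumption
    # and for rendering results (e.g. the HTML report).
    model: str
    cases: list = field(default_factory=list)

    def save(self, path: str) -> None:
        # asdict() recurses into nested dataclasses, so the full
        # state round-trips through plain JSON.
        with open(path, "w") as f:
            json.dump(asdict(self), f, indent=2)
```

Keeping all run state in one serializable object is what makes the structured JSON eval-state output cheap to produce.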
- Add Grader class supporting regex and CLI-based grading
- Implement built-in regex patterns for AIME, GSM8K, MMLU, HellaSwag, ARC, WinoGrande
- Add CLI grader interface: python script.py --answer <pred> --expected <gold>
- Add HF telemetry disable to avoid warnings
- Support exact match requirement for regex patterns
- Add 30-second timeout for CLI grader
- Handle both boxed and plain text formats for AIME answers
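Regex grading that accepts both boxed and plain-text AIME answers could be sketched as below. The patterns and the normalization via `int()` (AIME answers are integers 0–999, so leading zeros must not matter) are assumptions for the example, not the Grader's actual patterns.

```python
import re

# Hypothetical patterns: LaTeX \boxed{...} or a plain "answer is N" phrasing.
AIME_PATTERNS = [
    re.compile(r"\\boxed\{\s*(\d{1,3})\s*\}"),
    re.compile(r"(?:answer is|Answer:)\s*(\d{1,3})", re.IGNORECASE),
]

def grade_aime(prediction: str, expected: str) -> bool:
    # Try each pattern in turn; require an exact numeric match once
    # leading zeros are normalized away by int().
    for pat in AIME_PATTERNS:
        m = pat.search(prediction)
        if m:
            return int(m.group(1)) == int(expected)
    # No extractable answer counts as a failure.
    return False
```

The false negatives mentioned in the conversation typically come from answers phrased in a way no pattern anticipates, which is why a CLI-grader escape hatch is useful.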
- Add ThreadPoolExecutor for parallel request processing controlled by --threads
- Add --model argument to specify model name in request data
- Refactor process() to use thread-safe _process_single_case() method
- Update progress tracking to work with concurrent execution
…ter updates - Add threading support implementation details - Document ThreadPoolExecutor usage and thread safety - Add model parameter implementation details - Include testing results for both features
Love the idea, but can you really trust the same 20B model to grade itself? I liked that #18892 seemed to be simple pass/fail, unless I've overlooked something.
The script also supports a regex-based grader, as well as a custom grader using your own script. Generally, when using regex grading, I've seen quite a few false negatives even with the original, sophisticated gpt-oss regexes. Still, if you spot a failure, please do report it.
Fair. I'd also be curious what the minimum viable model for the task is. i.e., can Qwen 3.5 4B solve it reliably? I'll certainly pull the branch, but hoping this one makes it to a mainline tool. 😄
I wonder if a "hybrid" option could cut down the eval time by only checking the false results, as a double-check. Seems like false passes would be rarer. Depending on the task that might not make a huge difference, when pass rates are well below 50%, but just musing.
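The hybrid idea above — cheap grading first, expensive re-checking only on failures — is easy to sketch. Everything here is hypothetical: `regex_grade` is any fast grader and `llm_grade` any expensive double-checker, passed in as callables.

```python
def hybrid_grade(cases, regex_grade, llm_grade):
    # First pass: cheap regex grading. Second pass: re-check only the
    # apparent failures with the expensive grader, since false passes
    # are assumed to be rarer than false failures.
    results = []
    for prediction, gold in cases:
        ok = regex_grade(prediction, gold)
        if not ok:
            ok = llm_grade(prediction, gold)
        results.append(ok)
    return results
```

When the pass rate is p and the expensive grader costs C per case, this spends roughly (1 - p) * C instead of C per case — a big win at high pass rates, little at low ones, matching the musing above.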
Overview
ref #18195
cont #18892
Adds a lean and mean evaluation tool:
Sample usage:
Sample results:
CLI
HTML: llama-eval-state-aime2025-gpt-oss-120b-high-x4.json.html
Additional information
I've been vibe coding this from time to time using local models and OpenCode. Given that I don't write Python, I would guess the quality of the implementation is quite poor, though I've tried to keep it minimalistic. The current implementation is almost feature complete given what I initially imagined, but I haven't found the time to wrap it up completely yet. If anyone is interested in helping, feel free to PR to this branch.
TODOs:
Requirements