examples: llama evaluation tool for mmlu, aime, gsm8k#18892

Draft
gatbontonpc wants to merge 4 commits into ggml-org:master from gatbontonpc:simple-llama-eval

Conversation

@gatbontonpc
Contributor

Overview

This is a bare-minimum llama evaluation tool written in Python. It runs against an OpenAI-compatible endpoint (e.g. llama-server).

Feedback welcome!

Supported Tests

  • GSM8K — grade-school math (free-form, numeric answers)
  • AIME — competition math (integer answers)
  • MMLU — multi-domain multiple-choice
  • HellaSwag — commonsense reasoning (multiple-choice)
  • ARC — grade-school science (multiple-choice)
  • WinoGrande — commonsense coreference (multiple-choice)

Example output

./llama-server -m model.gguf --port 8033
python examples/llama-eval/llama-eval.py \
  --path_server http://localhost:8033 \
  --prompt_source arc \
  --n_prompts 100
llama-eval duration:           2025.81 s

=== llama-eval suite summary ===
Task                 Acc  Correct    Total  Invalid    Error
-----------------------------------------------------------------
aime               0.067        6       90       14        0
arc                0.950       95      100        0        0
gsm8k              0.350       35      100       10        0
hellaswag          0.720       72      100        0        0
mmlu               0.670       67      100        0        0
winogrande         0.690       69      100        0        0
-----------------------------------------------------------------
ALL                0.583      344      590

Codex-generated summary

This PR introduces llama-eval, a deliberately lean evaluation tool intended to make running common LLM benchmarks against an OpenAI-compatible endpoint (e.g. llama-server) trivial, portable, and easy to reason about.

The design is explicitly motivated by the discussion in #18195, where existing evaluator frameworks (e.g. NVIDIA Evaluator, GPT-OSS evals) were found to be overly complex, operationally heavy, and difficult to run locally, especially on macOS.

llama-eval prioritizes:

  • Zero infrastructure overhead
  • Near-zero dependencies
  • Single-command execution
  • Direct HTTP interaction with the model server

The result is a tool that can be cloned and run immediately, without Docker, schedulers, config files, or external services.

@strawberrymelonpanda
Contributor

Personally I'd really like something like this. As ggerganov implied in #18195, I find pretty much any and all of the evaluation harnesses out there to just be too much effort for too little payoff. They probably work great for researchers.

"Do not include any explanation." might lower scores on models that need a bit of time to "think" out loud?
Is there any randomization of the correct choice?

Can this easily support custom benchmarks? I've written my own evaluation scripts to test new LLMs, but I'd love to scrap them in favor of something simpler like this. If so, I could probably reverse engineer the format from an existing benchmark, but if this goes to production a quick note on writing benchmarks would be welcome.

Code evaluation tests would be nice-to-have, but sadly that's probably when you start getting into Docker containers for safety and/or dependencies for things like Bubblewrap, so multiple-choice probably makes sense to sidestep that complexity.

@jeffbolznv
Contributor

I like this because it gives a more objective quality assessment for evaluation than just "it doesn't become incoherent". It also does some more irregular/batched workloads that we don't hit in llama-bench/llama-cli.

Member

@ggerganov ggerganov left a comment


Thanks for starting this.

My general recommendation is to start things simple and focus on supporting a single eval until we figure out exactly what we want in terms of user experience. This will simplify the review process and from there we can more easily expand with more evals.

As a first objective, we should aim to support AIME2025 as I am most familiar with it.

See my comments below for more info.

Comment on lines +43 to +54

def extract_boxed_text(text: str) -> str:
    pattern = r"boxed{(.*?)}|framebox{(.*?)}"
    matches = re.findall(pattern, text, re.DOTALL)
    logger.debug(matches)
    if matches:
        for match in matches[::-1]:
            for group in match:
                if group != "":
                    return group.split(",")[-1].strip()
    logger.debug("Could not extract boxed text. Maybe expand context window")

Member


We should abstract this to support an external "grader" or "judge" to avoid problems that I described in: https://huggingface.co/openai/gpt-oss-120b/discussions/132

Regexing will never work well enough, but post-processing the final answer with an LLM should be solid.
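An external judge along these lines could be sketched as follows: send the reference answer and the model's raw answer to any OpenAI-compatible endpoint and ask for a one-word verdict. This is only an illustrative sketch, not the PR's code; `JUDGE_TEMPLATE`, the function names, and the endpoint wiring are all assumptions.

```python
# Hypothetical sketch of an LLM-based grader: instead of regexing \boxed{...},
# ask a judge model served on an OpenAI-compatible endpoint whether the
# produced answer matches the reference. All names here are illustrative.
import json
import urllib.request

JUDGE_TEMPLATE = (
    "Reference answer: {expected}\n"
    "Model answer: {produced}\n"
    'Reply with exactly one word: "correct" or "incorrect".'
)


def build_judge_request(expected: str, produced: str, model: str = "judge") -> dict:
    """Build the chat-completions payload sent to the judge endpoint."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": JUDGE_TEMPLATE.format(expected=expected, produced=produced),
            }
        ],
        "temperature": 0.0,  # deterministic verdicts
    }


def parse_judge_reply(reply_text: str) -> bool:
    """Interpret the judge's one-word verdict leniently."""
    text = reply_text.strip().lower()
    return "correct" in text and not text.startswith("incorrect")


def grade_with_judge(endpoint: str, expected: str, produced: str) -> bool:
    """POST to {endpoint}/v1/chat/completions and parse the verdict."""
    payload = json.dumps(build_judge_request(expected, produced)).encode()
    req = urllib.request.Request(
        endpoint.rstrip("/") + "/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return parse_judge_reply(body["choices"][0]["message"]["content"])
```

The same judge endpoint could be a second llama-server instance, so the tool would still need nothing beyond HTTP.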

Comment on lines +556 to +632
def benchmark(
    path_server: str,
    prompt_source: str,
    n_prompts: int,
    n_predict: int,
    rng_seed: int,
    resume_flag: bool,
    checkpoint_file: Path,
    log_level: int,
):
    logger.setLevel(log_level)
    done, errored, checkpoint_results = read_checkpoint(checkpoint_file, resume_flag)

    if not path_server.startswith("http://") and not path_server.startswith("https://"):
        logger.error("ERROR: malformed server path")
        return

    if os.environ.get("LLAMA_ARG_N_PARALLEL") is None:
        logger.info("LLAMA_ARG_N_PARALLEL not explicitly set, using 32")
        os.environ["LLAMA_ARG_N_PARALLEL"] = "32"

    parallel: int = int(os.environ.get("LLAMA_ARG_N_PARALLEL"))  # type: ignore

    task_queue: set[TaskSpec] = set()
    for src in prompt_source.split(","):
        if src == "all":
            for v in TASK_DICT.values():
                task_queue.add(v())
            break
        task_queue.add(TASK_DICT[src]())

    session = None
    try:
        server_address: str = path_server

        adapter = requests.adapters.HTTPAdapter(pool_connections=parallel, pool_maxsize=parallel)  # type: ignore
        session = requests.Session()
        session.mount("http://", adapter)
        session.mount("https://", adapter)
        file_lock = threading.Lock()
        cases: list[Case] = []
        data: list[dict] = []
        for task in task_queue:
            for case in task.iter_cases(n_prompts, rng_seed):
                if case.case_id in done or case.case_id in errored:
                    logger.debug(f"Skipping case_id {case.case_id} from checkpoint")
                    continue

                cases.append(case)
                data.append(
                    {
                        "prompt_source": prompt_source,
                        "session": session,
                        "server_address": server_address,
                        "n_predict": n_predict,
                        "file_lock": file_lock,
                        "checkpoint_file": checkpoint_file,
                    }
                )
        logger.info("Starting the benchmark...\n")
        t0 = time()
        results: list[dict[str, Union[str, int]]] = thread_map(
            send_prompt,
            cases,
            data,
            max_workers=parallel,
            chunksize=1,
        )
    finally:
        if session is not None:
            session.close()

    t1 = time()
    logger.info(f"\nllama-eval duration: {t1-t0:.2f} s")
    results.extend(checkpoint_results)
    pertask_results = aggregate_by_task(results)
    print_summary(pertask_results)
Member


We need a way to see the current success rate in real-time and not wait for the entire eval to finish. By default, we should see "correct / not correct" for each task. With extra verbosity, we should also see produced answer vs expected answer for each task as soon as it completes.
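A minimal way to do this would be a small thread-safe tracker that each worker reports into as soon as a case finishes, printing per-task running accuracy. This is a sketch of one possible shape, not the PR's code; the class and method names are assumptions.

```python
# Hypothetical sketch of real-time reporting: print a "correct / not correct"
# line with the running per-task accuracy as each case completes, and at
# higher verbosity also show produced vs expected. Names are illustrative.
import threading
from collections import defaultdict


class LiveTracker:
    def __init__(self):
        self._lock = threading.Lock()
        self.correct = defaultdict(int)
        self.total = defaultdict(int)

    def report(self, task: str, ok: bool, produced: str = "", expected: str = "") -> str:
        """Record one finished case and return the line to log immediately."""
        with self._lock:
            self.total[task] += 1
            if ok:
                self.correct[task] += 1
            acc = self.correct[task] / self.total[task]
            line = (
                f"[{task}] {'correct' if ok else 'not correct'} "
                f"({self.correct[task]}/{self.total[task]}, acc={acc:.3f})"
            )
        if produced or expected:  # extra verbosity
            line += f" produced={produced!r} expected={expected!r}"
        return line
```

Each worker in the `thread_map` pool would call `tracker.report(...)` right after grading and log the returned line, so the console shows accuracy converging while the run is still in flight.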

Member

@ggerganov ggerganov left a comment


To wrap up, my feedback for now is to:

  • Simplify as much as possible and focus on one eval
  • Implement an "eval state" object:
    • ID
    • list of tasks
    • task states
    • sampling config
  • Implement a "processor" object
    • list of endpoints
    • threads per endpoints
    • grade/judge type (regex, endpoint or cli tool)
  • The processor accepts an eval state and starts processing it. It will be responsible for dumping the eval state from time to time as it progresses
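The split described above could be sketched roughly as follows. This is only an illustration of the proposed shape; the field names, defaults, and the trivial `process` body are assumptions, not an agreed interface.

```python
# Hypothetical sketch of the suggested "eval state" / "processor" split.
# The eval state describes what to run and how far along it is; the
# processor describes where and how to run it. Names are illustrative.
from dataclasses import dataclass, field
from enum import Enum


class TaskState(str, Enum):
    PENDING = "pending"
    DONE = "done"
    ERROR = "error"


@dataclass
class EvalState:
    eval_id: str
    tasks: list                      # e.g. ["aime2025/0", "aime2025/1", ...]
    task_states: dict = field(default_factory=dict)
    sampling: dict = field(default_factory=lambda: {"temperature": 0.0, "n_predict": 2048})


@dataclass
class Processor:
    endpoints: list                  # OpenAI-compatible server URLs
    threads_per_endpoint: int = 8
    grader: str = "regex"            # "regex", "endpoint", or a CLI tool

    def process(self, state: EvalState) -> EvalState:
        # Placeholder loop: mark everything done. A real processor would
        # send prompts, grade answers, and periodically dump `state` to
        # disk so an interrupted run can resume.
        for t in state.tasks:
            state.task_states[t] = TaskState.DONE
        return state
```

Because the state is a plain serializable object, periodic checkpointing becomes "dump the eval state", and resuming becomes "load it and skip tasks that are not `PENDING`".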

Comment on lines +23 to +26
MATH_TEMPLATE = """
{question}
Do not include any explanation. Put your final answer within \\boxed{{}}.
"""
Contributor


The OAI API supports structured outputs, e.g. JSON. I think it makes more sense to use that for extracting the answer than something like this. In any case, you should be aware that in my testing it is much harder for models to immediately output the correct answer than to first let them reason about it. The approach I took was to let the model answer normally first and then prompt it for a final answer using a structured output.
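The two-phase approach could look roughly like this: a first request with the plain question, then a follow-up that appends the model's free-form reply and constrains the final answer with the OpenAI-compatible `response_format` field. The schema name and follow-up wording below are illustrative assumptions, not the PR's code.

```python
# Hypothetical sketch: after the model answers freely, send a second request
# that asks for just the final answer, constrained to a JSON object via
# `response_format` (json_schema). Schema name and wording are illustrative.
ANSWER_SCHEMA = {
    "type": "json_schema",
    "json_schema": {
        "name": "final_answer",
        "schema": {
            "type": "object",
            "properties": {"answer": {"type": "string"}},
            "required": ["answer"],
        },
    },
}


def build_followup_request(question: str, reasoning_reply: str, model: str = "model") -> dict:
    """Second chat-completions payload: extract the final answer as JSON."""
    return {
        "model": model,
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": reasoning_reply},  # free-form reasoning
            {"role": "user", "content": "Now give only your final answer."},
        ],
        "response_format": ANSWER_SCHEMA,
    }
```

The grader then only has to `json.loads` the second reply and read `["answer"]`, instead of regexing `\boxed{}` out of free text.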

@ggerganov ggerganov mentioned this pull request Mar 29, 2026

Labels

examples python python script changes
