examples: llama evaluation tool for mmlu, aime, gsm8k#18892

Draft
gatbontonpc wants to merge 4 commits into ggml-org:master from gatbontonpc:simple-llama-eval

Conversation

@gatbontonpc
Contributor

Overview

This is a bare-minimum llama evaluation tool written in Python. It runs against an OpenAI-compatible endpoint (e.g. llama-server).

Feedback welcome!

Supported Tests

  • GSM8K — grade-school math (free-form, numeric answers)
  • AIME — competition math (integer answers)
  • MMLU — multi-domain multiple-choice
  • HellaSwag — commonsense reasoning (multiple-choice)
  • ARC — grade-school science (multiple-choice)
  • WinoGrande — commonsense coreference (multiple-choice)

Example output

./llama-server -m model.gguf --port 8033
python examples/llama-eval/llama-eval.py \
  --path_server http://localhost:8033 \
  --prompt_source arc \
  --n_prompts 100
llama-eval duration:           2025.81 s

=== llama-eval suite summary ===
Task                 Acc  Correct    Total  Invalid    Error
-----------------------------------------------------------------
aime               0.067        6       90       14        0
arc                0.950       95      100        0        0
gsm8k              0.350       35      100       10        0
hellaswag          0.720       72      100        0        0
mmlu               0.670       67      100        0        0
winogrande         0.690       69      100        0        0
-----------------------------------------------------------------
ALL                0.583      344      590

Codex-generated summary

This PR introduces llama-eval, a deliberately lean evaluation tool intended to make running common LLM benchmarks against an OpenAI-compatible endpoint (e.g. llama-server) trivial, portable, and easy to reason about.

The design is explicitly motivated by the discussion in #18195, where existing evaluator frameworks (e.g. NVIDIA Evaluator, GPT-OSS evals) were found to be overly complex, operationally heavy, and difficult to run locally, especially on macOS.

llama-eval prioritizes:

  • Zero infrastructure overhead
  • Near-zero dependencies
  • Single-command execution
  • Direct HTTP interaction with the model server

The result is a tool that can be cloned and run immediately, without Docker, schedulers, config files, or external services.

@strawberrymelonpanda
Contributor

Personally I'd really like something like this. As ggerganov implied in #18195, I find pretty much any and all of the evaluation harnesses out there to just be too much effort for too little payoff. They probably work great for researchers.

"Do not include any explanation." might lower scores on models that need a bit of time to "think" out loud?
Is there any randomization of the correct choice?

Can this easily support custom benchmarks? I've written my own evaluation scripts to test new LLMs, but I'd love to scrap them in favor of something simpler like this. If so, I could probably reverse engineer the format from an existing benchmark, but if this goes to production a quick note on writing benchmarks would be welcome.

Code evaluation tests would be nice-to-have, but sadly that's probably when you start getting into Docker containers for safety and/or dependencies for things like Bubblewrap, so multiple-choice probably makes sense to sidestep that complexity.

@jeffbolznv
Contributor

I like this because it gives a more objective quality assessment for evaluation than just "it doesn't become incoherent". It also does some more irregular/batched workloads that we don't hit in llama-bench/llama-cli.

Member

@ggerganov ggerganov left a comment


Thanks for starting this.

My general recommendation is to start things simple and focus on supporting a single eval until we figure out exactly what we want in terms of user experience. This will simplify the review process and from there we can more easily expand with more evals.

As a first objective, we should aim to support AIME2025 as I am most familiar with it.

See my comments below for more info.

Comment on lines +43 to +54

def extract_boxed_text(text: str) -> str:
    pattern = r"boxed{(.*?)}|framebox{(.*?)}"
    matches = re.findall(pattern, text, re.DOTALL)
    logger.debug(matches)
    if matches:
        for match in matches[::-1]:
            for group in match:
                if group != "":
                    return group.split(",")[-1].strip()
    logger.debug("Could not extract boxed text. Maybe expand context window")

Member


We should abstract this to support an external "grader" or "judge" to avoid problems that I described in: https://huggingface.co/openai/gpt-oss-120b/discussions/132

Regexing will never work well enough, but post-processing the final answer with an LLM should be solid.
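An external judge along these lines could be sketched as follows: send the reference answer and the model's raw answer to any OpenAI-compatible endpoint and ask for a one-word verdict. This is only an illustrative sketch, not the PR's code; `JUDGE_TEMPLATE`, the function names, and the endpoint wiring are all assumptions.

```python
# Hypothetical sketch of an LLM-based grader: instead of regexing \boxed{...},
# ask a judge model served on an OpenAI-compatible endpoint whether the
# produced answer matches the reference. All names here are illustrative.
import json
import urllib.request

JUDGE_TEMPLATE = (
    "Reference answer: {expected}\n"
    "Model answer: {produced}\n"
    'Reply with exactly one word: "correct" or "incorrect".'
)


def build_judge_request(expected: str, produced: str, model: str = "judge") -> dict:
    """Build the chat-completions payload sent to the judge endpoint."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": JUDGE_TEMPLATE.format(expected=expected, produced=produced),
            }
        ],
        "temperature": 0.0,  # deterministic verdicts
    }


def parse_judge_reply(reply_text: str) -> bool:
    """Interpret the judge's one-word verdict leniently."""
    text = reply_text.strip().lower()
    return "correct" in text and not text.startswith("incorrect")


def grade_with_judge(endpoint: str, expected: str, produced: str) -> bool:
    """POST to {endpoint}/v1/chat/completions and parse the verdict."""
    payload = json.dumps(build_judge_request(expected, produced)).encode()
    req = urllib.request.Request(
        endpoint.rstrip("/") + "/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return parse_judge_reply(body["choices"][0]["message"]["content"])
```

The same judge endpoint could be a second llama-server instance, so the tool would still need nothing beyond HTTP.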

Comment on lines +556 to +632
def benchmark(
    path_server: str,
    prompt_source: str,
    n_prompts: int,
    n_predict: int,
    rng_seed: int,
    resume_flag: bool,
    checkpoint_file: Path,
    log_level: int,
):
    logger.setLevel(log_level)
    done, errored, checkpoint_results = read_checkpoint(checkpoint_file, resume_flag)

    if not path_server.startswith("http://") and not path_server.startswith("https://"):
        logger.error("ERROR: malformed server path")
        return

    if os.environ.get("LLAMA_ARG_N_PARALLEL") is None:
        logger.info("LLAMA_ARG_N_PARALLEL not explicitly set, using 32")
        os.environ["LLAMA_ARG_N_PARALLEL"] = "32"

    parallel: int = int(os.environ.get("LLAMA_ARG_N_PARALLEL"))  # type: ignore

    task_queue: set[TaskSpec] = set()
    for src in prompt_source.split(","):
        if src == "all":
            for v in TASK_DICT.values():
                task_queue.add(v())
            break
        task_queue.add(TASK_DICT[src]())

    session = None
    try:
        server_address: str = path_server

        adapter = requests.adapters.HTTPAdapter(pool_connections=parallel, pool_maxsize=parallel)  # type: ignore
        session = requests.Session()
        session.mount("http://", adapter)
        session.mount("https://", adapter)
        file_lock = threading.Lock()
        cases: list[Case] = []
        data: list[dict] = []
        for task in task_queue:
            for case in task.iter_cases(n_prompts, rng_seed):
                if case.case_id in done or case.case_id in errored:
                    logger.debug(f"Skipping case_id {case.case_id} from checkpoint")
                    continue

                cases.append(case)
                data.append(
                    {
                        "prompt_source": prompt_source,
                        "session": session,
                        "server_address": server_address,
                        "n_predict": n_predict,
                        "file_lock": file_lock,
                        "checkpoint_file": checkpoint_file,
                    }
                )
        logger.info("Starting the benchmark...\n")
        t0 = time()
        results: list[dict[str, Union[str, int]]] = thread_map(
            send_prompt,
            cases,
            data,
            max_workers=parallel,
            chunksize=1,
        )
    finally:
        if session is not None:
            session.close()

    t1 = time()
    logger.info(f"\nllama-eval duration: {t1-t0:.2f} s")
    results.extend(checkpoint_results)
    pertask_results = aggregate_by_task(results)
    print_summary(pertask_results)
Member


We need a way to see the current success rate in real-time and not wait for the entire eval to finish. By default, we should see "correct / not correct" for each task. With extra verbosity, we should also see produced answer vs expected answer for each task as soon as it completes.
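A minimal way to do this would be a small thread-safe tracker that each worker reports into as soon as a case finishes, printing per-task running accuracy. This is a sketch of one possible shape, not the PR's code; the class and method names are assumptions.

```python
# Hypothetical sketch of real-time reporting: print a "correct / not correct"
# line with the running per-task accuracy as each case completes, and at
# higher verbosity also show produced vs expected. Names are illustrative.
import threading
from collections import defaultdict


class LiveTracker:
    def __init__(self):
        self._lock = threading.Lock()
        self.correct = defaultdict(int)
        self.total = defaultdict(int)

    def report(self, task: str, ok: bool, produced: str = "", expected: str = "") -> str:
        """Record one finished case and return the line to log immediately."""
        with self._lock:
            self.total[task] += 1
            if ok:
                self.correct[task] += 1
            acc = self.correct[task] / self.total[task]
            line = (
                f"[{task}] {'correct' if ok else 'not correct'} "
                f"({self.correct[task]}/{self.total[task]}, acc={acc:.3f})"
            )
        if produced or expected:  # extra verbosity
            line += f" produced={produced!r} expected={expected!r}"
        return line
```

Each worker in the `thread_map` pool would call `tracker.report(...)` right after grading and log the returned line, so the console shows accuracy converging while the run is still in flight.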

Member

@ggerganov ggerganov left a comment


To wrap up, my feedback for now is to:

  • Simplify as much as possible and focus on one eval
  • Implement an "eval state" object:
    • ID
    • list of tasks
    • task states
    • sampling config
  • Implement a "processor" object
    • list of endpoints
    • threads per endpoints
    • grade/judge type (regex, endpoint or cli tool)
  • The processor accepts an eval state and starts processing it. It will be responsible for dumping the eval state from time to time as it progresses
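The split described above could be sketched roughly as follows. This is only an illustration of the proposed shape; the field names, defaults, and the trivial `process` body are assumptions, not an agreed interface.

```python
# Hypothetical sketch of the suggested "eval state" / "processor" split.
# The eval state describes what to run and how far along it is; the
# processor describes where and how to run it. Names are illustrative.
from dataclasses import dataclass, field
from enum import Enum


class TaskState(str, Enum):
    PENDING = "pending"
    DONE = "done"
    ERROR = "error"


@dataclass
class EvalState:
    eval_id: str
    tasks: list                      # e.g. ["aime2025/0", "aime2025/1", ...]
    task_states: dict = field(default_factory=dict)
    sampling: dict = field(default_factory=lambda: {"temperature": 0.0, "n_predict": 2048})


@dataclass
class Processor:
    endpoints: list                  # OpenAI-compatible server URLs
    threads_per_endpoint: int = 8
    grader: str = "regex"            # "regex", "endpoint", or a CLI tool

    def process(self, state: EvalState) -> EvalState:
        # Placeholder loop: mark everything done. A real processor would
        # send prompts, grade answers, and periodically dump `state` to
        # disk so an interrupted run can resume.
        for t in state.tasks:
            state.task_states[t] = TaskState.DONE
        return state
```

Because the state is a plain serializable object, periodic checkpointing becomes "dump the eval state", and resuming becomes "load it and skip tasks that are not `PENDING`".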

Comment on lines +23 to +26
MATH_TEMPLATE = """
{question}
Do not include any explanation. Put your final answer within \\boxed{{}}.
"""
Contributor


The OAI API supports structured outputs, e.g. JSON. I think it makes more sense to use that for extracting the answer than something like this. In any case, you should be aware that in my testing it is much harder for models to immediately output the correct answer than to first let them reason about it. The approach I took was to let the model answer normally first and then prompt it for a final answer using a structured output.
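The two-phase approach could look roughly like this: a first request with the plain question, then a follow-up that appends the model's free-form reply and constrains the final answer with the OpenAI-compatible `response_format` field. The schema name and follow-up wording below are illustrative assumptions, not the PR's code.

```python
# Hypothetical sketch: after the model answers freely, send a second request
# that asks for just the final answer, constrained to a JSON object via
# `response_format` (json_schema). Schema name and wording are illustrative.
ANSWER_SCHEMA = {
    "type": "json_schema",
    "json_schema": {
        "name": "final_answer",
        "schema": {
            "type": "object",
            "properties": {"answer": {"type": "string"}},
            "required": ["answer"],
        },
    },
}


def build_followup_request(question: str, reasoning_reply: str, model: str = "model") -> dict:
    """Second chat-completions payload: extract the final answer as JSON."""
    return {
        "model": model,
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": reasoning_reply},  # free-form reasoning
            {"role": "user", "content": "Now give only your final answer."},
        ],
        "response_format": ANSWER_SCHEMA,
    }
```

The grader then only has to `json.loads` the second reply and read `["answer"]`, instead of regexing `\boxed{}` out of free text.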

@ggerganov ggerganov mentioned this pull request Mar 29, 2026

Labels

examples python python script changes
