examples: llama evaluation tool for mmlu, aime, gsm8k #18892
gatbontonpc wants to merge 4 commits into ggml-org:master
Conversation
Personally I'd really like something like this. As ggerganov implied in #18195, I find pretty much any and all of the evaluation harnesses out there to be too much effort for too little payoff; they probably work great for researchers. One concern: "Do not include any explanation." might lower scores on models that need a bit of time to "think" out loud. Can this easily support custom benchmarks? I've written my own evaluation scripts to test new LLMs, but I'd love to scrap them in favor of something simpler like this. If so, I could probably reverse-engineer the format from an existing benchmark, but if this goes to production a quick note on writing benchmarks would be welcome. Code evaluation tests would be nice to have, but that's probably when you start getting into Docker containers for safety and/or dependencies like Bubblewrap, so multiple-choice probably makes sense to sidestep that complexity.
I like this because it gives a more objective quality assessment than just "it doesn't become incoherent". It also exercises some more irregular/batched workloads that we don't hit in llama-bench/llama-cli.
ggerganov left a comment:
Thanks for starting this.
My general recommendation is to start simple and focus on supporting a single eval until we figure out exactly what we want in terms of user experience. This will simplify the review process, and from there we can more easily expand with more evals.
As a first objective, we should aim to support AIME2025, as I am most familiar with it.
See my comments below for more info.
```python
def extract_boxed_text(text: str) -> str:
    pattern = r"boxed{(.*?)}|framebox{(.*?)}"
    matches = re.findall(pattern, text, re.DOTALL)
    logger.debug(matches)
    if matches:
        for match in matches[::-1]:
            for group in match:
                if group != "":
                    return group.split(",")[-1].strip()
    logger.debug("Could not extract boxed text. Maybe expand context window")
```
We should abstract this to support an external "grader" or "judge" to avoid the problems that I described in: https://huggingface.co/openai/gpt-oss-120b/discussions/132
Regexing will never work well enough, but post-processing the final answer with an LLM should be solid.
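As a rough illustration of what such a grader hook could look like (the helper name, prompt wording, and payload fields below are assumptions, not part of this PR), the raw completion could be post-processed with a second request to the same OpenAI-compatible endpoint:

```python
# Hypothetical sketch of an LLM-based answer extractor / grader.
# All names and the prompt wording are illustrative, not part of this PR.
import requests

def llm_extract_final_answer(session: requests.Session, server_address: str, raw_answer: str) -> str:
    """Ask the model to reduce its own free-form answer to a single final value."""
    payload = {
        "messages": [
            {"role": "system", "content": "Extract the final answer from the text below. Reply with the answer only."},
            {"role": "user", "content": raw_answer},
        ],
        "temperature": 0.0,
    }
    resp = session.post(f"{server_address}/v1/chat/completions", json=payload, timeout=300)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"].strip()
```

The regex could stay as a cheap first pass, with the judge only used when the regex finds nothing.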
```python
def benchmark(
    path_server: str,
    prompt_source: str,
    n_prompts: int,
    n_predict: int,
    rng_seed: int,
    resume_flag: bool,
    checkpoint_file: Path,
    log_level: int,
):
    logger.setLevel(log_level)
    done, errored, checkpoint_results = read_checkpoint(checkpoint_file, resume_flag)

    if not path_server.startswith("http://") and not path_server.startswith("https://"):
        logger.error("ERROR: malformed server path")
        return

    if os.environ.get("LLAMA_ARG_N_PARALLEL") is None:
        logger.info("LLAMA_ARG_N_PARALLEL not explicitly set, using 32")
        os.environ["LLAMA_ARG_N_PARALLEL"] = "32"

    parallel: int = int(os.environ.get("LLAMA_ARG_N_PARALLEL"))  # type: ignore

    task_queue: set[TaskSpec] = set()
    for src in prompt_source.split(","):
        if src == "all":
            for v in TASK_DICT.values():
                task_queue.add(v())
            break
        task_queue.add(TASK_DICT[src]())

    session = None
    try:
        server_address: str = path_server

        adapter = requests.adapters.HTTPAdapter(pool_connections=parallel, pool_maxsize=parallel)  # type: ignore
        session = requests.Session()
        session.mount("http://", adapter)
        session.mount("https://", adapter)
        file_lock = threading.Lock()
        cases: list[Case] = []
        data: list[dict] = []
        for task in task_queue:
            for case in task.iter_cases(n_prompts, rng_seed):
                if case.case_id in done or case.case_id in errored:
                    logger.debug(f"Skipping case_id {case.case_id} from checkpoint")
                    continue

                cases.append(case)
                data.append(
                    {
                        "prompt_source": prompt_source,
                        "session": session,
                        "server_address": server_address,
                        "n_predict": n_predict,
                        "file_lock": file_lock,
                        "checkpoint_file": checkpoint_file,
                    }
                )
        logger.info("Starting the benchmark...\n")
        t0 = time()
        results: list[dict[str, Union[str, int]]] = thread_map(
            send_prompt,
            cases,
            data,
            max_workers=parallel,
            chunksize=1,
        )
    finally:
        if session is not None:
            session.close()

    t1 = time()
    logger.info(f"\nllama-eval duration: {t1-t0:.2f} s")
    results.extend(checkpoint_results)
    pertask_results = aggregate_by_task(results)
    print_summary(pertask_results)
```
We need a way to see the current success rate in real-time and not wait for the entire eval to finish. By default, we should see "correct / not correct" for each task. With extra verbosity, we should also see produced answer vs expected answer for each task as soon as it completes.
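A minimal sketch of how that could be wired in, assuming a hypothetical grade_case() helper and case fields such as expected and task_name (none of which are in this PR):

```python
# Hypothetical wrapper around the existing send_prompt worker that grades and
# logs each case as soon as it completes, instead of only in the final summary.
def send_prompt_and_report(case, data):
    result = send_prompt(case, data)                    # existing worker function
    produced = str(result.get("answer", ""))
    correct = grade_case(produced, case.expected)       # hypothetical grader
    logger.info(f"[{case.task_name}] {case.case_id}: {'correct' if correct else 'incorrect'}")
    logger.debug(f"  produced={produced!r} expected={case.expected!r}")
    result["correct"] = int(correct)
    return result

# Passed to thread_map in place of send_prompt:
#   results = thread_map(send_prompt_and_report, cases, data, max_workers=parallel, chunksize=1)
```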
ggerganov left a comment:
So, to wrap up, my feedback for now is to:
- Simplify as much as possible and focus on one eval
- Implement an "eval state" object:
  - ID
  - list of tasks
  - task states
  - sampling config
- Implement a "processor" object:
  - list of endpoints
  - threads per endpoint
  - grade/judge type (regex, endpoint or CLI tool)
- The processor accepts an eval state and starts processing it. It will be responsible for dumping the eval state from time to time as it progresses (a rough sketch of these two objects follows below).
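A minimal sketch of those two objects, with all names, fields, and defaults being assumptions rather than a prescribed design:

```python
# Rough sketch of the suggested "eval state" and "processor" objects.
# All names, fields and defaults are illustrative assumptions.
from dataclasses import dataclass, field
from enum import Enum

class TaskState(Enum):
    PENDING = "pending"
    DONE = "done"
    ERRORED = "errored"

@dataclass
class EvalState:
    eval_id: str
    tasks: list[str] = field(default_factory=list)                  # e.g. ["aime2025"]
    task_states: dict[str, TaskState] = field(default_factory=dict)
    sampling: dict = field(default_factory=dict)                    # temperature, top_p, ...

@dataclass
class Processor:
    endpoints: list[str]                                            # OpenAI-compatible server URLs
    threads_per_endpoint: int = 8
    judge: str = "regex"                                            # "regex", "endpoint" or "cli"

    def run(self, state: EvalState, checkpoint_path: str) -> None:
        # Process the tasks in `state` and periodically dump it to `checkpoint_path`.
        ...
```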
```python
MATH_TEMPLATE = """
{question}
Do not include any explanation. Put your final answer within \\boxed{{}}.
"""
```
The OAI API supports outputs structured as e.g. JSON. I think it makes more sense to use that for extracting the answer rather than something like this. In any case, you should be aware that in my testing it is much harder for models to immediately output the correct answer compared to first letting them reason about it. The approach that I took was to let the model first answer normally and then prompt it for a final answer using a structured output.
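A hedged sketch of that two-step flow against an OpenAI-compatible /v1/chat/completions endpoint; the exact response_format payload accepted by the server is an assumption and may need adjusting:

```python
# Hypothetical two-step flow: free-form reasoning first, then a schema-constrained final answer.
# The response_format payload is an assumption about the server's OpenAI-compatible API.
import json
import requests

def ask_with_structured_final_answer(session: requests.Session, server: str, question: str) -> str:
    # Step 1: let the model reason about the problem normally.
    first = session.post(f"{server}/v1/chat/completions", json={
        "messages": [{"role": "user", "content": question}],
    })
    first.raise_for_status()
    reasoning = first.json()["choices"][0]["message"]["content"]

    # Step 2: ask for the final answer only, constrained to a tiny JSON schema.
    schema = {"type": "object", "properties": {"answer": {"type": "string"}}, "required": ["answer"]}
    second = session.post(f"{server}/v1/chat/completions", json={
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": reasoning},
            {"role": "user", "content": "Now give only your final answer."},
        ],
        "response_format": {"type": "json_schema", "json_schema": {"name": "final_answer", "schema": schema}},
    })
    second.raise_for_status()
    return json.loads(second.json()["choices"][0]["message"]["content"])["answer"]
```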
Overview
This is a bare-minimum implementation of a llama evaluation tool written in Python. It runs against an OpenAI-compatible endpoint (e.g. llama-server).
Feedback welcome!
Supported Tests
MMLU, AIME, GSM8K
Example output
Codex-generated summary
This PR introduces llama-eval, a deliberately lean evaluation tool intended to make running common LLM benchmarks against an OpenAI-compatible endpoint (e.g. llama-server) trivial, portable, and easy to reason about.
The design is explicitly motivated by the discussion in #18195, where existing evaluator frameworks (e.g. NVIDIA Evaluator, GPT-OSS evals) were found to be overly complex, operationally heavy, and difficult to run locally—especially on macOS.
llama-eval prioritizes simplicity and portability: the result is a tool that can be cloned and run immediately, without Docker, schedulers, config files, or external services.