Conversation
Add a standalone Python script that simulates a llama-server HTTP endpoint for testing the eval script. The simulator:
- Implements a /v1/chat/completions endpoint with OpenAI-compatible format
- Loads the AIME dataset from HuggingFace with local caching
- Uses Levenshtein distance for intelligent question matching
- Supports a configurable success rate for correct/wrong answer generation
- Provides debug logging for troubleshooting

Also includes test scripts and documentation for testing and understanding the simulator functionality.
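The question-matching and answer-generation pieces described above can be sketched roughly as follows. This is a minimal illustration, not the simulator's actual code: the function names (`levenshtein`, `match_question`, `make_answer`) and the fallback wrong answer `"000"` are assumptions for the example.

```python
import random

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, O(len(a) * len(b)).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def match_question(incoming: str, dataset: dict) -> str:
    # Pick the dataset question with the smallest edit distance to the
    # incoming prompt, so slightly-reformatted prompts still match.
    return min(dataset, key=lambda q: levenshtein(incoming, q))

def make_answer(gold: str, success_rate: float = 0.8) -> str:
    # Return the correct answer with probability `success_rate`,
    # otherwise a deliberately wrong placeholder (hypothetical choice).
    return gold if random.random() < success_rate else "000"
```

Fuzzy matching avoids having to reproduce the exact prompt template the eval script sends.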
Extract repeating question string into TEST_QUESTION variable and create make_request() helper function to reduce code duplication. Add proper error handling for error responses.
Add summary of llama-server-simulator implementation work including features, testing results, technical decisions, and refactoring.
- Create new simplified evaluation script focused only on AIME
- Implement EvalState and Processor dataclasses for structured state management
- Add real-time feedback showing correct/incorrect status per case
- Abstract grading interface for external grader support
- Use structured JSON output for eval state
- Apply HuggingFace dataset caching to avoid repeated downloads
- Remove Levenshtein matching — eval script only sends requests and validates answers
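A dataclass-based state layout like the one this commit describes might look roughly like the sketch below. The `EvalState` name comes from the commit message; the individual fields and the `EvalCase` helper are assumptions for illustration.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class EvalCase:
    # One question/answer pair plus the model's graded result.
    question: str
    expected: str
    predicted: str = ""
    correct: bool = False

@dataclass
class EvalState:
    # Whole-run state, serializable straight to JSON for resumption
    # and for rendering results (e.g. the HTML report).
    model: str
    cases: list = field(default_factory=list)

    def save(self, path: str) -> None:
        # asdict() recurses into nested dataclasses, so the full
        # state round-trips through plain JSON.
        with open(path, "w") as f:
            json.dump(asdict(self), f, indent=2)
```

Keeping all run state in one serializable object is what makes the structured JSON eval-state output cheap to produce.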
- Add Grader class supporting regex and CLI-based grading
- Implement built-in regex patterns for AIME, GSM8K, MMLU, HellaSwag, ARC, WinoGrande
- Add CLI grader interface: python script.py --answer <pred> --expected <gold>
- Add HF telemetry disable to avoid warnings
- Support exact match requirement for regex patterns
- Add 30-second timeout for CLI grader
- Handle both boxed and plain text formats for AIME answers
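Regex grading that accepts both boxed and plain-text AIME answers could be sketched as below. The patterns and the normalization via `int()` (AIME answers are integers 0–999, so leading zeros must not matter) are assumptions for the example, not the Grader's actual patterns.

```python
import re

# Hypothetical patterns: LaTeX \boxed{...} or a plain "answer is N" phrasing.
AIME_PATTERNS = [
    re.compile(r"\\boxed\{\s*(\d{1,3})\s*\}"),
    re.compile(r"(?:answer is|Answer:)\s*(\d{1,3})", re.IGNORECASE),
]

def grade_aime(prediction: str, expected: str) -> bool:
    # Try each pattern in turn; require an exact numeric match once
    # leading zeros are normalized away by int().
    for pat in AIME_PATTERNS:
        m = pat.search(prediction)
        if m:
            return int(m.group(1)) == int(expected)
    # No extractable answer counts as a failure.
    return False
```

The false negatives mentioned in the conversation typically come from answers phrased in a way no pattern anticipates, which is why a CLI-grader escape hatch is useful.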
- Add ThreadPoolExecutor for parallel request processing controlled by --threads
- Add --model argument to specify model name in request data
- Refactor process() to use thread-safe _process_single_case() method
- Update progress tracking to work with concurrent execution
…ter updates - Add threading support implementation details - Document ThreadPoolExecutor usage and thread safety - Add model parameter implementation details - Include testing results for both features
Love the idea, but can you really trust the same 20B model to grade itself? I liked that #18892 seemed to be simple pass/fail, unless I've overlooked something.
The script also supports a regex-based grader, as well as a custom grader using your own script. Generally, when using regex grading, I've seen quite a few false negatives even with the original, sophisticated gpt-oss regexes. Still, if you spot a failure, please do report it.
Fair. I'd also be curious what the minimum viable model for the task is. i.e., can Qwen 3.5 4B solve it reliably? I'll certainly pull the branch, but hoping this one makes it to a mainline tool. 😄
I wonder if a "hybrid" option could cut down the eval time by only checking the false results, as a double-check. Seems like false passes would be rarer. Depending on the task that might not make a huge difference, when pass rates are well below 50%, but just musing.
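The hybrid idea above — cheap grading first, expensive re-checking only on failures — is easy to sketch. Everything here is hypothetical: `regex_grade` is any fast grader and `llm_grade` any expensive double-checker, passed in as callables.

```python
def hybrid_grade(cases, regex_grade, llm_grade):
    # First pass: cheap regex grading. Second pass: re-check only the
    # apparent failures with the expensive grader, since false passes
    # are assumed to be rarer than false failures.
    results = []
    for prediction, gold in cases:
        ok = regex_grade(prediction, gold)
        if not ok:
            ok = llm_grade(prediction, gold)
        results.append(ok)
    return results
```

When the pass rate is p and the expensive grader costs C per case, this spends roughly (1 - p) * C instead of C per case — a big win at high pass rates, little at low ones, matching the musing above.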
Overview
ref #18195
cont #18892
Adds a lean and mean evaluation tool:
Sample usage:
Sample results:
CLI
HTML: llama-eval-state-aime2025-gpt-oss-120b-high-x4.json.html
Additional information
I've been vibe coding this from time to time using local models and OpenCode. Given that I don't write Python, I would guess the quality of the implementation is quite poor, though I've tried to keep it minimalistic. The current implementation is almost feature complete given what I initially imagined, but I haven't found the time to wrap it up completely yet. If anyone is interested in helping, feel free to PR to this branch.
TODOs:
Requirements