Nerve's evaluation mode is a strategic feature designed to make benchmarking and validating agents easy, reproducible, and formalized.
⚡ Unlike most tools in the LLM ecosystem, Nerve offers a built-in framework to test agents against structured cases, log results, and compare performance across models. It introduces a standard formalism for agent evaluation that does not exist elsewhere.
Evaluation mode is useful for:
- Verifying agent correctness during development
- Regression testing when updating prompts, tools, or models
- Comparing different model backends
- Collecting structured performance metrics
You run evaluations using:
```sh
nerve eval path/to/evaluation --output results.json
```

Each case is passed to the agent, and results (e.g., completion, duration, output) are saved.
Evaluation runs can automatically generate trace files for debugging:
```sh
nerve eval path/to/evaluation --output results.json --trace eval-trace.jsonl
```

This captures all events during evaluation for analysis, including tool calls, variable changes, and execution flow.
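The trace is a JSONL file with one event per line, so it can be post-processed with ordinary JSON tooling. A minimal sketch of loading it for offline analysis (the event schema is Nerve's own, so any field names you filter on should be checked against a real trace):

```python
import json

def load_trace(path):
    """Parse a JSONL trace file: one JSON event object per non-empty line."""
    events = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                events.append(json.loads(line))
    return events
```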
Nerve supports three evaluation case formats.

The first is a YAML file of cases, suited to small test suites. Example:
```yaml
- level1:
    program: "A# #A"
- level2:
    program: "A# #B B# #A"
```

These variables are used like this in the agent:

```yaml
task: >
  Consider this program:
  {{ program }}
  Compute it step-by-step and submit the result.
```

This format is used in eval-ab.
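Each case's fields are interpolated into the task template Jinja-style. A minimal stand-in sketch (not Nerve's internal code) showing how a case's `program` variable lands in the prompt:

```python
import re

# The YAML cases above, expressed as plain dictionaries.
cases = {
    "level1": {"program": "A# #A"},
    "level2": {"program": "A# #B B# #A"},
}

task = (
    "Consider this program:\n"
    "{{ program }}\n"
    "Compute it step-by-step and submit the result."
)

def render(template, variables):
    """Replace each {{ name }} placeholder with the case's value for it."""
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda m: str(variables.get(m.group(1), m.group(0))),
        template,
    )

prompt = render(task, cases["level2"])
```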
The second format loads cases from a structured dataset, suited to large test suites. Example from eval-mmlu:
```yaml
task: >
  ## Question
  {{ question }}
  Use the `select_choice` tool to pick the right answer:
  {% for choice in choices %}
  - [{{ loop.index0 }}] {{ choice }}
  {% endfor %}
```

This format can use HuggingFace datasets (e.g., MMLU) directly.
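The `{% for %}` loop expands to one numbered line per answer choice, using Jinja's zero-based `loop.index0`. A sketch of what that expansion produces (the question and choices below are invented for illustration, not real dataset content):

```python
def render_choices(choices):
    """Expand the template's for-loop: one '- [index] choice' line per
    entry, with zero-based indices like Jinja's loop.index0."""
    return "\n".join(f"- [{i}] {choice}" for i, choice in enumerate(choices))

# Illustrative MMLU-style case (hypothetical content).
case = {
    "question": "Which planet is known as the Red Planet?",
    "choices": ["Venus", "Mars", "Jupiter", "Saturn"],
}

prompt = (
    f"## Question\n{case['question']}\n"
    "Use the `select_choice` tool to pick the right answer:\n"
    + render_choices(case["choices"])
)
```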
The third format organizes each case in its own folder:
```
cases/
  level0/
    input.txt
  level1/
    input.txt
```
Useful when tools/scripts dynamically load inputs. See eval-regex.
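A sketch of how a script could walk such a layout to collect case inputs (the loader is up to your own tooling; Nerve's internal handling may differ):

```python
from pathlib import Path

def load_folder_cases(root):
    """Map each case subfolder's name to the contents of its input.txt."""
    cases = {}
    for case_dir in sorted(Path(root).iterdir()):
        input_file = case_dir / "input.txt"
        if case_dir.is_dir() and input_file.is_file():
            cases[case_dir.name] = input_file.read_text()
    return cases
```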
Results are written to a .json file with details like:
- Case identifier
- Task outcome (success/failure)
- Runtime duration
- Agent/tool outputs
- You can define multiple runs per case for robustness
- Compatible with any agent setup (tools, MCP, workflows, etc.)
- All variables from each case are injected via `{{ ... }}`
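Because results land in plain JSON, they are easy to aggregate when comparing runs or model backends. A hedged sketch computing a pass rate, assuming a list of per-case objects; the `success` and `duration` keys here are illustrative, so check your actual results.json for the real field names:

```python
def summarize(results):
    """Aggregate pass count and mean duration over a list of case results.
    The `success`/`duration` keys are assumptions, not Nerve's fixed schema."""
    total = len(results)
    passed = sum(1 for r in results if r.get("success"))
    avg = (sum(r.get("duration", 0.0) for r in results) / total) if total else 0.0
    return {"total": total, "passed": passed, "avg_duration": avg}

# Usage: summarize(json.load(open("results.json")))
```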
Related documentation:

- concepts.md
- index.md: CLI usage
- mcp.md: when using remote agents or tools in evaluation