> **Note:** This repository was archived by the owner on Feb 17, 2026. It is now read-only.

# Evaluation Mode

Nerve's evaluation mode is designed to make benchmarking and validating agents easy, reproducible, and formalized.

⚡ Unlike most tools in the LLM ecosystem, Nerve offers a built-in framework to test agents against structured cases, log results, and compare performance across models, providing a consistent formalism for agent evaluation.

## 🎯 Why Use It?

Evaluation mode is useful for:

- Verifying agent correctness during development
- Regression testing when updating prompts, tools, or models
- Comparing different model backends
- Collecting structured performance metrics

## 🧪 Running an Evaluation

Run an evaluation with:

```sh
nerve eval path/to/evaluation --output results.json
```

Each case is passed to the agent, and its results (e.g., completion status, duration, output) are saved to the `--output` file.
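Once a run finishes, the results file can be post-processed like any JSON document. Below is a minimal sketch, assuming `results.json` holds a list of objects with `success` and `duration` fields; the actual schema Nerve emits may differ:

```python
import json

# Hypothetical sketch: summarize a Nerve results file.
# The field names ("success", "duration") are assumptions,
# not Nerve's documented schema.
def summarize(results):
    total = len(results)
    passed = sum(1 for r in results if r.get("success"))
    avg = sum(r.get("duration", 0) for r in results) / total if total else 0.0
    return {"total": total, "passed": passed, "avg_duration": avg}

# Inline sample standing in for the contents of results.json.
results = json.loads(
    '[{"case": "level1", "success": true, "duration": 2.5},'
    ' {"case": "level2", "success": false, "duration": 4.0}]'
)
print(summarize(results))
```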

### Trace Integration

Evaluation runs can automatically generate trace files for debugging:

```sh
nerve eval path/to/evaluation --output results.json --trace eval-trace.jsonl
```

This captures all events during the evaluation for analysis, including tool calls, variable changes, and execution flow.
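A JSONL trace is one JSON object per line, which makes it easy to tally. The sketch below assumes each event carries a `type` field; that field name is an assumption about the trace schema, not documented behavior:

```python
import json
from collections import Counter

# Hypothetical sketch: tally event types in a JSONL trace.
# The "type" field name is an assumed part of the event schema.
def count_events(lines):
    counts = Counter()
    for line in lines:
        line = line.strip()
        if line:
            counts[json.loads(line).get("type", "unknown")] += 1
    return counts

# Inline sample standing in for lines of eval-trace.jsonl.
sample = [
    '{"type": "tool_call", "tool": "select_choice"}',
    '{"type": "variable_change", "name": "program"}',
    '{"type": "tool_call", "tool": "select_choice"}',
]
print(count_events(sample))
```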

## 🗂 Case Formats

Nerve supports three evaluation case formats:

### 1. `cases.yml`

For small test suites. Example:

```yaml
- level1:
    program: "A# #A"
- level2:
    program: "A# #B B# #A"
```

The case variables are then referenced in the agent's task:

```yaml
task: >
  Consider this program:

  {{ program }}

  Compute it step-by-step and submit the result.
```

This format is used in the `eval-ab` example.
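The `{{ program }}` placeholder is filled from each case before the task reaches the model. Nerve uses Jinja-style templates; the stand-in below only illustrates simple `{{ name }}` substitution and is not Nerve's actual template engine:

```python
import re

# Minimal stand-in for Jinja-style {{ name }} substitution,
# to illustrate how case variables reach the task prompt.
# Unknown placeholders are left untouched.
def render(template, variables):
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda m: str(variables.get(m.group(1), m.group(0))),
        template,
    )

case = {"program": "A# #A"}
task = ("Consider this program:\n\n{{ program }}\n\n"
        "Compute it step-by-step and submit the result.")
print(render(task, case))
```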

### 2. `cases.parquet`

For large, structured datasets. Example from `eval-mmlu`:

```yaml
task: >
  ## Question

  {{ question }}

  Use the `select_choice` tool to pick the right answer:
  {% for choice in choices %}
  - [{{ loop.index0 }}] {{ choice }}
  {% endfor %}
```

Hugging Face datasets (e.g., MMLU) can be used directly.
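The `{% for %}` loop above expands into one bullet per choice. As an illustration only, here is the equivalent rendering written out in plain Python; the real template is evaluated by Nerve's Jinja-style engine, with `question` and `choices` supplied by each dataset row:

```python
# Illustrative only: plain-Python equivalent of the template loop.
# In the real evaluation, Nerve renders the Jinja template instead.
def render_choices(question, choices):
    lines = [
        "## Question",
        "",
        question,
        "",
        "Use the `select_choice` tool to pick the right answer:",
    ]
    for i, choice in enumerate(choices):  # i plays the role of loop.index0
        lines.append(f"- [{i}] {choice}")
    return "\n".join(lines)

print(render_choices("What is 2 + 2?", ["3", "4", "5"]))
```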

### 3. Folder-based `cases/`

Organize each case in its own folder:

```
cases/
  level0/
    input.txt
  level1/
    input.txt
```

Useful when tools or scripts dynamically load inputs. See the `eval-regex` example.
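A loader for this layout can be sketched in a few lines. `load_cases` below is a hypothetical helper written for illustration, not part of Nerve:

```python
from pathlib import Path
import tempfile

# Hypothetical helper: treat each subfolder of cases/ as one case,
# reading its input.txt as the case input.
def load_cases(root):
    cases = {}
    for case_dir in sorted(Path(root).iterdir()):
        input_file = case_dir / "input.txt"
        if case_dir.is_dir() and input_file.exists():
            cases[case_dir.name] = input_file.read_text()
    return cases

# Build a throwaway cases/ layout to demonstrate.
with tempfile.TemporaryDirectory() as tmp:
    for name, text in [("level0", "A# #A"), ("level1", "A# #B B# #A")]:
        case_dir = Path(tmp) / name
        case_dir.mkdir()
        (case_dir / "input.txt").write_text(text)
    print(load_cases(tmp))
```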

## 🧪 Output

Results are written to a `.json` file with details like:

- Case identifier
- Task outcome (success/failure)
- Runtime duration
- Agent/tool outputs

## 📎 Notes

- You can define multiple runs per case for robustness
- Compatible with any agent setup (tools, MCP, workflows, etc.)
- All variables from each case are injected via `{{ ... }}`

## 🧭 Related Docs