Zero-dependency eval harness for LLM and agent regression testing. Score outputs with `exact`, `contains`, `regex`, `token_f1`, `json_valid`, `json_field`, and `citation_coverage` checks. Ships a CLI and a small library API. No runtime dependencies — pure stdlib.
Python port of the @mukundakatta/ai-eval-forge npm package. Same check types, same output shape — cases files you wrote for the npm version work here unchanged.
```bash
pip install ai-eval-forge
```

```bash
aef score cases.jsonl
# or
ai-eval-forge score cases.jsonl --format markdown
```

Exits 0 on all pass, 1 on any failures, 2 on bad input.
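Those exit codes make it easy to gate a pipeline on the suite. A minimal sketch, assuming `aef` is on `PATH` (the wrapper script itself is illustrative, not part of the package):

```python
import subprocess
import sys

# Run the suite and propagate the harness's exit code:
# 0 = all pass, 1 = failures, 2 = bad input.
proc = subprocess.run(
    ["aef", "score", "cases.jsonl"],
    capture_output=True, text=True,
)
print(proc.stdout)
if proc.returncode != 0:
    print(proc.stderr, file=sys.stderr)
sys.exit(proc.returncode)
```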
Each case is a JSON object. The file can be either a JSON array or JSONL (one object per line).
{"id": "greeting", "actual": "hello world", "expected": "hello world"}
{"id": "json-output", "actual": "{\"user\":{\"name\":\"Alice\"}}", "checks":[{"type":"json_field","path":"user.name","value":"Alice"}]}
{"id": "cited", "actual": "See [src1] and [src2].", "sources":[{"id":"src1"},{"id":"src2"}], "checks":[{"type":"citation_coverage","min":1}]}| Type | What it does |
|---|---|
exact |
Normalized (lowercase, whitespace-collapsed) string equality. |
contains |
All listed substrings present in actual. Optional caseSensitive. |
regex |
Python regex match against actual. flags accepts i, m, s. |
token_f1 |
F1 over lowercase alphanumeric tokens. Default check if none specified. |
json_valid |
actual parses as valid JSON. |
json_field |
Parse JSON, drill into path, deep-equal against value. |
citation_coverage |
Fraction of source IDs from sources that appear inside actual. |
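For reference, `citation_coverage` scoring amounts to the following. This is an illustrative restatement of the rule in the table, not the library's actual code:

```python
def citation_coverage(actual: str, sources: list[dict]) -> float:
    """Fraction of source IDs that appear as substrings of `actual`.

    Illustrative restatement of the table above, not library internals.
    """
    ids = [s["id"] for s in sources]
    if not ids:
        return 0.0
    return sum(1 for sid in ids if sid in actual) / len(ids)

# The "cited" case above: both src1 and src2 appear, so coverage is 1.0.
print(citation_coverage("See [src1] and [src2].", [{"id": "src1"}, {"id": "src2"}]))
```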
Every check accepts `required` (default `true`) and `min` (default `1`). The case passes iff every required check has score >= `min`. The case's overall score is the average of all check scores.
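Spelled out as code, the pass/score rule looks like this (a minimal sketch of the rule above, not the library's implementation):

```python
def case_result(checks: list[dict]) -> tuple[bool, float]:
    """Pass iff every required check meets its min; the overall score
    is the mean of all check scores. Illustrative only.
    """
    passed = all(
        c["score"] >= c["min"] for c in checks if c.get("required", True)
    )
    overall = sum(c["score"] for c in checks) / len(checks) if checks else 0.0
    return passed, overall

# One required check that passes, one optional check that falls short:
checks = [
    {"score": 1.0, "min": 0.65, "required": True},
    {"score": 0.4, "min": 0.9, "required": False},
]
print(case_result(checks))  # (True, 0.7)
```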
```python
from ai_eval_forge import evaluate_suite, parse_cases, render_markdown
from pathlib import Path
cases = parse_cases(Path("cases.jsonl").read_text())
suite = evaluate_suite(cases)
print(render_markdown(suite))
print(f"Pass rate: {suite.summary.passRate:.0%}"){
"summary": {
"total": 2,
"passed": 1,
"failed": 1,
"passRate": 0.5,
"averageScore": 0.82,
"totalCostUsd": 0.0,
"averageLatencyMs": 0
},
"cases": [
{
"id": "greeting",
"passed": true,
"score": 1.0,
"checks": [{"type": "token_f1", "required": true, "passed": true, "score": 1.0, "min": 0.65, "detail": "token_f1=1.0"}],
"meta": {"input": null, "tags": [], "costUsd": 0, "latencyMs": 0}
}
]
}js_expressioncheck type is dropped. The JS version lets you run a JavaScript expression against case context. Python's equivalent (eval) is harder to sandbox, so the Python port omits this check type rather than ship a half-sandbox. If you need custom logic, useregexorjson_field— or extend the library via your ownrun_checkwrapper.
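For example, one way to layer a custom check on top of the built-in ones is to score the raw cases yourself and emit results in the same shape as the JSON above. Everything in this sketch (the `length` check, its field values, the merging strategy) is illustrative, not part of the ai-eval-forge API:

```python
import json
from pathlib import Path

def length_check(actual: str, max_chars: int = 500) -> dict:
    # Hypothetical custom check, shaped like the check results above.
    score = 1.0 if len(actual) <= max_chars else 0.0
    return {
        "type": "length",  # illustrative check name
        "required": True,
        "passed": score >= 1.0,
        "score": score,
        "min": 1.0,
        "detail": f"len={len(actual)} (max {max_chars})",
    }

# Score each raw case; merge these with the library's results as needed.
lines = Path("cases.jsonl").read_text().splitlines()
cases = [json.loads(line) for line in lines if line.strip()]
extra = {case["id"]: length_check(case.get("actual", "")) for case in cases}
```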
Everything else matches the npm package 1:1: same check types, same scoring formulas, same summary fields, same exit codes, same CLI flags.
```bash
pip install -e '.[dev]'
pytest
```

MIT.