Compare baseline vs current agent runs and surface regressions as structured reasons, not vibes. Answers "why does the agent feel worse?" with seven named signals:
- Success loss — cases that passed baseline, fail current
- New error signatures — error kinds present in current but absent in baseline
- Tool failure rises — tools that failed more often in current than baseline
- Output drift — for cases both passed, how much the final text changed (token-F1)
- Step bloat — step count ratio above threshold
- Latency bloat — wall-clock ratio above threshold
- Cost bloat — USD ratio above threshold
Zero runtime dependencies. Works on any agent framework — you hand it JSONL traces, it tells you what changed.
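The drift signal really is just a bag-of-tokens F1 over the two final outputs. As a rough sketch of what that scoring looks like (the function below is illustrative, not the tool's actual implementation):

```python
from collections import Counter

def token_f1(baseline_text: str, current_text: str) -> float:
    """Bag-of-tokens F1 between two final outputs (1.0 = identical token bags)."""
    base = Counter(baseline_text.lower().split())
    curr = Counter(current_text.lower().split())
    overlap = sum((base & curr).values())  # shared tokens, counted with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(curr.values())
    recall = overlap / sum(base.values())
    return 2 * precision * recall / (precision + recall)

# A case that passed in both runs is flagged as output drift when the
# score falls below the drift threshold (0.7 by default).
```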
```bash
pip install agent-run-diff
```

```bash
agent-run-diff baseline.jsonl current.jsonl
```

Exits 0 on a clean comparison, 1 on any regression, 2 on bad input. The markdown report prints to stdout:

```markdown
# Agent Regression Report
Pass rate: baseline **92%** → current **78%** (-14pp)
Total latency: 48,200 ms → 81,400 ms
Total cost: $1.2400 → $2.1800
## Summary
- Success losses: **3**
- New error kinds: **1**
- Tool failure rises: **2**
- Step bloat cases: **4**
- Latency bloat: **6**
- Cost bloat: **5**
## Success losses (was passing, now not)
- `login-flow-happy-path` — status: success → failed
...
## Tool failures rising
| Tool | Baseline | Current | Sample errors |
| --- | ---: | ---: | --- |
| browser.click | 0 | 4 | `Element not found: #submit` |
...
```

A run is one JSON object. The shape is lenient: snake_case or camelCase keys are accepted, along with several common aliases:

```json
{
"run_id": "run-2026-04-24-abc",
"case_id": "login-flow-happy-path",
"status": "success",
"final_output": "Logged in as alice.",
"steps": [
{
"type": "tool_call",
"tool_name": "browser.click",
"tool_args": {"selector": "#submit"},
"error": null,
"latency_ms": 420,
"cost_usd": 0.001
}
],
"total_latency_ms": 2300,
"total_cost_usd": 0.08
}
```

Aliases recognized:

- `case_id` | `caseId` | `test_id` | `testId` (falls back to `run_id` if none is present)
- `status` | `outcome` | `result.status`: values like `pass`/`passed`/`ok`/`success` all normalize to `"success"`
- `final_output` | `finalOutput` | `output`
- `steps` | `trace` | `events`
- per-step: `tool_name`/`toolName`/`name`, `error`/`error.message`, `latency_ms`/`latencyMs`/`duration_ms`
- totals: `total_cost_usd`/`totalCostUsd`, `total_latency_ms`/`totalLatencyMs` (auto-summed from steps if absent)
Runs are matched between baseline and current by `case_id`.
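To make the leniency concrete, here is a rough sketch of the kind of normalization described above; the helper names are illustrative, not part of the library's API:

```python
def case_key(run: dict) -> str:
    """Identifier used to pair a baseline run with a current run,
    with the documented fallback to run_id."""
    for key in ("case_id", "caseId", "test_id", "testId", "run_id"):
        value = run.get(key)
        if value:
            return str(value)
    raise ValueError("run has no case_id or run_id")

def normalized_status(run: dict) -> str:
    """Collapse pass/passed/ok/success spellings into 'success'."""
    raw = (
        run.get("status")
        or run.get("outcome")
        or (run.get("result") or {}).get("status")
        or ""
    )
    raw = str(raw).lower()
    return "success" if raw in {"pass", "passed", "ok", "success"} else raw
```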
Defaults are transparent (1.3x bloat, 0.7 output-drift F1) and overridable:
```bash
agent-run-diff base.jsonl curr.jsonl \
  --step-ratio 1.5 \
  --latency-ratio 2.0 \
  --cost-ratio 1.2 \
  --output-drift-f1 0.6 \
  --min-latency-ms 200 \
  --min-cost-usd 0.01
```

The `--min-latency-ms` and `--min-cost-usd` floors prevent noise on trivially short or trivially cheap runs.
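Each bloat signal boils down to a ratio test gated by a floor. A minimal sketch of the latency case, with illustrative names (the floor value is whatever you pass via `--min-latency-ms`):

```python
def latency_bloat(baseline_ms: float, current_ms: float,
                  ratio: float = 1.3, min_latency_ms: float = 0.0) -> bool:
    """True when the current run is slower than baseline by more than the
    allowed ratio, ignoring runs below the noise floor."""
    if current_ms < min_latency_ms or baseline_ms <= 0:
        return False
    return current_ms / baseline_ms > ratio
```

Cost bloat follows the same pattern with `--cost-ratio` and `--min-cost-usd`.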
A typical GitHub Actions step:

```yaml
- name: Compare against baseline
  run: |
    agent-run-diff \
      baselines/2026-04-20.jsonl \
      runs/$GITHUB_SHA.jsonl \
      --format json > regression.json
```

The workflow step fails when any regression is detected (exit 1). Attach `regression.json` as an artifact to make the seven signals browsable in the PR review.
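Outside of CI you can drive the same flow from a script, branching on the documented exit codes (0 clean, 1 regression, 2 bad input); the file paths below are placeholders:

```python
import subprocess
import sys

result = subprocess.run(
    ["agent-run-diff", "baseline.jsonl", "current.jsonl", "--format", "json"],
    capture_output=True,
    text=True,
)
if result.returncode == 2:
    sys.exit(f"agent-run-diff rejected the input: {result.stderr.strip()}")

with open("regression.json", "w") as fh:
    fh.write(result.stdout)   # full report with the seven signals

sys.exit(result.returncode)   # 0 = clean, 1 = regressions found
```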
- Not semantic similarity. Output drift is token-F1, not embeddings. Flagging is "text changed a lot," not "meaning changed." Labeled as such in every code path.
- Not a framework. The tool never touches your agent, never calls an LLM, never runs your tests. It only reads traces you already have.
- No pricing built in. If your traces include `cost_usd`, we use it; otherwise cost bloat is skipped. Pricing is your framework's job, not this tool's.
The same comparison is available from Python:

```python
from pathlib import Path

from agent_run_diff import parse_runs_file, analyze, render_markdown, Thresholds

baseline = parse_runs_file("baseline.jsonl")
current = parse_runs_file("current.jsonl")

report = analyze(baseline, current, thresholds=Thresholds(latency_ratio=1.5))
print(render_markdown(report))

if report.has_regressions:
    raise SystemExit(1)
```

MIT.