Compare baseline vs current agent runs and surface regressions as structured reasons, not vibes. Answers "why does the agent feel worse?" with seven named signals:
- Success loss — cases that passed baseline, fail current
- New error signatures — error kinds present in current but absent in baseline
- Tool failure rises — tools that failed more often in current than baseline
- Output drift — for cases both passed, how much the final text changed (token-F1)
- Step bloat — step count ratio above threshold
- Latency bloat — wall-clock ratio above threshold
- Cost bloat — USD ratio above threshold
Zero runtime dependencies. Works on any agent framework — you hand it JSONL traces, it tells you what changed.
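The drift signal really is just a bag-of-tokens F1 over the two final outputs. As a rough sketch of what that scoring looks like (the function below is illustrative, not the tool's actual implementation):

```python
from collections import Counter

def token_f1(baseline_text: str, current_text: str) -> float:
    """Bag-of-tokens F1 between two final outputs (1.0 = identical token bags)."""
    base = Counter(baseline_text.lower().split())
    curr = Counter(current_text.lower().split())
    overlap = sum((base & curr).values())  # shared tokens, counted with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(curr.values())
    recall = overlap / sum(base.values())
    return 2 * precision * recall / (precision + recall)

# A case that passed in both runs is flagged as output drift when the
# score falls below the drift threshold (0.7 by default).
```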
```bash
pip install agent-run-diff
```

```bash
agent-run-diff baseline.jsonl current.jsonl
```

Exits 0 on a clean comparison, 1 on any regression, 2 on bad input. The markdown report prints to stdout:

```markdown
# Agent Regression Report
Pass rate: baseline **92%** → current **78%** (-14pp)
Total latency: 48,200 ms → 81,400 ms
Total cost: $1.2400 → $2.1800
## Summary
- Success losses: **3**
- New error kinds: **1**
- Tool failure rises: **2**
- Step bloat cases: **4**
- Latency bloat: **6**
- Cost bloat: **5**
## Success losses (was passing, now not)
- `login-flow-happy-path` — status: success → failed
...
## Tool failures rising
| Tool | Baseline | Current | Sample errors |
| --- | ---: | ---: | --- |
| browser.click | 0 | 4 | `Element not found: #submit` |
...
```

A run is one JSON object. The shape is lenient: snake_case or camelCase keys are accepted, along with several common aliases:

```json
{
"run_id": "run-2026-04-24-abc",
"case_id": "login-flow-happy-path",
"status": "success",
"final_output": "Logged in as alice.",
"steps": [
{
"type": "tool_call",
"tool_name": "browser.click",
"tool_args": {"selector": "#submit"},
"error": null,
"latency_ms": 420,
"cost_usd": 0.001
}
],
"total_latency_ms": 2300,
"total_cost_usd": 0.08
}
```

Aliases recognized:

- `case_id` | `caseId` | `test_id` | `testId` (falls back to `run_id` if none is present)
- `status` | `outcome` | `result.status`: values like `pass`/`passed`/`ok`/`success` all normalize to `"success"`
- `final_output` | `finalOutput` | `output`
- `steps` | `trace` | `events`
- per-step: `tool_name`/`toolName`/`name`, `error`/`error.message`, `latency_ms`/`latencyMs`/`duration_ms`
- totals: `total_cost_usd`/`totalCostUsd`, `total_latency_ms`/`totalLatencyMs` (auto-summed from steps if absent)
Runs are matched between baseline and current by `case_id`.
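To make the leniency concrete, here is a rough sketch of the kind of normalization described above; the helper names are illustrative, not part of the library's API:

```python
def case_key(run: dict) -> str:
    """Identifier used to pair a baseline run with a current run,
    with the documented fallback to run_id."""
    for key in ("case_id", "caseId", "test_id", "testId", "run_id"):
        value = run.get(key)
        if value:
            return str(value)
    raise ValueError("run has no case_id or run_id")

def normalized_status(run: dict) -> str:
    """Collapse pass/passed/ok/success spellings into 'success'."""
    raw = (
        run.get("status")
        or run.get("outcome")
        or (run.get("result") or {}).get("status")
        or ""
    )
    raw = str(raw).lower()
    return "success" if raw in {"pass", "passed", "ok", "success"} else raw
```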
Defaults are transparent (1.3x bloat, 0.7 output-drift F1) and overridable:
```bash
agent-run-diff base.jsonl curr.jsonl \
  --step-ratio 1.5 \
  --latency-ratio 2.0 \
  --cost-ratio 1.2 \
  --output-drift-f1 0.6 \
  --min-latency-ms 200 \
  --min-cost-usd 0.01
```

The `--min-latency-ms` and `--min-cost-usd` floors prevent noise on trivially short or trivially cheap runs.
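Each bloat signal boils down to a ratio test gated by a floor. A minimal sketch of the latency case, with illustrative names (the floor value is whatever you pass via `--min-latency-ms`):

```python
def latency_bloat(baseline_ms: float, current_ms: float,
                  ratio: float = 1.3, min_latency_ms: float = 0.0) -> bool:
    """True when the current run is slower than baseline by more than the
    allowed ratio, ignoring runs below the noise floor."""
    if current_ms < min_latency_ms or baseline_ms <= 0:
        return False
    return current_ms / baseline_ms > ratio
```

Cost bloat follows the same pattern with `--cost-ratio` and `--min-cost-usd`.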
A typical GitHub Actions step:

```yaml
- name: Compare against baseline
  run: |
    agent-run-diff \
      baselines/2026-04-20.jsonl \
      runs/$GITHUB_SHA.jsonl \
      --format json > regression.json
```

The workflow step fails when any regression is detected (exit 1). Attach `regression.json` as an artifact to make the seven signals browsable in the PR review.
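Outside of CI you can drive the same flow from a script, branching on the documented exit codes (0 clean, 1 regression, 2 bad input); the file paths below are placeholders:

```python
import subprocess
import sys

result = subprocess.run(
    ["agent-run-diff", "baseline.jsonl", "current.jsonl", "--format", "json"],
    capture_output=True,
    text=True,
)
if result.returncode == 2:
    sys.exit(f"agent-run-diff rejected the input: {result.stderr.strip()}")

with open("regression.json", "w") as fh:
    fh.write(result.stdout)   # full report with the seven signals

sys.exit(result.returncode)   # 0 = clean, 1 = regressions found
```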
- Not semantic similarity. Output drift is token-F1, not embeddings. Flagging is "text changed a lot," not "meaning changed." Labeled as such in every code path.
- Not a framework. The tool never touches your agent, never calls an LLM, never runs your tests. It only reads traces you already have.
- No pricing built in. If your traces include `cost_usd`, we use it; otherwise cost bloat is skipped. Pricing is your framework's job, not this tool's.
The same comparison is available from Python:

```python
from pathlib import Path

from agent_run_diff import parse_runs_file, analyze, render_markdown, Thresholds

baseline = parse_runs_file("baseline.jsonl")
current = parse_runs_file("current.jsonl")

report = analyze(baseline, current, thresholds=Thresholds(latency_ratio=1.5))
print(render_markdown(report))

if report.has_regressions:
    raise SystemExit(1)
```

MIT.