
agent-run-diff


Compare baseline vs current agent runs and surface regressions as structured reasons, not vibes. It answers "why does the agent feel worse?" with seven named signals:

  1. Success loss — cases that passed baseline, fail current
  2. New error signatures — error kinds present in current but absent in baseline
  3. Tool failure rises — tools that failed more often in current than baseline
  4. Output drift — for cases that passed in both runs, how much the final text changed (token-F1; see the sketch below)
  5. Step bloat — step count ratio above threshold
  6. Latency bloat — wall-clock ratio above threshold
  7. Cost bloat — USD ratio above threshold

Zero runtime dependencies. Works on any agent framework — you hand it JSONL traces, it tells you what changed.
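For a feel of signal 4, here is a minimal sketch of bag-of-tokens F1, the standard SQuAD-style metric; the package's exact tokenization and edge-case handling may differ:

from collections import Counter

def token_f1(baseline_text: str, current_text: str) -> float:
    """Bag-of-tokens F1: 1.0 means identical token multisets, 0.0 means disjoint."""
    base = Counter(baseline_text.split())   # assumes whitespace tokenization
    curr = Counter(current_text.split())
    overlap = sum((base & curr).values())   # size of the multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(curr.values())
    recall = overlap / sum(base.values())
    return 2 * precision * recall / (precision + recall)

A score below the --output-drift-f1 threshold (default 0.7) is what would flag drift.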

Install

pip install agent-run-diff

Quick start

agent-run-diff baseline.jsonl current.jsonl

Exits 0 on a clean comparison, 1 on any regression, and 2 on bad input, which makes shell gating easy (see the snippet after the sample report). The markdown report prints to stdout:

# Agent Regression Report

Pass rate: baseline **92%** → current **78%** (-14pp)
Total latency: 48,200 ms → 81,400 ms
Total cost:    $1.2400 → $2.1800

## Summary
- Success losses:    **3**
- New error kinds:   **1**
- Tool failure rises: **2**
- Step bloat cases:  **4**
- Latency bloat:     **6**
- Cost bloat:        **5**

## Success losses (was passing, now not)
- `login-flow-happy-path` — status: success → failed
...

## Tool failures rising
| Tool | Baseline | Current | Sample errors |
| --- | ---: | ---: | --- |
| browser.click | 0 | 4 | `Element not found: #submit` |
...
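
Because regressions map to exit code 1, a plain shell wrapper can gate on the result. A minimal sketch; file names are illustrative:

# exit 1 means regressions; exit 2 means bad input
agent-run-diff baseline.jsonl current.jsonl > report.md
status=$?
if [ "$status" -eq 1 ]; then
  echo "Regressions detected; see report.md"
fi
exit "$status"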

Trace format

A run is one JSON object. The shape is lenient: the parser accepts snake_case or camelCase keys, plus several common aliases:

{
  "run_id": "run-2026-04-24-abc",
  "case_id": "login-flow-happy-path",
  "status": "success",
  "final_output": "Logged in as alice.",
  "steps": [
    {
      "type": "tool_call",
      "tool_name": "browser.click",
      "tool_args": {"selector": "#submit"},
      "error": null,
      "latency_ms": 420,
      "cost_usd": 0.001
    }
  ],
  "total_latency_ms": 2300,
  "total_cost_usd": 0.08
}

Aliases recognized:

  • case_id | caseId | test_id | testId (falls back to run_id if none present)
  • status | outcome | result.status — values like pass/passed/ok/success all normalize to "success"
  • final_output | finalOutput | output
  • steps | trace | events
  • per-step: tool_name/toolName/name, error/error.message, latency_ms/latencyMs/duration_ms
  • totals: total_cost_usd/totalCostUsd, total_latency_ms/totalLatencyMs (auto-summed from steps if absent)

Runs are matched between baseline and current by case_id.
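
Emitting conforming traces from any harness is one line per run. A minimal sketch in Python, using only the documented canonical field names (the surrounding harness is assumed):

import json

run = {
    "run_id": "run-2026-04-24-abc",
    "case_id": "login-flow-happy-path",
    "status": "success",
    "final_output": "Logged in as alice.",
    "steps": [
        {
            "type": "tool_call",
            "tool_name": "browser.click",
            "tool_args": {"selector": "#submit"},
            "error": None,
            "latency_ms": 420,
            "cost_usd": 0.001,
        }
    ],
    # total_latency_ms / total_cost_usd omitted: auto-summed from steps
}

with open("current.jsonl", "a") as f:   # JSONL: one object per line
    f.write(json.dumps(run) + "\n")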

Configuring thresholds

Defaults are transparent (1.3x for the bloat ratios, 0.7 token-F1 for output drift) and overridable:

agent-run-diff base.jsonl curr.jsonl \
  --step-ratio 1.5 \
  --latency-ratio 2.0 \
  --cost-ratio 1.2 \
  --output-drift-f1 0.6 \
  --min-latency-ms 200 \
  --min-cost-usd 0.01

The --min-latency-ms and --min-cost-usd floors prevent noise on trivially short or trivially cheap runs, as sketched below.
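
In rough terms, each bloat signal is a ratio check gated by a floor. A hypothetical illustration for latency (the package's exact comparison may differ):

def latency_bloat(baseline_ms: float, current_ms: float,
                  ratio: float = 1.3, floor_ms: float = 200.0) -> bool:
    # Hypothetical: runs below the noise floor are never flagged
    if current_ms < floor_ms:
        return False
    # Hypothetical: a zero or absent baseline yields no meaningful ratio
    if baseline_ms <= 0:
        return False
    return current_ms / baseline_ms > ratio

The same shape presumably applies to --step-ratio and --cost-ratio with their respective floors.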

CI usage

- name: Compare against baseline
  run: |
    agent-run-diff \
      baselines/2026-04-20.jsonl \
      runs/$GITHUB_SHA.jsonl \
      --format json > regression.json

The workflow step fails when any regression is detected (exit 1). Attach regression.json as an artifact to make the seven signals browsable in the PR review.
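
One way to attach it, using the stock actions/upload-artifact action (if: always() keeps the report even when the diff step fails the job):

- name: Upload regression report
  if: always()
  uses: actions/upload-artifact@v4
  with:
    name: regression-report
    path: regression.json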

Honest scope

  • Not semantic similarity. Output drift is token-F1, not embeddings. Flagging is "text changed a lot," not "meaning changed." Labeled as such in every code path.
  • Not a framework. The tool never touches your agent, never calls an LLM, never runs your tests. It only reads traces you already have.
  • No pricing built-in. If your traces include cost_usd we use it; otherwise cost bloat is skipped. Pricing is your framework's job, not this tool's.

Library API

from agent_run_diff import parse_runs_file, analyze, render_markdown, Thresholds

baseline = parse_runs_file("baseline.jsonl")
current = parse_runs_file("current.jsonl")
report = analyze(baseline, current, thresholds=Thresholds(latency_ratio=1.5))

print(render_markdown(report))
if report.has_regressions:
    raise SystemExit(1)

License

MIT.
