
feat: kick off evaluation harness foundation #93

Merged
dgarson merged 11 commits into dgarson/fork from feat/evaluation-harness
Feb 24, 2026

dgarson (Owner) commented Feb 23, 2026

Summary

  • scaffolded additive evaluation harness primitives under src/evals
  • added evaluation case model + runner interface + basic report output path writer
  • added initial harness tests and one sample smoke evaluation case
  • added a workstream plan note documenting how to extend suites

What was added

  • src/evals/types.ts: core model/runner/report contracts
  • src/evals/runner.ts: BasicEvaluationRunner sequential execution and failure capture
  • src/evals/report.ts: run id + default <baseDir>/reports/evals/<runId>.json path + file write
  • src/evals/sample-case.ts: baseline sample case (sample.echo-smoke)
  • tests:
    • src/evals/runner.test.ts
    • src/evals/report.test.ts
    • src/evals/sample-case.test.ts
  • docs:
    • docs/experiments/plans/evaluation-harness-kickoff.md
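
A minimal sketch of the pieces listed above, for readers who have not opened the diff. It is not the code in this PR: `EvalCase`, `EvalResult`, `runSequentially`, and `defaultReportPath` are hypothetical names, and the convention that a case signals failure by throwing is an assumption. Only the sequential execution, the failure capture, and the `<baseDir>/reports/evals/<runId>.json` layout come from the description.

```typescript
// Hypothetical shapes; the real contracts live in src/evals/types.ts and may differ.
interface EvalCase {
  id: string;
  // Assumption: a case signals failure by throwing (or rejecting).
  run: () => Promise<void> | void;
}

interface EvalResult {
  id: string;
  passed: boolean;
  error?: string;
  durationMs: number;
}

// Runs cases one at a time. A thrown error is recorded on the result
// instead of aborting the run, so later cases still execute.
async function runSequentially(cases: readonly EvalCase[]): Promise<EvalResult[]> {
  const results: EvalResult[] = [];
  for (const c of cases) {
    const start = Date.now();
    try {
      await c.run();
      results.push({ id: c.id, passed: true, durationMs: Date.now() - start });
    } catch (err) {
      results.push({
        id: c.id,
        passed: false,
        error: err instanceof Error ? err.message : String(err),
        durationMs: Date.now() - start,
      });
    }
  }
  return results;
}

// Builds the default report location: <baseDir>/reports/evals/<runId>.json.
// Trailing slashes on baseDir are dropped so the path has no doubled separator.
function defaultReportPath(baseDir: string, runId: string): string {
  return `${baseDir.replace(/\/+$/, "")}/reports/evals/${runId}.json`;
}
```

Capturing failures per case keeps one broken case from hiding the results of the cases after it, which is probably what a report writer wants as input.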

Validation

  • pnpm vitest run src/evals/*.test.ts
  • pnpm oxfmt --check src/evals/*.ts docs/experiments/plans/evaluation-harness-kickoff.md

Next steps

  • wire a CLI command to run suites by suite/tag/case id
  • add per-suite aggregate metrics and CI-friendly output formats (junit/jsonl)
  • add bounded-concurrency execution mode for non-order-dependent suites
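
For the bounded-concurrency item, one possible shape is a small worker pool. This is a sketch under assumptions, not a committed design: `mapWithConcurrency` is a hypothetical helper, and it assumes each task captures its own failure (a rejected task would reject the whole call).

```typescript
// Runs `task` over `items` with at most `limit` tasks in flight.
// Results are stored by input index, so output order matches input order
// regardless of which task finishes first.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  task: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results = new Array<R>(items.length);
  let next = 0; // index of the next unclaimed item, shared by all workers

  async function worker(): Promise<void> {
    while (next < items.length) {
      const index = next++; // claim synchronously, before any await
      results[index] = await task(items[index] as T, index);
    }
  }

  // Start no more workers than there are items.
  const workerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

With `limit` set to 1 this degenerates to sequential execution, so order-dependent suites could keep using the same code path.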

dgarson (Owner, Author) commented Feb 23, 2026

Routing: Tim → Xavier → merge

PR #93: Evaluation harness foundation (feat/evaluation-harness → dgarson/fork). This targets dgarson/fork directly.

Tim: architecture pass first — is the EvalCase model + runner interface the right shape for the harness we need? Check test coverage (does the sample smoke eval actually validate the runner interface?). If architecture looks right, escalate to Xavier for final merge approval. This is Phase 1 of bs-tim-7 (Evaluation Harness + Gates).

dgarson (Owner, Author) commented Feb 23, 2026

Run Codex sweep first, then request final architecture review.

dgarson (Owner, Author) commented Feb 23, 2026

Progress Update (from Jerry)

eh-002 Complete ✓

Added scenario catalog schema + 2 representative benchmark scenarios:

New files:

  • src/evals/catalog.ts — Scenario catalog schema/metadata contracts
  • src/evals/cases/hitl-escalation.ts — HITL escalation + timeout scenarios
  • src/evals/cases/memory-recall.ts — Memory recall + path traversal scenarios
  • src/evals/fixtures.ts — CI-friendly fixture loading
  • src/evals/catalog.test.ts — 18 validation tests

Scenarios added (4 total):

| ID | Category | Difficulty |
| --- | --- | --- |
| hitl.escalation-smoke | hitl | integration |
| hitl.timeout-handling | hitl | integration |
| memory.recall-context | memory | integration |
| memory.path-traversal | memory | integration |

eh-003 Seed (5-scenario target)

Currently at 4 scenarios. Need 1 more for target:

  • ✓ HITL (2 scenarios)
  • ✓ Memory (2 scenarios)
  • ⬜ Tool reliability (1 needed)
  • ⬜ Agent spawning (optional)

Gate Implications

  • validateCatalog() — validates all scenarios have required metadata, can gate CI
  • loadScenariosByCategory/difficulty/suite — enables targeted CI runs (smoke → unit → integration → e2e)
  • getCatalogStats() — enables coverage reporting by category/difficulty
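
To make the gate idea concrete, here is a rough sketch of a metadata check and a per-category count. The functions are hypothetical stand-ins for `validateCatalog()` and `getCatalogStats()`, whose real signatures are in `src/evals/catalog.ts` and `src/evals/fixtures.ts`. The `ScenarioMeta` fields and the four difficulty tiers are assumptions drawn from the scenario table and the smoke → unit → integration → e2e ordering above.

```typescript
// Assumed tiers, taken from the smoke → unit → integration → e2e ordering.
type Difficulty = "smoke" | "unit" | "integration" | "e2e";

// Hypothetical metadata shape; the real schema may carry more fields.
interface ScenarioMeta {
  id: string;
  category: string;
  difficulty: Difficulty;
}

// The four scenarios listed in this update.
const seedScenarios: ScenarioMeta[] = [
  { id: "hitl.escalation-smoke", category: "hitl", difficulty: "integration" },
  { id: "hitl.timeout-handling", category: "hitl", difficulty: "integration" },
  { id: "memory.recall-context", category: "memory", difficulty: "integration" },
  { id: "memory.path-traversal", category: "memory", difficulty: "integration" },
];

// Returns one message per problem; an empty array means the catalog passes.
function findMetadataProblems(scenarios: readonly Partial<ScenarioMeta>[]): string[] {
  const problems: string[] = [];
  const seen = new Set<string>();
  scenarios.forEach((s, i) => {
    const label = s.id || `#${i}`;
    if (!s.id) problems.push(`${label}: missing id`);
    else if (seen.has(s.id)) problems.push(`${label}: duplicate id`);
    else seen.add(s.id);
    if (!s.category) problems.push(`${label}: missing category`);
    if (!s.difficulty) problems.push(`${label}: missing difficulty`);
  });
  return problems;
}

// Counts scenarios per category, for coverage reporting.
function countByCategory(scenarios: readonly ScenarioMeta[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const s of scenarios) {
    counts[s.category] = (counts[s.category] ?? 0) + 1;
  }
  return counts;
}
```

A CI step could fail the build whenever `findMetadataProblems(seedScenarios)` returns a non-empty array, and print the messages as the failure reason.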

Validation

  • pnpm vitest run src/evals/*.test.ts — 23 tests pass
  • pnpm lint — clean

Full PR: #96

dgarson (Owner, Author) commented Feb 23, 2026

eh-003 Follow-up Complete (via PR #97)

Building on the scenario catalog schema from #96, added the tool-reliability benchmark scenario:

What was added:

  • 4 new tool-reliability scenarios in src/evals/cases/tool-reliability.ts:
    • tool-reliability.dispatch-smoke — basic tool dispatch structure validation
    • tool-reliability.timeout-handling — tool timeout detection and handling
    • tool-reliability.failure-recovery — fallback behavior on tool failure
    • tool-reliability.result-validation — tool result structure validation

Wired through existing catalog + fixture loading utilities (fixtures.ts).

Scenario count: 8 total (HITL 2, Memory 2, Tool reliability 4)

Gate Impact:

  • loadScenariosByCategory('tool-reliability') now returns 4 scenarios
  • getCatalogStats().byCategory['tool-reliability'] returns 4
  • Catalog validation (validateCatalog()) gates CI until all scenarios pass metadata schema

5-scenario target now met with 8 total scenarios.

…rios (11 new cases)

- catalog.ts: ScenarioCatalog schema with category/difficulty grouping,
  CataloguedEvaluationCase type, filterCatalog, validateScenarioMetadata
- fixtures.ts: createDefaultCatalog, loadScenarios*, getCatalogStats CI loaders
- cases/hitl-escalation.ts: escalation-smoke + timeout-handling (ported from jerry/eval-scenario-schema)
- cases/memory-recall.ts: recall-context + path-traversal (ported from jerry/eval-scenario-schema)
- cases/tool-reliability.ts: 6 new deterministic scenarios
  - dispatch-success: happy-path tool invocation
  - dispatch-unknown-tool: graceful structured error on unknown name
  - retry-on-transient-failure: retry succeeds on nth attempt
  - max-retries-exhausted: retry gives up cleanly at limit
  - timeout-abort: AbortSignal propagation cuts execution early
  - result-schema-validation: validates ok/error result shape contracts
- cases/agent-spawning.ts: 5 new deterministic scenarios
  - basic-spawn-and-complete: child agent lifecycle + result accessible to parent
  - depth-limit-enforcement: spawn rejected beyond MAX_SPAWN_DEPTH=3
  - result-routing-to-requester: result delivered to correct session only
  - orphan-cleanup-on-parent-kill: recursive kill propagates to all descendants
  - parallel-completion-ordering: N concurrent agents collect without loss
- catalog.test.ts: 37 tests covering catalog build, filter, validation,
  per-category smoke runs (all 42 suite tests green)
- index.ts: updated exports for all new symbols

All scenarios: deterministic, no LLM calls, no external services.
Catalog now: 16 cases across 5 categories.
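
As an illustration of what "deterministic, no LLM calls" can look like for the retry-on-transient-failure and max-retries-exhausted cases, here is a small sketch. It is not the code in cases/tool-reliability.ts: `callWithRetry`, `flakyTool`, and the `ToolResult` shape are hypothetical, and backoff delays are left out on purpose so the outcome depends only on call counts.

```typescript
// Hypothetical ok/error result shape, loosely following the
// "ok/error result shape contracts" wording above.
type ToolResult<T> =
  | { ok: true; value: T; attempts: number }
  | { ok: false; error: string; attempts: number };

// Calls `tool` up to `maxAttempts` times and returns a structured result
// instead of throwing. No sleep between attempts, so runs are repeatable.
async function callWithRetry<T>(
  tool: () => Promise<T>,
  maxAttempts: number,
): Promise<ToolResult<T>> {
  let lastError = "no attempts made";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return { ok: true, value: await tool(), attempts: attempt };
    } catch (err) {
      lastError = err instanceof Error ? err.message : String(err);
    }
  }
  return { ok: false, error: lastError, attempts: Math.max(0, maxAttempts) };
}

// Deterministic fake tool: rejects on the first `failures` calls, then resolves.
function flakyTool(failures: number): () => Promise<string> {
  let calls = 0;
  return async () => {
    calls++;
    if (calls <= failures) throw new Error(`transient failure ${calls}`);
    return "done";
  };
}
```

Under these assumptions, `callWithRetry(flakyTool(2), 3)` succeeds on the third attempt, while `callWithRetry(flakyTool(5), 3)` gives up after three attempts and reports the last error.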
- export-jsonl.ts: writeEvaluationJsonl writes each case as separate JSON line
- writeEvaluationJsonlSummary writes aggregate metrics in single line
- computeMetrics calculates per-suite, per-category, per-difficulty pass rates
- Supports append mode for incremental CI reporting
- 7 new tests covering path resolution, append, summary, and metrics
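
A sketch of the JSONL idea described in the bullets above, kept to pure functions so it runs without a filesystem. The names `CaseOutcome`, `toJsonl`, and `passRateBy` are hypothetical and are not the exports of export-jsonl.ts; the field names are assumptions.

```typescript
// Hypothetical per-case record; the real report type may differ.
interface CaseOutcome {
  id: string;
  suite: string;
  category: string;
  passed: boolean;
}

interface GroupRate {
  passed: number;
  total: number;
  rate: number; // passed / total
}

// One JSON object per line. Every line is newline-terminated, so text
// produced by a later call can be appended and the file stays valid JSONL.
function toJsonl(outcomes: readonly CaseOutcome[]): string {
  return outcomes.map((o) => JSON.stringify(o) + "\n").join("");
}

// Pass rate grouped by one string field of the outcome.
function passRateBy(
  outcomes: readonly CaseOutcome[],
  key: "suite" | "category",
): Record<string, GroupRate> {
  const groups: Record<string, GroupRate> = {};
  for (const o of outcomes) {
    const g = groups[o[key]] ?? { passed: 0, total: 0, rate: 0 };
    g.total += 1;
    if (o.passed) g.passed += 1;
    g.rate = g.passed / g.total;
    groups[o[key]] = g;
  }
  return groups;
}
```

Writing the string to disk is left out. In Node, passing the output of `toJsonl` to `fs.appendFile` would be one way to get the incremental append behaviour.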
…L + memory) (#96)

- Add scenario catalog schema/metadata contracts in catalog.ts
- Add HITL benchmark scenarios (escalation + timeout handling)
- Add Memory path benchmark scenarios (context recall + path traversal)
- Add CI-friendly fixture loading in fixtures.ts
- Add comprehensive validation tests for catalog functionality
- Export all new types and cases from index.ts

Implements eh-002 with seed for eh-003 (5-scenario target)
* feat(evals): add scenario catalog schema + 2 benchmark scenarios (HITL + memory)

- Add scenario catalog schema/metadata contracts in catalog.ts
- Add HITL benchmark scenarios (escalation + timeout handling)
- Add Memory path benchmark scenarios (context recall + path traversal)
- Add CI-friendly fixture loading in fixtures.ts
- Add comprehensive validation tests for catalog functionality
- Export all new types and cases from index.ts

Implements eh-002 with seed for eh-003 (5-scenario target)

* feat(evals): add tool-reliability benchmark scenario (4 cases)

Adds the tool-reliability benchmark scenario to reach the 5-scenario target:
- tool-reliability.dispatch-smoke: basic tool dispatch structure validation
- tool-reliability.timeout-handling: tool timeout detection and handling
- tool-reliability.failure-recovery: fallback behavior on tool failure
- tool-reliability.result-validation: tool result structure validation

Wired through existing catalog + fixture loading utilities.
Updated tests to cover new scenarios.

eh-003 complete: 5-scenario target now met (was 4, added 4)

dgarson merged commit 8be0d62 into dgarson/fork on Feb 24, 2026
2 of 9 checks passed