feat: kick off evaluation harness foundation #93
Routing: Tim → Xavier → merge
PR #93: Evaluation harness foundation
Tim: architecture pass first. Is the EvalCase model + runner interface the right shape for the harness we need? Check test coverage: does the sample smoke eval actually validate the runner interface? If the architecture looks right, escalate to Xavier for final merge approval. This is Phase 1 of bs-tim-7 (Evaluation Harness + Gates).
Run the Codex sweep first, then request the final architecture review.
Progress Update (from Jerry)
eh-002 Complete ✓
Added scenario catalog schema + 2 representative benchmark scenarios:
New files:
Scenarios added (4 total):
eh-003 Seed (5-scenario target)
Currently at 4 scenarios. Need 1 more to reach the target:
Gate Implications
Validation
Full PR: #96
eh-003 Follow-up Complete (via PR #97)
Building on the scenario catalog schema from #96, added the tool-reliability benchmark scenario:
What was added:
Wired through existing catalog + fixture loading utilities.
Scenario count:
Gate Impact:
5-scenario target now met with 8 total scenarios.
…rios (11 new cases)

- catalog.ts: ScenarioCatalog schema with category/difficulty grouping, CataloguedEvaluationCase type, filterCatalog, validateScenarioMetadata
- fixtures.ts: createDefaultCatalog, loadScenarios*, getCatalogStats CI loaders
- cases/hitl-escalation.ts: escalation-smoke + timeout-handling (ported from jerry/eval-scenario-schema)
- cases/memory-recall.ts: recall-context + path-traversal (ported from jerry/eval-scenario-schema)
- cases/tool-reliability.ts: 6 new deterministic scenarios
  - dispatch-success: happy-path tool invocation
  - dispatch-unknown-tool: graceful structured error on unknown name
  - retry-on-transient-failure: retry succeeds on nth attempt
  - max-retries-exhausted: retry gives up cleanly at limit
  - timeout-abort: AbortSignal propagation cuts execution early
  - result-schema-validation: validates ok/error result shape contracts
- cases/agent-spawning.ts: 5 new deterministic scenarios
  - basic-spawn-and-complete: child agent lifecycle + result accessible to parent
  - depth-limit-enforcement: spawn rejected beyond MAX_SPAWN_DEPTH=3
  - result-routing-to-requester: result delivered to correct session only
  - orphan-cleanup-on-parent-kill: recursive kill propagates to all descendants
  - parallel-completion-ordering: N concurrent agents collect without loss
- catalog.test.ts: 37 tests covering catalog build, filter, validation, per-category smoke runs (all 42 suite tests green)
- index.ts: updated exports for all new symbols

All scenarios: deterministic, no LLM calls, no external services.
Catalog now: 16 cases across 5 categories.
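To make the retry-on-transient-failure and max-retries-exhausted scenarios concrete, here is a minimal sketch of how such a deterministic check could look. The names `ToolResult`, `makeFlakyTool`, and `callWithRetry` are hypothetical and do not come from cases/tool-reliability.ts. The ok/error shape is only an assumption based on the "ok/error result shape contracts" wording above. The sketch is synchronous for brevity.

```typescript
// Hypothetical result shape (assumption): the commit only says results follow an ok/error contract.
type ToolResult<T> =
  | { ok: true; value: T; attempts: number }
  | { ok: false; error: string; attempts: number };

// Builds a deterministic fake tool that throws on its first `failures` calls, then succeeds.
function makeFlakyTool(failures: number): () => string {
  let calls = 0;
  return () => {
    calls += 1;
    if (calls <= failures) {
      throw new Error(`transient failure #${calls}`);
    }
    return "done";
  };
}

// Calls `tool` up to `maxAttempts` times and returns a structured result instead of throwing.
function callWithRetry<T>(tool: () => T, maxAttempts: number): ToolResult<T> {
  let lastError = "no attempts made";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return { ok: true, value: tool(), attempts: attempt };
    } catch (err) {
      lastError = err instanceof Error ? err.message : String(err);
    }
  }
  return { ok: false, error: lastError, attempts: maxAttempts };
}

// Two failures with a limit of three attempts: the third attempt succeeds.
const recovered = callWithRetry(makeFlakyTool(2), 3);
// Five failures with a limit of three attempts: the retry loop gives up cleanly.
const exhausted = callWithRetry(makeFlakyTool(5), 3);
```

Because the fake tool counts its own calls, both outcomes are fully reproducible without LLM calls or external services, which matches the determinism constraint stated in the commit.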
- export-jsonl.ts: writeEvaluationJsonl writes each case as a separate JSON line
- writeEvaluationJsonlSummary writes aggregate metrics in a single line
- computeMetrics calculates per-suite, per-category, per-difficulty pass rates
- Supports append mode for incremental CI reporting
- 7 new tests covering path resolution, append, summary, and metrics
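The JSONL export and pass-rate metrics described above can be illustrated with a small sketch. This is not the code in export-jsonl.ts: `CaseRecord`, `toJsonl`, and `passRateBy` are hypothetical names, the record fields are assumptions, and the pass/fail values are made-up sample data rather than real results. File I/O is left out so the example stays self-contained.

```typescript
// Hypothetical per-case record (assumption): the real result type is not shown in this PR.
interface CaseRecord {
  id: string;
  category: string;
  passed: boolean;
}

// Serializes each record as its own newline-terminated JSON line (JSONL).
function toJsonl(records: CaseRecord[]): string {
  return records.map((r) => JSON.stringify(r) + "\n").join("");
}

// Groups records by a key and computes passed / total for each group.
function passRateBy(
  records: CaseRecord[],
  keyOf: (r: CaseRecord) => string,
): Record<string, number> {
  const buckets = new Map<string, { passed: number; total: number }>();
  for (const r of records) {
    const key = keyOf(r);
    const bucket = buckets.get(key) ?? { passed: 0, total: 0 };
    bucket.total += 1;
    if (r.passed) bucket.passed += 1;
    buckets.set(key, bucket);
  }
  const rates: Record<string, number> = {};
  buckets.forEach((bucket, key) => {
    rates[key] = bucket.passed / bucket.total;
  });
  return rates;
}

// Illustrative sample data only; these outcomes are invented for the example.
const records: CaseRecord[] = [
  { id: "dispatch-success", category: "tool-reliability", passed: true },
  { id: "timeout-abort", category: "tool-reliability", passed: false },
  { id: "recall-context", category: "memory-recall", passed: true },
];

const jsonl = toJsonl(records);
const byCategory = passRateBy(records, (r) => r.category);
```

Terminating every line with a newline is what makes an append mode straightforward: a later CI step can append more lines to the same file without rewriting what is already there.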
…L + memory) (#96)

- Add scenario catalog schema/metadata contracts in catalog.ts
- Add HITL benchmark scenarios (escalation + timeout handling)
- Add Memory path benchmark scenarios (context recall + path traversal)
- Add CI-friendly fixture loading in fixtures.ts
- Add comprehensive validation tests for catalog functionality
- Export all new types and cases from index.ts

Implements eh-002 with seed for eh-003 (5-scenario target)
* feat(evals): add scenario catalog schema + 2 benchmark scenarios (HITL + memory)

  - Add scenario catalog schema/metadata contracts in catalog.ts
  - Add HITL benchmark scenarios (escalation + timeout handling)
  - Add Memory path benchmark scenarios (context recall + path traversal)
  - Add CI-friendly fixture loading in fixtures.ts
  - Add comprehensive validation tests for catalog functionality
  - Export all new types and cases from index.ts

  Implements eh-002 with seed for eh-003 (5-scenario target)

* feat(evals): add tool-reliability benchmark scenario (4 cases)

  Adds the tool-reliability benchmark scenario to reach the 5-scenario target:

  - tool-reliability.dispatch-smoke: basic tool dispatch structure validation
  - tool-reliability.timeout-handling: tool timeout detection and handling
  - tool-reliability.failure-recovery: fallback behavior on tool failure
  - tool-reliability.result-validation: tool result structure validation

  Wired through existing catalog + fixture loading utilities. Updated tests to cover new scenarios.

  eh-003 complete: 5-scenario target now met (was 4, added 4)
Summary
src/evals
What was added
- src/evals/types.ts: core model/runner/report contracts
- src/evals/runner.ts: BasicEvaluationRunner sequential execution and failure capture
- src/evals/report.ts: run id + default <baseDir>/reports/evals/<runId>.json path + file write
- src/evals/sample-case.ts: baseline sample case (sample.echo-smoke)
- src/evals/runner.test.ts
- src/evals/report.test.ts
- src/evals/sample-case.test.ts
- docs/experiments/plans/evaluation-harness-kickoff.md
Validation
- pnpm vitest run src/evals/*.test.ts
- pnpm oxfmt --check src/evals/*.ts docs/experiments/plans/evaluation-harness-kickoff.md
Next steps
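For reference, the runner and report behaviour summarized under "What was added" (sequential execution, failure capture, and the default report path) could be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of src/evals/runner.ts or src/evals/report.ts: `SketchCase`, `SketchResult`, `runSequentially`, and `defaultReportPath` are hypothetical names, the run id value is invented, and the sketch is synchronous for brevity.

```typescript
// Hypothetical case shape (assumption): the real contracts live in src/evals/types.ts.
interface SketchCase {
  id: string;
  run: () => boolean; // returns true on pass; may throw
}

interface SketchResult {
  id: string;
  passed: boolean;
  error?: string;
}

// Runs cases one at a time, in order. A throwing case is recorded as a failure
// instead of aborting the remaining cases.
function runSequentially(cases: SketchCase[]): SketchResult[] {
  const results: SketchResult[] = [];
  for (const c of cases) {
    try {
      results.push({ id: c.id, passed: c.run() });
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      results.push({ id: c.id, passed: false, error: message });
    }
  }
  return results;
}

// Builds the default report location: <baseDir>/reports/evals/<runId>.json
function defaultReportPath(baseDir: string, runId: string): string {
  return `${baseDir}/reports/evals/${runId}.json`;
}

const echo = (input: string): string => input;

const results = runSequentially([
  { id: "sample.echo-smoke", run: () => echo("ping") === "ping" },
  {
    id: "sketch.always-throws", // hypothetical case used to show failure capture
    run: () => {
      throw new Error("boom");
    },
  },
]);

const reportPath = defaultReportPath("/tmp/evals", "run-001"); // hypothetical run id
```

The second case throws, yet the run still returns two results, which is the property a smoke test of the runner interface would want to pin down.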