Skip to content

feat(evals): scenario catalog schema + 2 benchmark scenarios (HITL + memory)#96

Merged
dgarson merged 2 commits intofeat/evaluation-harnessfrom
jerry/eval-scenario-schema
Feb 23, 2026
Merged

feat(evals): scenario catalog schema + 2 benchmark scenarios (HITL + memory)#96
dgarson merged 2 commits intofeat/evaluation-harnessfrom
jerry/eval-scenario-schema

Conversation

@dgarson
Copy link
Copy Markdown
Owner

@dgarson dgarson commented Feb 23, 2026

Summary

  • Added scenario catalog schema/metadata contracts in catalog.ts
  • Added 2 representative benchmark scenarios:
    • HITL (Human-In-The-Loop): hitl.escalation-smoke, hitl.timeout-handling
    • Memory path: memory.recall-context, memory.path-traversal
  • Added CI-friendly fixture loading in fixtures.ts
  • Added comprehensive validation tests (18 tests)

What was added

  • src/evals/catalog.ts: Scenario catalog schema with categories, difficulty tiers, metadata validation
  • src/evals/cases/hitl-escalation.ts: HITL escalation + timeout handling scenarios
  • src/evals/cases/memory-recall.ts: Memory context recall + path traversal scenarios
  • src/evals/fixtures.ts: CI-friendly scenario loading utilities
  • src/evals/catalog.test.ts: 18 tests for catalog functionality

Validation

  • pnpm vitest run src/evals/*.test.ts — 23 tests pass
  • pnpm lint — clean

eh-002 Complete / eh-003 Seed

This completes eh-002 and seeds eh-003:

  • eh-002 (done): Scenario catalog schema + 2 scenarios (HITL + memory)
  • eh-003 (seeded): 2 additional scenarios needed for full 5-scenario target

Remaining for 5-scenario target (eh-003):

  1. Tool reliability scenario
  2. Agent spawning scenario

Gate Implications:

  • Catalog validation (validateCatalog()) can gate CI until all scenarios pass metadata schema
  • Scenario filtering enables tiered CI: smoke → unit → integration → e2e
  • getCatalogStats() enables reporting on scenario coverage by category/difficulty

…L + memory)

- Add scenario catalog schema/metadata contracts in catalog.ts
- Add HITL benchmark scenarios (escalation + timeout handling)
- Add Memory path benchmark scenarios (context recall + path traversal)
- Add CI-friendly fixture loading in fixtures.ts
- Add comprehensive validation tests for catalog functionality
- Export all new types and cases from index.ts

Implements eh-002 with seed for eh-003 (5-scenario target)
@dgarson
Copy link
Copy Markdown
Owner Author

dgarson commented Feb 23, 2026

eh-003 Follow-up Complete

Added the tool-reliability benchmark scenario (PR #97) to reach the 5-scenario target:

What was added:

  • 4 new tool-reliability scenarios in src/evals/cases/tool-reliability.ts:
    • tool-reliability.dispatch-smoke — basic tool dispatch structure validation
    • tool-reliability.timeout-handling — tool timeout detection and handling
    • tool-reliability.failure-recovery — fallback behavior on tool failure
    • tool-reliability.result-validation — tool result structure validation

Scenario count:

Before After
4 scenarios (2 HITL + 2 memory) 8 scenarios (+ 4 tool-reliability)

Gate Impact:

  • loadScenariosByCategory('tool-reliability') returns 4 scenarios
  • getCatalogStats().total now returns 8
  • All 27 tests pass (pnpm vitest run src/evals/*.test.ts)

The 5-scenario target is now exceeded with 8 total scenarios.

@dgarson
Copy link
Copy Markdown
Owner Author

dgarson commented Feb 23, 2026

Architecture Review (Tim)

Target: feat/evaluation-harness ✓ — Correctly targeted.

Content Review:

  • catalog.ts (193 lines) — Scenario catalog schema with registration and lookup
  • catalog.test.ts (198 lines) — Comprehensive catalog tests
  • fixtures.ts (185 lines) — Test fixtures and utilities
  • hitl-escalation.ts (121 lines) — HITL escalation benchmark scenario
  • memory-recall.ts (135 lines) — Memory recall benchmark scenario
  • index.ts — Public exports

Code Quality:

  • Clean schema definition for eval scenarios
  • Good fixture isolation for reproducible tests
  • Both benchmark scenarios have clear success criteria

Verdict:LGTM — Foundational schema work for the evaluation harness. Will merge once CI passes.

Resolved conflicts in:
- src/evals/cases/hitl-escalation.ts
- src/evals/cases/memory-recall.ts
- src/evals/catalog.test.ts
- src/evals/fixtures.ts
- src/evals/index.ts

Accepted upstream changes which include expanded scenarios for
tool-reliability and agent-spawning categories.

Also fixed minor lint error in ProviderRoutingPanel.tsx (unused param).
@dgarson dgarson merged commit b6cc776 into feat/evaluation-harness Feb 23, 2026
2 of 9 checks passed
dgarson added a commit that referenced this pull request Feb 24, 2026
* feat: scaffold evaluation harness foundation

* docs: add evaluation harness extension workstream note

* UX: add ChannelBroadcastCenter — unified messaging broadcast and channel management (openclaw#324)

* feat(evals): catalog schema + tool-reliability & agent-spawning scenarios (11 new cases)

- catalog.ts: ScenarioCatalog schema with category/difficulty grouping,
  CataloguedEvaluationCase type, filterCatalog, validateScenarioMetadata
- fixtures.ts: createDefaultCatalog, loadScenarios*, getCatalogStats CI loaders
- cases/hitl-escalation.ts: escalation-smoke + timeout-handling (ported from jerry/eval-scenario-schema)
- cases/memory-recall.ts: recall-context + path-traversal (ported from jerry/eval-scenario-schema)
- cases/tool-reliability.ts: 6 new deterministic scenarios
  - dispatch-success: happy-path tool invocation
  - dispatch-unknown-tool: graceful structured error on unknown name
  - retry-on-transient-failure: retry succeeds on nth attempt
  - max-retries-exhausted: retry gives up cleanly at limit
  - timeout-abort: AbortSignal propagation cuts execution early
  - result-schema-validation: validates ok/error result shape contracts
- cases/agent-spawning.ts: 5 new deterministic scenarios
  - basic-spawn-and-complete: child agent lifecycle + result accessible to parent
  - depth-limit-enforcement: spawn rejected beyond MAX_SPAWN_DEPTH=3
  - result-routing-to-requester: result delivered to correct session only
  - orphan-cleanup-on-parent-kill: recursive kill propagates to all descendants
  - parallel-completion-ordering: N concurrent agents collect without loss
- catalog.test.ts: 37 tests covering catalog build, filter, validation,
  per-category smoke runs (all 42 suite tests green)
- index.ts: updated exports for all new symbols

All scenarios: deterministic, no LLM calls, no external services.
Catalog now: 16 cases across 5 categories.

* UX: add ProviderRoutingPanel — AI provider routing and failover dashboard (openclaw#325)

* feat(evals): add JSONL export adapter for CI integration

- export-jsonl.ts: writeEvaluationJsonl writes each case as separate JSON line
- writeEvaluationJsonlSummary writes aggregate metrics in single line
- computeMetrics calculates per-suite, per-category, per-difficulty pass rates
- Supports append mode for incremental CI reporting
- 7 new tests covering path resolution, append, summary, and metrics

* feat(evaluation-harness): add WORKSTREAM.md and basic benchmark test

* feat(evals): add scenario catalog schema + 2 benchmark scenarios (HITL + memory) (#96)

- Add scenario catalog schema/metadata contracts in catalog.ts
- Add HITL benchmark scenarios (escalation + timeout handling)
- Add Memory path benchmark scenarios (context recall + path traversal)
- Add CI-friendly fixture loading in fixtures.ts
- Add comprehensive validation tests for catalog functionality
- Export all new types and cases from index.ts

Implements eh-002 with seed for eh-003 (5-scenario target)

* feat(evals): add tool-reliability benchmark scenario (4 cases) (#97)

* feat(evals): add scenario catalog schema + 2 benchmark scenarios (HITL + memory)

- Add scenario catalog schema/metadata contracts in catalog.ts
- Add HITL benchmark scenarios (escalation + timeout handling)
- Add Memory path benchmark scenarios (context recall + path traversal)
- Add CI-friendly fixture loading in fixtures.ts
- Add comprehensive validation tests for catalog functionality
- Export all new types and cases from index.ts

Implements eh-002 with seed for eh-003 (5-scenario target)

* feat(evals): add tool-reliability benchmark scenario (4 cases)

Adds the tool-reliability benchmark scenario to reach the 5-scenario target:
- tool-reliability.dispatch-smoke: basic tool dispatch structure validation
- tool-reliability.timeout-handling: tool timeout detection and handling
- tool-reliability.failure-recovery: fallback behavior on tool failure
- tool-reliability.result-validation: tool result structure validation

Wired through existing catalog + fixture loading utilities.
Updated tests to cover new scenarios.

eh-003 complete: 5-scenario target now met (was 4, added 4)

* Delete WORKSTREAM.md
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant