feat(evals): scenario catalog schema + 2 benchmark scenarios (HITL + memory) by dgarson · Pull Request #96 · dgarson/clawdbot

dgarson · 2026-02-23T08:43:49Z

Summary

Added scenario catalog schema/metadata contracts in catalog.ts
Added 2 representative benchmark scenarios:
- HITL (Human-In-The-Loop): hitl.escalation-smoke, hitl.timeout-handling
- Memory path: memory.recall-context, memory.path-traversal
Added CI-friendly fixture loading in fixtures.ts
Added comprehensive validation tests (18 tests)

What was added

src/evals/catalog.ts: Scenario catalog schema with categories, difficulty tiers, metadata validation
src/evals/cases/hitl-escalation.ts: HITL escalation + timeout handling scenarios
src/evals/cases/memory-recall.ts: Memory context recall + path traversal scenarios
src/evals/fixtures.ts: CI-friendly scenario loading utilities
src/evals/catalog.test.ts: 18 tests for catalog functionality

Validation

pnpm vitest run src/evals/*.test.ts — 23 tests pass
pnpm lint — clean

eh-002 Complete / eh-003 Seed

This completes eh-002 and seeds eh-003:

eh-002 (done): Scenario catalog schema + 2 scenarios (HITL + memory)
eh-003 (seeded): 2 additional scenarios needed for full 5-scenario target

Remaining for 5-scenario target (eh-003):

Tool reliability scenario
Agent spawning scenario

Gate Implications:

Catalog validation (validateCatalog()) can gate CI until all scenarios pass metadata schema
Scenario filtering enables tiered CI: smoke → unit → integration → e2e
getCatalogStats() enables reporting on scenario coverage by category/difficulty

…L + memory) - Add scenario catalog schema/metadata contracts in catalog.ts - Add HITL benchmark scenarios (escalation + timeout handling) - Add Memory path benchmark scenarios (context recall + path traversal) - Add CI-friendly fixture loading in fixtures.ts - Add comprehensive validation tests for catalog functionality - Export all new types and cases from index.ts Implements eh-002 with seed for eh-003 (5-scenario target)

dgarson · 2026-02-23T09:13:03Z

eh-003 Follow-up Complete

Added the tool-reliability benchmark scenario (PR #97) to reach the 5-scenario target:

What was added:

4 new tool-reliability scenarios in src/evals/cases/tool-reliability.ts:
- tool-reliability.dispatch-smoke — basic tool dispatch structure validation
- tool-reliability.timeout-handling — tool timeout detection and handling
- tool-reliability.failure-recovery — fallback behavior on tool failure
- tool-reliability.result-validation — tool result structure validation

Scenario count:

Before	After
4 scenarios (2 HITL + 2 memory)	8 scenarios (+ 4 tool-reliability)

Gate Impact:

loadScenariosByCategory('tool-reliability') returns 4 scenarios
getCatalogStats().total now returns 8
All 27 tests pass (pnpm vitest run src/evals/*.test.ts)

The 5-scenario target is now exceeded with 8 total scenarios.

dgarson · 2026-02-23T14:25:02Z

Architecture Review (Tim)

Target: feat/evaluation-harness ✓ — Correctly targeted.

Content Review:

catalog.ts (193 lines) — Scenario catalog schema with registration and lookup
catalog.test.ts (198 lines) — Comprehensive catalog tests
fixtures.ts (185 lines) — Test fixtures and utilities
hitl-escalation.ts (121 lines) — HITL escalation benchmark scenario
memory-recall.ts (135 lines) — Memory recall benchmark scenario
index.ts — Public exports

Code Quality:

Clean schema definition for eval scenarios
Good fixture isolation for reproducible tests
Both benchmark scenarios have clear success criteria

Verdict: ✅ LGTM — Foundational schema work for the evaluation harness. Will merge once CI passes.

Resolved conflicts in: - src/evals/cases/hitl-escalation.ts - src/evals/cases/memory-recall.ts - src/evals/catalog.test.ts - src/evals/fixtures.ts - src/evals/index.ts Accepted upstream changes which include expanded scenarios for tool-reliability and agent-spawning categories. Also fixed minor lint error in ProviderRoutingPanel.tsx (unused param).

* feat: scaffold evaluation harness foundation * docs: add evaluation harness extension workstream note * UX: add ChannelBroadcastCenter — unified messaging broadcast and channel management (openclaw#324) * feat(evals): catalog schema + tool-reliability & agent-spawning scenarios (11 new cases) - catalog.ts: ScenarioCatalog schema with category/difficulty grouping, CataloguedEvaluationCase type, filterCatalog, validateScenarioMetadata - fixtures.ts: createDefaultCatalog, loadScenarios*, getCatalogStats CI loaders - cases/hitl-escalation.ts: escalation-smoke + timeout-handling (ported from jerry/eval-scenario-schema) - cases/memory-recall.ts: recall-context + path-traversal (ported from jerry/eval-scenario-schema) - cases/tool-reliability.ts: 6 new deterministic scenarios - dispatch-success: happy-path tool invocation - dispatch-unknown-tool: graceful structured error on unknown name - retry-on-transient-failure: retry succeeds on nth attempt - max-retries-exhausted: retry gives up cleanly at limit - timeout-abort: AbortSignal propagation cuts execution early - result-schema-validation: validates ok/error result shape contracts - cases/agent-spawning.ts: 5 new deterministic scenarios - basic-spawn-and-complete: child agent lifecycle + result accessible to parent - depth-limit-enforcement: spawn rejected beyond MAX_SPAWN_DEPTH=3 - result-routing-to-requester: result delivered to correct session only - orphan-cleanup-on-parent-kill: recursive kill propagates to all descendants - parallel-completion-ordering: N concurrent agents collect without loss - catalog.test.ts: 37 tests covering catalog build, filter, validation, per-category smoke runs (all 42 suite tests green) - index.ts: updated exports for all new symbols All scenarios: deterministic, no LLM calls, no external services. Catalog now: 16 cases across 5 categories. * UX: add ProviderRoutingPanel — AI provider routing and failover dashboard (openclaw#325) * feat(evals): add JSONL export adapter for CI integration - export-jsonl.ts: writeEvaluationJsonl writes each case as separate JSON line - writeEvaluationJsonlSummary writes aggregate metrics in single line - computeMetrics calculates per-suite, per-category, per-difficulty pass rates - Supports append mode for incremental CI reporting - 7 new tests covering path resolution, append, summary, and metrics * feat(evaluation-harness): add WORKSTREAM.md and basic benchmark test * feat(evals): add scenario catalog schema + 2 benchmark scenarios (HITL + memory) (#96) - Add scenario catalog schema/metadata contracts in catalog.ts - Add HITL benchmark scenarios (escalation + timeout handling) - Add Memory path benchmark scenarios (context recall + path traversal) - Add CI-friendly fixture loading in fixtures.ts - Add comprehensive validation tests for catalog functionality - Export all new types and cases from index.ts Implements eh-002 with seed for eh-003 (5-scenario target) * feat(evals): add tool-reliability benchmark scenario (4 cases) (#97) * feat(evals): add scenario catalog schema + 2 benchmark scenarios (HITL + memory) - Add scenario catalog schema/metadata contracts in catalog.ts - Add HITL benchmark scenarios (escalation + timeout handling) - Add Memory path benchmark scenarios (context recall + path traversal) - Add CI-friendly fixture loading in fixtures.ts - Add comprehensive validation tests for catalog functionality - Export all new types and cases from index.ts Implements eh-002 with seed for eh-003 (5-scenario target) * feat(evals): add tool-reliability benchmark scenario (4 cases) Adds the tool-reliability benchmark scenario to reach the 5-scenario target: - tool-reliability.dispatch-smoke: basic tool dispatch structure validation - tool-reliability.timeout-handling: tool timeout detection and handling - tool-reliability.failure-recovery: fallback behavior on tool failure - tool-reliability.result-validation: tool result structure validation Wired through existing catalog + fixture loading utilities. Updated tests to cover new scenarios. eh-003 complete: 5-scenario target now met (was 4, added 4) * Delete WORKSTREAM.md

dgarson mentioned this pull request Feb 23, 2026

feat: kick off evaluation harness foundation #93

Merged

dgarson merged commit b6cc776 into feat/evaluation-harness Feb 23, 2026
2 of 9 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(evals): scenario catalog schema + 2 benchmark scenarios (HITL + memory)#96

feat(evals): scenario catalog schema + 2 benchmark scenarios (HITL + memory)#96
dgarson merged 2 commits intofeat/evaluation-harnessfrom
jerry/eval-scenario-schema

dgarson commented Feb 23, 2026

Uh oh!

dgarson commented Feb 23, 2026

Uh oh!

dgarson commented Feb 23, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

dgarson commented Feb 23, 2026

Summary

What was added

Validation

eh-002 Complete / eh-003 Seed

Remaining for 5-scenario target (eh-003):

Gate Implications:

Uh oh!

dgarson commented Feb 23, 2026

eh-003 Follow-up Complete

What was added:

Scenario count:

Gate Impact:

Uh oh!

dgarson commented Feb 23, 2026

Architecture Review (Tim)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant