feat: kick off evaluation harness foundation by dgarson · Pull Request #93 · dgarson/clawdbot

dgarson · 2026-02-23T07:39:44Z

Summary

- Scaffolded additive evaluation harness primitives under src/evals
- Added evaluation case model + runner interface + basic report output path writer
- Added initial harness tests and one sample smoke evaluation case
- Added a workstream plan note documenting how to extend suites

What was added

src/evals/types.ts: core model/runner/report contracts
src/evals/runner.ts: BasicEvaluationRunner sequential execution and failure capture
src/evals/report.ts: run id + default <baseDir>/reports/evals/<runId>.json path + file write
src/evals/sample-case.ts: baseline sample case (sample.echo-smoke)
tests:
- src/evals/runner.test.ts
- src/evals/report.test.ts
- src/evals/sample-case.test.ts
docs:
- docs/experiments/plans/evaluation-harness-kickoff.md
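
As a rough illustration of the "sequential execution and failure capture" behavior described for BasicEvaluationRunner, the sketch below runs cases in order and records a thrown error as a failed result rather than aborting the run. The interface names and fields (`EvaluationCase`, `EvaluationCaseResult`, `runSequentially`) are assumptions made for this example; the actual contracts in src/evals/types.ts and src/evals/runner.ts may look different.

```typescript
// Hypothetical shapes for illustration only; the real contracts live in
// src/evals/types.ts and may differ.
interface EvaluationCase {
  id: string;
  run: () => Promise<{ pass: boolean; details?: string }>;
}

interface EvaluationCaseResult {
  id: string;
  pass: boolean;
  details?: string;
  error?: string; // set only when the case threw
  durationMs: number;
}

// Runs cases one at a time, in input order. A thrown error is captured as a
// failed result so the remaining cases still execute.
async function runSequentially(
  cases: readonly EvaluationCase[],
): Promise<EvaluationCaseResult[]> {
  const results: EvaluationCaseResult[] = [];
  for (const evalCase of cases) {
    const startedAt = Date.now();
    try {
      const outcome = await evalCase.run();
      results.push({ id: evalCase.id, ...outcome, durationMs: Date.now() - startedAt });
    } catch (err) {
      results.push({
        id: evalCase.id,
        pass: false,
        error: err instanceof Error ? err.message : String(err),
        durationMs: Date.now() - startedAt,
      });
    }
  }
  return results;
}

// Stand-in for the sample.echo-smoke case: passes when the echo matches.
const echoSmoke: EvaluationCase = {
  id: "sample.echo-smoke",
  run: async () => {
    const input = "ping";
    const echoed = input; // a real case would call the system under test here
    return { pass: echoed === input };
  },
};
```

Capturing errors per case keeps one broken scenario from hiding the results of the others, which is usually what a report writer downstream wants.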

Validation

pnpm vitest run src/evals/*.test.ts
pnpm oxfmt --check src/evals/*.ts docs/experiments/plans/evaluation-harness-kickoff.md

Next steps

- Wire a CLI command to run suites by suite/tag/case id
- Add per-suite aggregate metrics and CI-friendly output formats (junit/jsonl)
- Add bounded-concurrency execution mode for non-order-dependent suites
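
One possible shape for the bounded-concurrency mode is a small worker-lane helper that keeps at most `limit` cases in flight while preserving result order. This is only a sketch of the general technique; `mapWithConcurrency` is a hypothetical name and nothing in this PR implements it yet.

```typescript
// Hypothetical helper, not part of this PR: maps items through an async
// worker with at most `limit` calls in flight. Results keep input order.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results = new Array<R>(items.length);
  let next = 0; // index of the next unclaimed item

  // Each lane repeatedly claims the next index until none remain.
  async function lane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index] as T, index);
    }
  }

  const laneCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results;
}
```

If a worker rejects, the returned promise rejects while other lanes keep draining, so an evaluation runner built on this would likely want each worker to capture its own failure as a result instead of throwing.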

dgarson · 2026-02-23T08:29:11Z

Routing: Tim → Xavier → merge

PR #93 — Evaluation harness foundation (feat/evaluation-harness → dgarson/fork). This targets dgarson/fork directly.

Tim: architecture pass first — is the EvalCase model + runner interface the right shape for the harness we need? Check test coverage (does the sample smoke eval actually validate the runner interface?). If architecture looks right, escalate to Xavier for final merge approval. This is Phase 1 of bs-tim-7 (Evaluation Harness + Gates).

dgarson · 2026-02-23T08:34:02Z

Run Codex sweep first, then request final architecture review.

dgarson · 2026-02-23T08:44:01Z

Progress Update (from Jerry)

eh-002 Complete ✓

Added scenario catalog schema + 2 representative benchmark scenario groups (HITL and memory, 2 scenarios each):

New files:

src/evals/catalog.ts — Scenario catalog schema/metadata contracts
src/evals/cases/hitl-escalation.ts — HITL escalation + timeout scenarios
src/evals/cases/memory-recall.ts — Memory recall + path traversal scenarios
src/evals/fixtures.ts — CI-friendly fixture loading
src/evals/catalog.test.ts — 18 validation tests

Scenarios added (4 total):

| ID | Category | Difficulty |
| --- | --- | --- |
| `hitl.escalation-smoke` | hitl | integration |
| `hitl.timeout-handling` | hitl | integration |
| `memory.recall-context` | memory | integration |
| `memory.path-traversal` | memory | integration |

eh-003 Seed (5-scenario target)

Currently at 4 scenarios. Need 1 more for target:

✓ HITL (2 scenarios)
✓ Memory (2 scenarios)
⬜ Tool reliability (1 needed)
⬜ Agent spawning (optional)

Gate Implications

validateCatalog() — validates all scenarios have required metadata, can gate CI
loadScenariosByCategory/difficulty/suite — enables targeted CI runs (smoke → unit → integration → e2e)
getCatalogStats() — enables coverage reporting by category/difficulty
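
To make the gating idea concrete, here is a minimal sketch of a metadata check that returns a list of problems, plus a category filter. The `ScenarioMetadata` shape, the set of allowed difficulty values, and the function names `validateCatalogSketch` and `scenariosByCategory` are assumptions for this example; the real validateCatalog() and loaders in src/evals may use different signatures.

```typescript
// Hypothetical metadata shape; the real contract lives in src/evals/catalog.ts.
interface ScenarioMetadata {
  id: string;
  category: string;
  difficulty: string;
}

// Assumed difficulty tiers, taken from the smoke/unit/integration/e2e ordering.
const KNOWN_DIFFICULTIES = new Set<string>(["smoke", "unit", "integration", "e2e"]);

// Returns human-readable problems; an empty list means the catalog is valid.
function validateCatalogSketch(scenarios: readonly ScenarioMetadata[]): string[] {
  const problems: string[] = [];
  const seenIds = new Set<string>();
  for (const s of scenarios) {
    if (s.id.trim() === "") problems.push("scenario with empty id");
    if (seenIds.has(s.id)) problems.push(`duplicate id: ${s.id}`);
    seenIds.add(s.id);
    if (s.category.trim() === "") problems.push(`${s.id}: missing category`);
    if (!KNOWN_DIFFICULTIES.has(s.difficulty)) {
      problems.push(`${s.id}: unknown difficulty "${s.difficulty}"`);
    }
  }
  return problems;
}

// Selects the scenarios for a targeted run of one category.
function scenariosByCategory(
  scenarios: readonly ScenarioMetadata[],
  category: string,
): ScenarioMetadata[] {
  return scenarios.filter((s) => s.category === category);
}

// The four scenarios listed in this update.
const catalog: ScenarioMetadata[] = [
  { id: "hitl.escalation-smoke", category: "hitl", difficulty: "integration" },
  { id: "hitl.timeout-handling", category: "hitl", difficulty: "integration" },
  { id: "memory.recall-context", category: "memory", difficulty: "integration" },
  { id: "memory.path-traversal", category: "memory", difficulty: "integration" },
];

// A CI gate could fail the job whenever any problem is reported.
const catalogProblems = validateCatalogSketch(catalog);
if (catalogProblems.length > 0) {
  throw new Error(`catalog invalid:\n${catalogProblems.join("\n")}`);
}
```

Returning the full problem list, rather than throwing on the first issue, lets a CI log show every metadata gap in one pass.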

Validation

pnpm vitest run src/evals/*.test.ts — 23 tests pass
pnpm lint — clean

Full PR: #96

dgarson · 2026-02-23T09:13:12Z

eh-003 Follow-up Complete (via PR #97)

Building on the scenario catalog schema from #96, added the tool-reliability benchmark scenario:

What was added:

4 new tool-reliability scenarios in src/evals/cases/tool-reliability.ts:
- tool-reliability.dispatch-smoke — basic tool dispatch structure validation
- tool-reliability.timeout-handling — tool timeout detection and handling
- tool-reliability.failure-recovery — fallback behavior on tool failure
- tool-reliability.result-validation — tool result structure validation

Wired through existing catalog + fixture loading utilities (fixtures.ts).

Scenario count:

Before: 4 scenarios (seeded from #96, "feat(evals): scenario catalog schema + 2 benchmark scenarios (HITL + memory)")
After: 8 scenarios (+ 4 tool-reliability)

Gate Impact:

loadScenariosByCategory('tool-reliability') now returns 4 scenarios
getCatalogStats().byCategory['tool-reliability'] returns 4
Catalog validation (validateCatalog()) gates CI until all scenarios pass metadata schema
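
The per-category counts above can be reproduced with a simple tally. The sketch below assumes each catalogued case carries an explicit `category` field; `countByCategory` and the `CataloguedCaseRef` shape are hypothetical and only approximate what getCatalogStats() might compute.

```typescript
// Hypothetical minimal view of a catalogued case.
interface CataloguedCaseRef {
  id: string;
  category: string;
}

// Tallies how many cases fall into each category.
function countByCategory(cases: readonly CataloguedCaseRef[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const c of cases) {
    counts[c.category] = (counts[c.category] ?? 0) + 1;
  }
  return counts;
}

// The 8 scenario ids named in this thread.
const allCases: CataloguedCaseRef[] = [
  { id: "hitl.escalation-smoke", category: "hitl" },
  { id: "hitl.timeout-handling", category: "hitl" },
  { id: "memory.recall-context", category: "memory" },
  { id: "memory.path-traversal", category: "memory" },
  { id: "tool-reliability.dispatch-smoke", category: "tool-reliability" },
  { id: "tool-reliability.timeout-handling", category: "tool-reliability" },
  { id: "tool-reliability.failure-recovery", category: "tool-reliability" },
  { id: "tool-reliability.result-validation", category: "tool-reliability" },
];

const statsSketch = {
  total: allCases.length,
  byCategory: countByCategory(allCases),
};
// statsSketch.byCategory["tool-reliability"] is 4; hitl and memory are 2 each.
```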

5-scenario target now met with 8 total scenarios.

…nel management (openclaw#324)

…rios (11 new cases)

- catalog.ts: ScenarioCatalog schema with category/difficulty grouping, CataloguedEvaluationCase type, filterCatalog, validateScenarioMetadata
- fixtures.ts: createDefaultCatalog, loadScenarios*, getCatalogStats CI loaders
- cases/hitl-escalation.ts: escalation-smoke + timeout-handling (ported from jerry/eval-scenario-schema)
- cases/memory-recall.ts: recall-context + path-traversal (ported from jerry/eval-scenario-schema)
- cases/tool-reliability.ts: 6 new deterministic scenarios
  - dispatch-success: happy-path tool invocation
  - dispatch-unknown-tool: graceful structured error on unknown name
  - retry-on-transient-failure: retry succeeds on nth attempt
  - max-retries-exhausted: retry gives up cleanly at limit
  - timeout-abort: AbortSignal propagation cuts execution early
  - result-schema-validation: validates ok/error result shape contracts
- cases/agent-spawning.ts: 5 new deterministic scenarios
  - basic-spawn-and-complete: child agent lifecycle + result accessible to parent
  - depth-limit-enforcement: spawn rejected beyond MAX_SPAWN_DEPTH=3
  - result-routing-to-requester: result delivered to correct session only
  - orphan-cleanup-on-parent-kill: recursive kill propagates to all descendants
  - parallel-completion-ordering: N concurrent agents collect without loss
- catalog.test.ts: 37 tests covering catalog build, filter, validation, per-category smoke runs (all 42 suite tests green)
- index.ts: updated exports for all new symbols

All scenarios: deterministic, no LLM calls, no external services.
Catalog now: 16 cases across 5 categories.
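
For a sense of what a deterministic retry scenario (retry-on-transient-failure, max-retries-exhausted) might exercise, the sketch below pairs a retry wrapper with a fake tool that fails a fixed number of times. The `ToolResult` shape, `callWithRetry`, and `flakyTool` are assumptions for illustration and are not taken from cases/tool-reliability.ts.

```typescript
// Hypothetical ok/error result shape for a tool call.
type ToolResult<T> =
  | { ok: true; value: T; attempts: number }
  | { ok: false; error: string; attempts: number };

// Calls the tool up to maxAttempts times and returns a structured result
// instead of throwing when every attempt fails.
async function callWithRetry<T>(
  tool: () => Promise<T>,
  maxAttempts: number,
): Promise<ToolResult<T>> {
  let lastError = "no attempts made";
  let attempts = 0;
  while (attempts < maxAttempts) {
    attempts++;
    try {
      return { ok: true, value: await tool(), attempts };
    } catch (err) {
      lastError = err instanceof Error ? err.message : String(err);
    }
  }
  return { ok: false, error: lastError, attempts };
}

// Deterministic fake: fails a fixed number of times, then succeeds.
// No timers, randomness, LLM calls, or external services are involved.
function flakyTool(failuresBeforeSuccess: number): () => Promise<string> {
  let calls = 0;
  return async () => {
    calls++;
    if (calls <= failuresBeforeSuccess) {
      throw new Error(`transient failure #${calls}`);
    }
    return "done";
  };
}
```

With `flakyTool(2)`, a limit of 3 attempts succeeds on the third call, while a limit of 2 returns an `ok: false` result, which covers both retry scenarios with the same fake.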

…oard (openclaw#325)

- export-jsonl.ts: writeEvaluationJsonl writes each case as separate JSON line
- writeEvaluationJsonlSummary writes aggregate metrics in single line
- computeMetrics calculates per-suite, per-category, per-difficulty pass rates
- Supports append mode for incremental CI reporting
- 7 new tests covering path resolution, append, summary, and metrics
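
The JSONL export and pass-rate metrics can be pictured roughly as follows. This sketch only builds the strings and numbers in memory; the `CaseRecord` shape and the helpers `toJsonl`, `passRates`, and `summaryLine` are hypothetical, and the outcomes in the sample data are made up for the example rather than taken from a real run.

```typescript
// Hypothetical per-case record; the real result type may carry more fields.
interface CaseRecord {
  id: string;
  category: string;
  pass: boolean;
}

// One JSON document per line. Every line ends with "\n", so a later batch
// can be appended to the same file without merging two records.
function toJsonl(records: readonly CaseRecord[]): string {
  return records.map((r) => JSON.stringify(r) + "\n").join("");
}

// Pass rate (0..1) per group, where the group key is derived from each record.
function passRates(
  records: readonly CaseRecord[],
  keyOf: (r: CaseRecord) => string,
): Record<string, number> {
  const totals: Record<string, { passed: number; total: number }> = {};
  for (const r of records) {
    const key = keyOf(r);
    const bucket = totals[key] ?? (totals[key] = { passed: 0, total: 0 });
    bucket.total++;
    if (r.pass) bucket.passed++;
  }
  const rates: Record<string, number> = {};
  for (const key of Object.keys(totals)) {
    const t = totals[key];
    if (t) rates[key] = t.passed / t.total;
  }
  return rates;
}

// Aggregate metrics serialized as a single JSON line.
function summaryLine(records: readonly CaseRecord[]): string {
  const passed = records.filter((r) => r.pass).length;
  return JSON.stringify({
    total: records.length,
    passed,
    byCategory: passRates(records, (r) => r.category),
  }) + "\n";
}

// Illustrative data only: these pass/fail outcomes are invented.
const sampleRecords: CaseRecord[] = [
  { id: "hitl.escalation-smoke", category: "hitl", pass: true },
  { id: "hitl.timeout-handling", category: "hitl", pass: false },
  { id: "memory.recall-context", category: "memory", pass: true },
];
```

In Node, append mode would then amount to something like `fs.appendFileSync(path, toJsonl(batch))`, since each batch is already newline-terminated.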

…L + memory) (#96)

- Add scenario catalog schema/metadata contracts in catalog.ts
- Add HITL benchmark scenarios (escalation + timeout handling)
- Add Memory path benchmark scenarios (context recall + path traversal)
- Add CI-friendly fixture loading in fixtures.ts
- Add comprehensive validation tests for catalog functionality
- Export all new types and cases from index.ts

Implements eh-002 with seed for eh-003 (5-scenario target)

* feat(evals): add scenario catalog schema + 2 benchmark scenarios (HITL + memory)

  - Add scenario catalog schema/metadata contracts in catalog.ts
  - Add HITL benchmark scenarios (escalation + timeout handling)
  - Add Memory path benchmark scenarios (context recall + path traversal)
  - Add CI-friendly fixture loading in fixtures.ts
  - Add comprehensive validation tests for catalog functionality
  - Export all new types and cases from index.ts

  Implements eh-002 with seed for eh-003 (5-scenario target)

* feat(evals): add tool-reliability benchmark scenario (4 cases)

  Adds the tool-reliability benchmark scenario to reach the 5-scenario target:

  - tool-reliability.dispatch-smoke: basic tool dispatch structure validation
  - tool-reliability.timeout-handling: tool timeout detection and handling
  - tool-reliability.failure-recovery: fallback behavior on tool failure
  - tool-reliability.result-validation: tool result structure validation

  Wired through existing catalog + fixture loading utilities. Updated tests to cover new scenarios.

  eh-003 complete: 5-scenario target now met (was 4, added 4)

dgarson added 2 commits February 23, 2026 00:39

feat: scaffold evaluation harness foundation

78d1a76

docs: add evaluation harness extension workstream note

45c9f8e

dgarson added 9 commits February 23, 2026 03:07

UX: add ChannelBroadcastCenter — unified messaging broadcast and chan…

e0d71f7

…nel management (openclaw#324)

UX: add ProviderRoutingPanel — AI provider routing and failover dashb…

f339bba

…oard (openclaw#325)

feat(evaluation-harness): add WORKSTREAM.md and basic benchmark test

c5efa2c

Delete WORKSTREAM.md

b3f7cc2

Merge branch 'dgarson/fork' into feat/evaluation-harness

8809b49

dgarson merged commit 8be0d62 into dgarson/fork Feb 24, 2026
2 of 9 checks passed

dgarson merged 11 commits into dgarson/fork from feat/evaluation-harness
