feat(tools): add structured tool error taxonomy with retry engine #2214

Merged
bug-ops merged 1 commit into main from tool-error-taxonomy
Mar 27, 2026

bug-ops commented Mar 27, 2026

Summary

  • Implements 12-category ToolErrorCategory enum in zeph-tools (arXiv:2601.16280) with category-specific retry/fallback strategies
  • Guarantees every tool_call_id always receives a corresponding tool_result, even on failure (arXiv:2509.25370)
  • Gates attempt_self_reflection on quality failures only — infrastructure errors (network, server, rate-limit) no longer trigger it
  • Adds [tools.retry] config section wired to runtime with config migration from old [agent].max_tool_retries

Closes #2203, #2199
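
The tool_result guarantee in the summary can be illustrated with a minimal sketch. `ToolResult`, `pair_results`, and the `[tool_error]` message text below are hypothetical stand-ins, not the zeph-core types; the only point is that every requested `tool_call_id` comes back with exactly one result, including calls that failed or produced nothing.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified result type (not the real zeph-core struct).
#[derive(Debug, Clone, PartialEq)]
pub struct ToolResult {
    pub tool_call_id: String,
    pub content: String,
    pub is_error: bool,
}

/// Pair every requested call ID with exactly one result, in request order.
/// A failed call, or a call with no recorded outcome, still yields an
/// error-flagged result instead of being dropped.
pub fn pair_results(
    call_ids: &[String],
    mut outcomes: HashMap<String, Result<String, String>>,
) -> Vec<ToolResult> {
    call_ids
        .iter()
        .map(|id| match outcomes.remove(id) {
            Some(Ok(output)) => ToolResult {
                tool_call_id: id.clone(),
                content: output,
                is_error: false,
            },
            Some(Err(message)) => ToolResult {
                tool_call_id: id.clone(),
                content: format!("[tool_error] {message}"),
                is_error: true,
            },
            // Assumed placeholder text for a call that never reported back.
            None => ToolResult {
                tool_call_id: id.clone(),
                content: "[tool_error] no result produced".to_string(),
                is_error: true,
            },
        })
        .collect()
}
```

Building the output from the list of requested IDs, rather than from the outcomes that happen to exist, is what makes the pairing total in this sketch.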

Changes

  • crates/zeph-tools/src/error_taxonomy.rs (new): ToolErrorCategory, ToolErrorFeedback, classify_http_status, classify_io_error
  • crates/zeph-tools/src/executor.rs: ToolError::Http variant, retry engine integration
  • crates/zeph-tools/src/config.rs: RetryConfig struct with [tools.retry] section
  • crates/zeph-core/src/agent/session_config.rs: reads config.tools.retry.* (was config.agent.max_tool_retries)
  • crates/zeph-core/src/agent/tool_orchestrator.rs: retry_base_ms/retry_max_ms fields, with_retry_backoff() builder
  • crates/zeph-core/src/agent/tool_execution/native.rs: structured [tool_error] feedback, is_quality_failure() gate
  • crates/zeph-config/src/migrate.rs: migrate_agent_retry_to_tools_retry() migration step
  • src/init.rs: --init wizard steps for new retry config fields
  • config/default.toml: [tools.retry] defaults documented
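
As a rough illustration of how the `[tools.retry]` fields could drive the backoff, here is a sketch. The field names come from this PR; the field types, the default values, and the capped-exponential formula are assumptions, and the real `retry_backoff_ms()` may differ (for example by adding jitter).

```rust
/// Hypothetical mirror of the `[tools.retry]` section.
/// Field names are from the PR; types and defaults are assumed.
#[derive(Debug, Clone, PartialEq)]
pub struct RetryConfig {
    pub max_attempts: u32,
    pub base_ms: u64,
    pub max_ms: u64,
    pub budget_secs: u64,
    pub parameter_reformat_provider: Option<String>,
}

impl Default for RetryConfig {
    fn default() -> Self {
        // Placeholder values for illustration only, not the shipped defaults.
        Self {
            max_attempts: 2,
            base_ms: 500,
            max_ms: 5_000,
            budget_secs: 30,
            parameter_reformat_provider: None,
        }
    }
}

/// Capped exponential backoff: base_ms * 2^attempt, never above max_ms.
/// `attempt` is zero-based; arithmetic saturates instead of overflowing.
pub fn retry_backoff_ms(attempt: u32, base_ms: u64, max_ms: u64) -> u64 {
    let factor = 1u64.checked_shl(attempt).unwrap_or(u64::MAX);
    base_ms.saturating_mul(factor).min(max_ms)
}
```

With the placeholder defaults above, attempts 0, 1, 2, 3 wait 500, 1000, 2000, and 4000 ms, and every later attempt is capped at 5000 ms.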

Test plan

  • 6147/6147 tests pass (cargo nextest run --workspace --lib --bins)
  • cargo clippy --workspace -- -D warnings clean
  • cargo +nightly fmt --check clean
  • HTTP 400/422 → InvalidParameters regression test present
  • io::ErrorKind::NotFound → PermanentFailure (not ToolNotFound) regression test present
  • Infrastructure errors do not trigger attempt_self_reflection regression test present
  • Config migration: [agent].max_tool_retries → [tools.retry].max_attempts
  • Live session test with fetch error to confirm no 400 cascade (verify that the fix for #2197, "bug(tools): parallel tool call with permanent error drops tool_result, causes 400 Bad Request on next LLM turn", still holds)
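
The two classification regressions in the test plan can be sketched as follows. Only the HTTP 400/422 → InvalidParameters and io NotFound → PermanentFailure mappings are stated in this PR; every other arm is an assumption added so the sketch compiles into a total function, and the function bodies are not the zeph-tools implementation.

```rust
use std::io;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ToolErrorCategory {
    ToolNotFound,
    SchemaInvalid,
    InvalidParameters,
    TypeMismatch,
    PolicyBlocked,
    ConfirmationRequired,
    PermanentFailure,
    Cancelled,
    RateLimited,
    ServerError,
    NetworkError,
    Timeout,
}

pub fn classify_http_status(status: u16) -> ToolErrorCategory {
    match status {
        // Stated in the PR: bad request bodies are a parameter problem.
        400 | 422 => ToolErrorCategory::InvalidParameters,
        // Everything below is an assumed mapping for illustration.
        401 | 403 => ToolErrorCategory::PolicyBlocked,
        408 => ToolErrorCategory::Timeout,
        429 => ToolErrorCategory::RateLimited,
        500..=599 => ToolErrorCategory::ServerError,
        _ => ToolErrorCategory::PermanentFailure,
    }
}

pub fn classify_io_error(err: &io::Error) -> ToolErrorCategory {
    match err.kind() {
        // Stated in the PR: a missing file is not a missing tool.
        io::ErrorKind::NotFound => ToolErrorCategory::PermanentFailure,
        // Assumed mappings for illustration.
        io::ErrorKind::PermissionDenied => ToolErrorCategory::PolicyBlocked,
        io::ErrorKind::TimedOut => ToolErrorCategory::Timeout,
        io::ErrorKind::ConnectionRefused
        | io::ErrorKind::ConnectionReset
        | io::ErrorKind::ConnectionAborted => ToolErrorCategory::NetworkError,
        _ => ToolErrorCategory::PermanentFailure,
    }
}
```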

github-actions bot added the documentation, rust, enhancement, core, config, and size/XL labels Mar 27, 2026
bug-ops force-pushed the tool-error-taxonomy branch from 8d07f03 to 498aa07 March 27, 2026 08:08
bug-ops enabled auto-merge (squash) March 27, 2026 08:10
, #2199)

Implement 12-category ToolErrorCategory enum in zeph-tools based on arXiv:2601.16280,
with category-specific retry/fallback strategies and guaranteed tool_result delivery
for every tool_call_id regardless of failure mode (arXiv:2509.25370).

Key changes:
- Add ToolErrorCategory enum (12 variants: ToolNotFound, SchemaInvalid,
  InvalidParameters, TypeMismatch, PolicyBlocked, ConfirmationRequired,
  PermanentFailure, Cancelled, RateLimited, ServerError, NetworkError, Timeout)
- Add ToolErrorFeedback struct with structured category, message, and suggestion
- Add ToolError::Http { status, message } variant for fine-grained HTTP classification
- Add [tools.retry] config section: max_attempts, base_ms, max_ms, budget_secs,
  parameter_reformat_provider (follows *_provider multi-model convention)
- Wire RetryConfig to runtime: session_config reads config.tools.retry.*,
  retry_backoff_ms() parameterized with base_ms/max_ms from config
- Gate attempt_self_reflection on is_quality_failure() — infrastructure errors
  (network, server, rate-limit) no longer trigger self-reflection
- Config migration: migrate old [agent].max_tool_retries to [tools.retry]
- --init wizard: new steps for retry_max_attempts and parameter_reformat_provider
- TUI spinner: emit "Retrying tool: <name> (attempt N/M)..." status during retries

Fixes: B2 (io::NotFound → PermanentFailure, not ToolNotFound)
       B4 (self-reflection gate with regression test)
       HTTP 400/422 → InvalidParameters classification
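
A minimal sketch of the taxonomy and the self-reflection gate described above. The variant names are the ones listed in this commit message. Which categories count as quality failures is an assumption here (the commit only states that network, server, and rate-limit errors do not), and `should_attempt_self_reflection` is a hypothetical helper name.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ToolErrorCategory {
    ToolNotFound,
    SchemaInvalid,
    InvalidParameters,
    TypeMismatch,
    PolicyBlocked,
    ConfirmationRequired,
    PermanentFailure,
    Cancelled,
    RateLimited,
    ServerError,
    NetworkError,
    Timeout,
}

impl ToolErrorCategory {
    /// The three infrastructure categories named in the commit message.
    pub fn is_infrastructure(self) -> bool {
        matches!(
            self,
            Self::RateLimited | Self::ServerError | Self::NetworkError
        )
    }

    /// Assumed definition: failures caused by how the model formed the call.
    /// The real is_quality_failure() may draw the line differently.
    pub fn is_quality_failure(self) -> bool {
        matches!(
            self,
            Self::ToolNotFound
                | Self::SchemaInvalid
                | Self::InvalidParameters
                | Self::TypeMismatch
        )
    }
}

/// Hypothetical gate: reflect only when the model could plausibly do better.
pub fn should_attempt_self_reflection(category: ToolErrorCategory) -> bool {
    category.is_quality_failure()
}
```

Gating on an allow-list of quality categories, rather than a deny-list of infrastructure ones, means a newly added category does not trigger self-reflection by accident.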
bug-ops force-pushed the tool-error-taxonomy branch from 498aa07 to 513ce8e March 27, 2026 08:21
bug-ops merged commit a0cf2e3 into main Mar 27, 2026
25 checks passed
bug-ops deleted the tool-error-taxonomy branch March 27, 2026 08:29
bug-ops added a commit that referenced this pull request Mar 27, 2026
, #2206)

Complements #2214 (ToolErrorCategory + ToolErrorFeedback) with two gaps that
remained after that PR:

**Shell executor exit-code classification**
ShellExecutor now returns `ToolError::Shell { exit_code, category, message }`
for well-known failure modes instead of embedding errors in the output string:
- exit 126 (permission denied) → ToolErrorCategory::PolicyBlocked
- exit 127 (command not found) → ToolErrorCategory::PermanentFailure
- stderr containing "permission denied" → PolicyBlocked
- stderr containing "no such file or directory" → PermanentFailure

These errors flow through the existing `ToolErrorFeedback.format_for_llm()`
injection in native.rs, producing structured `[tool_error]` blocks for the LLM.
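
The exit-code and stderr rules above could look roughly like this. `classify_shell_failure` is a hypothetical helper; the case-insensitive stderr match, the early return for exit code 0, and the exact `[tool_error]` layout produced by `format_for_llm()` are all assumptions. Only the four mappings and the three feedback fields come from the commit messages.

```rust
/// Only the variants this sketch needs; the full enum has 12.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ToolErrorCategory {
    PolicyBlocked,
    PermanentFailure,
}

/// Returns a category for well-known shell failures, or None otherwise.
pub fn classify_shell_failure(exit_code: i32, stderr: &str) -> Option<ToolErrorCategory> {
    // Assumption: match stderr case-insensitively.
    let stderr = stderr.to_lowercase();
    match exit_code {
        0 => None, // assumption: a successful command is never classified
        126 => Some(ToolErrorCategory::PolicyBlocked),
        127 => Some(ToolErrorCategory::PermanentFailure),
        _ if stderr.contains("permission denied") => Some(ToolErrorCategory::PolicyBlocked),
        _ if stderr.contains("no such file or directory") => {
            Some(ToolErrorCategory::PermanentFailure)
        }
        _ => None,
    }
}

/// Fields follow the PR description: category, message, suggestion.
#[derive(Debug, Clone, PartialEq)]
pub struct ToolErrorFeedback {
    pub category: ToolErrorCategory,
    pub message: String,
    pub suggestion: String,
}

impl ToolErrorFeedback {
    /// Assumed block layout; the real format may differ.
    pub fn format_for_llm(&self) -> String {
        format!(
            "[tool_error]\ncategory: {:?}\nmessage: {}\nsuggestion: {}",
            self.category, self.message, self.suggestion
        )
    }
}
```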

**Skill evolution FailureKind integration**
`From<ToolErrorCategory> for FailureKind` maps the taxonomy to skill evolution
signals: PolicyBlocked/ConfirmationRequired/ToolNotFound → WrongApproach,
InvalidParameters/TypeMismatch → SyntaxError, Timeout → Timeout, infrastructure
errors (RateLimited, ServerError, NetworkError, PermanentFailure, Cancelled) →
Unknown. Eliminates string heuristic for classified tool failures.
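
A sketch of the conversion described above. The `FailureKind` variants shown are only the four this message names, so the real enum may have more. `SchemaInvalid` is not mentioned in the mapping, so sending it to `Unknown` is an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ToolErrorCategory {
    ToolNotFound,
    SchemaInvalid,
    InvalidParameters,
    TypeMismatch,
    PolicyBlocked,
    ConfirmationRequired,
    PermanentFailure,
    Cancelled,
    RateLimited,
    ServerError,
    NetworkError,
    Timeout,
}

/// Hypothetical subset of the skill evolution signal type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FailureKind {
    WrongApproach,
    SyntaxError,
    Timeout,
    Unknown,
}

impl From<ToolErrorCategory> for FailureKind {
    fn from(category: ToolErrorCategory) -> Self {
        match category {
            ToolErrorCategory::PolicyBlocked
            | ToolErrorCategory::ConfirmationRequired
            | ToolErrorCategory::ToolNotFound => FailureKind::WrongApproach,
            ToolErrorCategory::InvalidParameters | ToolErrorCategory::TypeMismatch => {
                FailureKind::SyntaxError
            }
            ToolErrorCategory::Timeout => FailureKind::Timeout,
            ToolErrorCategory::RateLimited
            | ToolErrorCategory::ServerError
            | ToolErrorCategory::NetworkError
            | ToolErrorCategory::PermanentFailure
            | ToolErrorCategory::Cancelled => FailureKind::Unknown,
            // Not covered by the commit message; assumed to be Unknown.
            ToolErrorCategory::SchemaInvalid => FailureKind::Unknown,
        }
    }
}
```

Because the match has no wildcard arm, adding a 13th category would fail to compile until its `FailureKind` is chosen explicitly.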

Closes #2207, closes #2206.
bug-ops added a commit that referenced this pull request Mar 27, 2026
, #2206) (#2226)


Labels

config (Configuration file changes), core (zeph-core crate), documentation (Improvements or additions to documentation), enhancement (New feature or request), rust (Rust code changes), size/XL (Extra large PR, 500+ lines)
