feat(tools): add structured tool error taxonomy with retry engine by bug-ops · Pull Request #2214 · bug-ops/zeph

bug-ops · 2026-03-27T08:03:12Z

Summary

Implements 12-category ToolErrorCategory enum in zeph-tools (arXiv:2601.16280) with category-specific retry/fallback strategies
Guarantees every tool_call_id always receives a corresponding tool_result, even on failure (arXiv:2509.25370)
Gates attempt_self_reflection on quality failures only — infrastructure errors (network, server, rate-limit) no longer trigger it
Adds [tools.retry] config section wired to runtime with config migration from old [agent].max_tool_retries

Changes

crates/zeph-tools/src/error_taxonomy.rs (new) — ToolErrorCategory, ToolErrorFeedback, classify_http_status, classify_io_error
crates/zeph-tools/src/executor.rs — ToolError::Http variant, retry engine integration
crates/zeph-tools/src/config.rs — RetryConfig struct with [tools.retry] section
crates/zeph-core/src/agent/session_config.rs — reads config.tools.retry.* (was config.agent.max_tool_retries)
crates/zeph-core/src/agent/tool_orchestrator.rs — retry_base_ms/retry_max_ms fields, with_retry_backoff() builder
crates/zeph-core/src/agent/tool_execution/native.rs — structured [tool_error] feedback, is_quality_failure() gate
crates/zeph-config/src/migrate.rs — migrate_agent_retry_to_tools_retry() migration step
src/init.rs — --init wizard steps for new retry config fields
config/default.toml — [tools.retry] defaults documented

Test plan

6147/6147 tests pass (cargo nextest run --workspace --lib --bins)
cargo clippy --workspace -- -D warnings clean
cargo +nightly fmt --check clean
HTTP 400/422 → InvalidParameters regression test present
io::ErrorKind::NotFound → PermanentFailure (not ToolNotFound) regression test present
Infrastructure errors do not trigger attempt_self_reflection regression test present
Config migration: [agent].max_tool_retries → [tools.retry].max_attempts
Live session test with fetch error to confirm no 400 cascade (verify bug(tools): parallel tool call with permanent error drops tool_result, causes 400 Bad Request on next LLM turn #2197 fix still holds)

, #2199) Implement 12-category ToolErrorCategory enum in zeph-tools based on arXiv:2601.16280, with category-specific retry/fallback strategies and guaranteed tool_result delivery for every tool_call_id regardless of failure mode (arXiv:2509.25370). Key changes: - Add ToolErrorCategory enum (11 variants: ToolNotFound, SchemaInvalid, InvalidParameters, TypeMismatch, PolicyBlocked, ConfirmationRequired, PermanentFailure, Cancelled, RateLimited, ServerError, NetworkError, Timeout) - Add ToolErrorFeedback struct with structured category, message, and suggestion - Add ToolError::Http { status, message } variant for fine-grained HTTP classification - Add [tools.retry] config section: max_attempts, base_ms, max_ms, budget_secs, parameter_reformat_provider (follows *_provider multi-model convention) - Wire RetryConfig to runtime: session_config reads config.tools.retry.*, retry_backoff_ms() parameterized with base_ms/max_ms from config - Gate attempt_self_reflection on is_quality_failure() — infrastructure errors (network, server, rate-limit) no longer trigger self-reflection - Config migration: migrate old [agent].max_tool_retries to [tools.retry] - --init wizard: new steps for retry_max_attempts and parameter_reformat_provider - TUI spinner: emit "Retrying tool: <name> (attempt N/M)..." status during retries Fixes: B2 (io::NotFound → PermanentFailure, not ToolNotFound) B4 (self-reflection gate with regression test) HTTP 400/422 → InvalidParameters classification

, #2206) Complements #2214 (ToolErrorCategory + ToolErrorFeedback) with two gaps that remained after that PR: **Shell executor exit-code classification** ShellExecutor now returns `ToolError::Shell { exit_code, category, message }` for well-known failure modes instead of embedding errors in the output string: - exit 126 (permission denied) → ToolErrorCategory::PolicyBlocked - exit 127 (command not found) → ToolErrorCategory::PermanentFailure - stderr containing "permission denied" → PolicyBlocked - stderr containing "no such file or directory" → PermanentFailure These errors flow through the existing `ToolErrorFeedback.format_for_llm()` injection in native.rs, producing structured `[tool_error]` blocks for the LLM. **Skill evolution FailureKind integration** `From<ToolErrorCategory> for FailureKind` maps the taxonomy to skill evolution signals: PolicyBlocked/ConfirmationRequired/ToolNotFound → WrongApproach, InvalidParameters/TypeMismatch → SyntaxError, Timeout → Timeout, infrastructure errors (RateLimited, ServerError, NetworkError, PermanentFailure, Cancelled) → Unknown. Eliminates string heuristic for classified tool failures. Closes #2207, closes #2206.

, #2206) (#2226) Complements #2214 (ToolErrorCategory + ToolErrorFeedback) with two gaps that remained after that PR: **Shell executor exit-code classification** ShellExecutor now returns `ToolError::Shell { exit_code, category, message }` for well-known failure modes instead of embedding errors in the output string: - exit 126 (permission denied) → ToolErrorCategory::PolicyBlocked - exit 127 (command not found) → ToolErrorCategory::PermanentFailure - stderr containing "permission denied" → PolicyBlocked - stderr containing "no such file or directory" → PermanentFailure These errors flow through the existing `ToolErrorFeedback.format_for_llm()` injection in native.rs, producing structured `[tool_error]` blocks for the LLM. **Skill evolution FailureKind integration** `From<ToolErrorCategory> for FailureKind` maps the taxonomy to skill evolution signals: PolicyBlocked/ConfirmationRequired/ToolNotFound → WrongApproach, InvalidParameters/TypeMismatch → SyntaxError, Timeout → Timeout, infrastructure errors (RateLimited, ServerError, NetworkError, PermanentFailure, Cancelled) → Unknown. Eliminates string heuristic for classified tool failures. Closes #2207, closes #2206.

github-actions bot added documentation Improvements or additions to documentation rust Rust code changes enhancement New feature or request core zeph-core crate config Configuration file changes size/XL Extra large PR (500+ lines) labels Mar 27, 2026

bug-ops linked an issue Mar 27, 2026 that may be closed by this pull request

research(reliability): AgentDebug — structured corrective feedback on tool failures to prevent context corruption (arXiv:2509.25370) #2199

Closed

bug-ops force-pushed the tool-error-taxonomy branch from 8d07f03 to 498aa07 Compare March 27, 2026 08:08

bug-ops enabled auto-merge (squash) March 27, 2026 08:10

bug-ops force-pushed the tool-error-taxonomy branch from 498aa07 to 513ce8e Compare March 27, 2026 08:21

bug-ops merged commit a0cf2e3 into main Mar 27, 2026
25 checks passed

bug-ops deleted the tool-error-taxonomy branch March 27, 2026 08:29

This was referenced Mar 27, 2026

bug(tools): ServerError suggestion says 'retry automatically' but retryable=false in LLM feedback #2222

Closed

bug(tools): Phase 2 transient retry does not fire for HTTP 503 fetch errors #2223

Closed

This was referenced Mar 27, 2026

feat(tools): shell exit-code taxonomy and FailureKind integration #2226

Merged

research: tool invocation reliability taxonomy — 12 categories, model-size threshold for reliable tool use (arXiv:2601.16280) #2234

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(tools): add structured tool error taxonomy with retry engine#2214

feat(tools): add structured tool error taxonomy with retry engine#2214
bug-ops merged 1 commit intomainfrom
tool-error-taxonomy

bug-ops commented Mar 27, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

bug-ops commented Mar 27, 2026

Summary

Changes

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant