-
Notifications
You must be signed in to change notification settings - Fork 2
bug(tools): Phase 2 transient retry does not fire for HTTP 503 fetch errors #2223
Description
Description
PR #2214 added Phase 2 sequential retry for transient failures on retryable executors. According to the code:
ToolError::Http { status: 503 }→ServerError→Transient(retryable)WebScrapeExecutor::is_tool_retryable("fetch")returnstrue- Default
max_attempts = 2→max_retries = 2 > 0→ Phase 2 should enter
However, in live testing (CI-207) with fetch(https://httpbin.org/status/503), no Phase 2 retry fires:
- No WARN "transient tool error, retrying with backoff" appears at any log level (debug, trace)
- No 500ms backoff delay observed
- Feedback immediately shows
retryable: falsewithout retry attempt
Reproduction
Config: testing.toml (default [tools.retry] max_attempts=2).
Prompt: Fetch https://httpbin.org/status/503 and tell me what happened. Then what is 11 times 9?
Expected: WARN log "transient tool error, retrying with backoff tool=fetch attempt=1 delay_ms=500" visible
Actual: No retry log, immediate retryable: false feedback
Investigation
- Code path is correct:
ToolError::Http { status: 503 }.kind() = Transient - Phase 2 guard
if max_retries > 0should pass - Possible causes:
is_tool_retryable_erased("fetch")returnsfalsein runtime (executor chain delegation issue)tool_results[idx]isOk(Some(...))notErr(...)at Phase 2 time (pre-exec block path)- The two parallel fetch calls (LLM issued parallel tool_calls) interact with Phase 2 in unexpected way
Severity
Medium — transient retries are a core reliability feature of PR #2214. If they don't fire, the retry engine provides no practical benefit for HTTP-based tools.