Explore subagent hangs indefinitely with Anthropic Claude Opus 4.6 -- no timeout or recovery #13841
Bug Description
When OpenCode spawns an Explore Task subagent using Anthropic Claude Opus 4.6, the subagent frequently stalls after completing tool calls. The spinner keeps animating in the TUI but no new tool calls are made and no LLM response arrives. The session hangs forever with no timeout, retry, or error surfaced to the user.
Environment
- OpenCode 1.2.0
- macOS darwin arm64
- Provider: Anthropic Claude Opus 4.6
Reproduction Steps
- Use OpenCode 1.2.0 with Anthropic Claude Opus 4.6 as provider.
- Send a prompt that triggers an Explore Task subagent (e.g., "check some-repo for how X works").
- The subagent starts, makes some tool calls successfully.
- After a `step-finish`, a new `step-start` fires but the LLM API call never returns.
- The TUI shows the spinner animating, but `esc` (interrupt) or the `p` action menu are the only escape -- no error, no timeout, no automatic recovery.
Observed Behavior (from session database)
- Subagent received user prompt at 10:05:48.
- Completed 2 steps with 8 tool calls by 10:06:00.
- Third `step-start` fired at 10:06:02 with an empty text part.
- No `step-finish` ever arrived -- the subagent hung for over 1 hour.
- Spinner kept animating but no progress was made (tool-call count frozen at 8, then 0 after restart).
- After `Ctrl+C` and session resume, the same pattern repeated -- a new Explore Task spawned but stalled again at 0 tool calls.
Expected Behavior
- Subagent API calls should have a timeout (e.g. 2-5 minutes). On timeout, OpenCode should retry the API call or surface an error to the user.
- The TUI should distinguish "spinner animating but no progress" from "actively receiving data."
Root Cause Analysis
Investigation of the codebase reveals four compounding gaps that allow a stalled LLM stream to hang the session forever.
1. No default fetch-level timeout on LLM API requests
The provider-level fetch wrapper at provider.ts:1063-1101 only applies AbortSignal.timeout() when options["timeout"] is explicitly set:
```typescript
// provider.ts:1068
if (options["timeout"] !== undefined && options["timeout"] !== null) {
  // ... apply AbortSignal.timeout()
}
```

The config schema at config.ts:981-995 documents a default of 300000ms (5 min), but this is documentation only -- the field is `.optional()` with no `.default()` call, and no default is applied during config normalization. When omitted, `options["timeout"]` is `undefined` and the guard skips entirely.
Critically, Bun's own socket-level timeout is also explicitly disabled (timeout: false at provider.ts:1099), so there is no fallback timeout of any kind when the user has not configured one.
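A minimal sketch of what filling this gap at config-normalization time could look like. The `ProviderOptions` shape and `normalizeTimeout` helper are illustrative stand-ins, not OpenCode's actual types; the key point is that only an omitted field gets the documented default, while an explicit `timeout: false` remains an opt-out:

```typescript
// Illustrative sketch: apply the documented 300s default during config
// normalization so the fetch wrapper's guard always sees a value unless
// the user explicitly disabled timeouts.
type ProviderOptions = { timeout?: number | false };

const DEFAULT_TIMEOUT_MS = 300_000; // the 5-minute default the schema documents

function normalizeTimeout(options: ProviderOptions): ProviderOptions {
  // Fill the gap only when the field was omitted entirely;
  // `timeout: false` stays as an explicit opt-out.
  if (options.timeout === undefined) {
    return { ...options, timeout: DEFAULT_TIMEOUT_MS };
  }
  return options;
}
```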
2. No stream-idle watchdog (distinct from total request timeout)
Even if a total request timeout were applied via fix (1), it would not detect the specific failure mode observed here: a connection that stays open but stops delivering SSE chunks mid-stream.
The LLM.stream() call at llm.ts:211 passes only the session's AbortSignal -- there is no streamIdleTimeout or chunk-activity watchdog. Individual tools each have their own abortAfter() timeouts (bash: 2min, webfetch: 30-120s, websearch: 25s via abort.ts), but the LLM stream itself has none.
The heartbeat code in server.ts:520-532 is for SSE connections to the TUI WebView, not for LLM API streams.
A stream-idle timeout should reset on each received chunk, while a total request timeout is a separate hard ceiling. Both are needed.
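To make the distinction concrete, here is one possible shape for such a watchdog, assuming chunks arrive as an async iterable. All names (`withIdleTimeout`, `StreamIdleTimeoutError`) are illustrative; OpenCode's real stream plumbing differs:

```typescript
// Illustrative chunk-activity watchdog: wraps an async iterable of stream
// chunks and throws if the gap between consecutive chunks exceeds idleMs.
// This is per-chunk idle detection, not a total-request ceiling.
class StreamIdleTimeoutError extends Error {
  constructor(idleMs: number) {
    super(`no stream activity for ${idleMs}ms`);
    this.name = "StreamIdleTimeoutError";
  }
}

async function* withIdleTimeout<T>(
  source: AsyncIterable<T>,
  idleMs: number,
): AsyncGenerator<T> {
  const it = source[Symbol.asyncIterator]();
  while (true) {
    let timer: ReturnType<typeof setTimeout> | undefined;
    try {
      // Race each chunk read against a fresh idle timer; creating a new
      // timer per chunk is what "resets" the watchdog on activity.
      const result = await Promise.race([
        it.next(),
        new Promise<never>((_, reject) => {
          timer = setTimeout(
            () => reject(new StreamIdleTimeoutError(idleMs)),
            idleMs,
          );
        }),
      ]);
      if (result.done) return;
      yield result.value;
    } finally {
      clearTimeout(timer);
    }
  }
}
```

A total request timeout (fix 1) would still wrap the whole call around this; the two limits are independent knobs.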
3. No max-retry cap on session-level retries
There are two retry layers in the codebase, and their interaction matters:
- Provider-level retries (`maxRetries: input.retries ?? 0` at llm.ts:228) -- capped at 0 for normal streaming calls. This layer is fine.
- Session-level retries in processor.ts:350-378 -- `attempt++` with no ceiling. As long as `SessionRetry.retryable()` returns a message, the loop continues forever with exponential backoff (2s base, 2x factor, capped at 30s per attempt).

Manual cancellation via `SessionRetry.sleep(delay, input.abort)` is the only escape. If a `StreamIdleTimeoutError` were introduced (fix 2) and marked `isRetryable`, this unbounded loop would cause repeated hangs rather than surfacing the error. See related #12234.
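A bounded version of that loop could look roughly like this. `MAX_RETRY_ATTEMPTS`, `withRetryCap`, and the injectable `sleep` are hypothetical names, but the backoff constants mirror the ones described above (2s base, 2x factor, 30s cap per attempt):

```typescript
// Illustrative capped retry loop: same exponential backoff as described,
// but attempts are bounded so a persistently failing call surfaces its
// error instead of retrying forever.
const MAX_RETRY_ATTEMPTS = 8;
const BASE_DELAY_MS = 2_000;
const MAX_DELAY_MS = 30_000;

function retryDelay(attempt: number): number {
  // attempt is 1-based; delay doubles per attempt, capped at 30s.
  return Math.min(BASE_DELAY_MS * 2 ** (attempt - 1), MAX_DELAY_MS);
}

async function withRetryCap<T>(
  run: () => Promise<T>,
  isRetryable: (err: unknown) => boolean,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise((r) => setTimeout(r, ms)),
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await run();
    } catch (err) {
      // Surface the error once the cap is hit or the error is terminal.
      if (!isRetryable(err) || attempt >= MAX_RETRY_ATTEMPTS) throw err;
      await sleep(retryDelay(attempt));
    }
  }
}
```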
4. Subagent has no independent timeout
The Task tool at task.ts:128 calls SessionPrompt.prompt() and awaits the entire subagent run with no timeout wrapper. The abort cascade (task.ts:121-124) only fires when the parent session is manually aborted by the user. The subagent does not inherit any tool-level timeout from the tool executor.
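A minimal deadline wrapper sketch for this gap. `runWithDeadline`, `SubagentTimeoutError`, and `promptFn` are hypothetical stand-ins for wrapping the real `SessionPrompt.prompt()` call; a production version would also propagate an abort signal so the subagent actually stops work, not just the await:

```typescript
// Illustrative subagent deadline: races the subagent run against a timer
// so a hung prompt cannot block the parent session indefinitely.
class SubagentTimeoutError extends Error {
  constructor(timeoutMs: number) {
    super(`subagent did not finish within ${timeoutMs}ms`);
    this.name = "SubagentTimeoutError";
  }
}

async function runWithDeadline<T>(
  promptFn: () => Promise<T>,
  timeoutMs: number,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  try {
    return await Promise.race([
      promptFn(),
      new Promise<never>((_, reject) => {
        timer = setTimeout(
          () => reject(new SubagentTimeoutError(timeoutMs)),
          timeoutMs,
        );
      }),
    ]);
  } finally {
    clearTimeout(timer);
  }
}
```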
Suggested Fix Direction
1. Apply the documented 300s default timeout at config normalization -- During config loading/merging, set `options.timeout = 300_000` when the field is `undefined` (not when it is `false`). This ensures `AbortSignal.timeout()` is always applied unless explicitly disabled. Prefer normalizing defaults in config over patching the fetch wrapper guard.
2. Add a stream-idle timeout (separate from total request timeout) -- Implement a chunk-activity watchdog in `LLM.stream()` or at the provider fetch wrapper level that resets on each SSE chunk. If no chunk arrives for e.g. 60-120s, throw a retryable error. This is distinct from the total request timeout in fix (1) and should be a separate configurable setting.
3. Cap session-level retry attempts -- Add a `MAX_RETRY_ATTEMPTS` constant (e.g. 5-10) or a `maxRetryMs` elapsed-time budget in processor.ts. Document the interaction with provider-level `maxRetries` to avoid compounding. After the cap is reached, surface the error to the user instead of retrying.
4. Add a subagent-level timeout -- Wrap the `SessionPrompt.prompt()` call in task.ts with `abortAfterAny(timeout, ctx.abort)` (e.g. 10 min default, configurable). This ensures a hung subagent cannot block the parent session indefinitely without requiring manual intervention.
5. TUI stall indicator (UX improvement, not root-cause fix) -- Track the timestamp of the last received SSE chunk and change the spinner style or show a warning after prolonged inactivity (e.g. "> 60s since last data"). This is complementary to the above fixes, not a substitute.
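The stall indicator could be driven by a tracker as small as this. `StallTracker` and the 60s threshold are illustrative; timestamps are injected to keep the logic testable:

```typescript
// Illustrative stall detector for the TUI: records the time of the last
// received chunk and reports "stalled" once the gap exceeds a threshold.
class StallTracker {
  private lastChunkAt: number;

  constructor(
    private thresholdMs: number,
    now: number = Date.now(),
  ) {
    this.lastChunkAt = now;
  }

  // Call on every received SSE chunk to reset the stall clock.
  onChunk(now: number = Date.now()): void {
    this.lastChunkAt = now;
  }

  // The TUI render loop polls this to switch spinner style / show a warning.
  isStalled(now: number = Date.now()): boolean {
    return now - this.lastChunkAt > this.thresholdMs;
  }
}
```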
Related Issues
- #11865 -- Tasks/subagents with Codex/OpenAI frequently get stuck with no timeout/retry, hanging the session forever; same symptom, confirms this is provider-agnostic
- #13395 -- Subagent `AI_APICallError` is silently swallowed, causing the parent session to hang
- #9003 -- Main agent hangs because of subagent (explore)
- #10802 -- TUI: parent session appears stuck "loading" when subagent is blocked (waiting for user input / hanging tool call); lack of visibility and recovery UX
- #12234 -- Infinite retry loop on `StreamIdleTimeoutError` during tool input generation (no max retry cap)
- #6792 -- Task tool timeouts & early termination in multi-agent conductor pattern