
Explore subagent hangs indefinitely with Anthropic Claude Opus 4.6 -- no timeout or recovery #13841

@timvw

Bug Description

When OpenCode spawns an Explore Task subagent using Anthropic Claude Opus 4.6, the subagent frequently stalls after completing tool calls. The spinner keeps animating in the TUI but no new tool calls are made and no LLM response arrives. The session hangs forever with no timeout, retry, or error surfaced to the user.

Environment

  • OpenCode 1.2.0
  • macOS darwin arm64
  • Provider: Anthropic Claude Opus 4.6

Reproduction Steps

  1. Use OpenCode 1.2.0 with Anthropic Claude Opus 4.6 as provider.
  2. Send a prompt that triggers an Explore Task subagent (e.g., "check some-repo for how X works").
  3. The subagent starts, makes some tool calls successfully.
  4. After a step-finish, a new step-start fires but the LLM API call never returns.
  5. The TUI shows the spinner animating indefinitely; the only escapes are the esc interrupt or the p action menu -- no error, no timeout, no automatic recovery.

Observed Behavior (from session database)

  • Subagent received user prompt at 10:05:48.
  • Completed 2 steps with 8 tool calls by 10:06:00.
  • Third step-start fired at 10:06:02 with an empty text part.
  • No step-finish ever arrived -- the subagent hung for over 1 hour.
  • Spinner kept animating but no progress was made (tool-call count frozen at 8, then 0 after restart).

  • After Ctrl+C and session resume, the same pattern repeated -- a new Explore Task spawned but stalled again at 0 tool calls.

Expected Behavior

  • Subagent API calls should have a timeout (e.g. 2-5 minutes). On timeout, OpenCode should retry the API call or surface an error to the user.
  • The TUI should distinguish "spinner animating but no progress" from "actively receiving data."

Root Cause Analysis

Investigation of the codebase reveals four compounding gaps that allow a stalled LLM stream to hang the session forever.

1. No default fetch-level timeout on LLM API requests

The provider-level fetch wrapper at provider.ts:1063-1101 only applies AbortSignal.timeout() when options["timeout"] is explicitly set:

```typescript
// provider.ts:1068
if (options["timeout"] !== undefined && options["timeout"] !== null) {
  // ... apply AbortSignal.timeout()
}
```

The config schema at config.ts:981-995 documents a default of 300000ms (5 min), but this is documentation only -- the field is .optional() with no .default() call, and no default is applied during config normalization. When omitted, options["timeout"] is undefined and the guard skips entirely.

Critically, Bun's own socket-level timeout is also explicitly disabled (timeout: false at provider.ts:1099), so there is no fallback timeout of any kind when the user has not configured one.
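
The gap can be sketched in isolation (a minimal sketch with hypothetical names and shapes, not OpenCode's actual code): the guard attaches no abort signal at all when `timeout` is undefined, and a normalization step that fills in the documented default would close it.

```typescript
type ProviderOptions = { timeout?: number | false };

const DEFAULT_TIMEOUT_MS = 300_000; // the documented 5-minute default

// Mirrors the provider.ts guard: no timeout configured => no signal attached.
function buildFetchInit(options: ProviderOptions): { signal?: AbortSignal } {
  const init: { signal?: AbortSignal } = {};
  if (options.timeout !== undefined && options.timeout !== false) {
    init.signal = AbortSignal.timeout(options.timeout);
  }
  return init;
}

// Possible fix: apply the default at config normalization time,
// preserving `false` as an explicit opt-out.
function normalizeOptions(options: ProviderOptions): ProviderOptions {
  return {
    ...options,
    timeout: options.timeout === undefined ? DEFAULT_TIMEOUT_MS : options.timeout,
  };
}
```

With normalization in place, `buildFetchInit(normalizeOptions({}))` always carries a signal unless the user explicitly sets `timeout: false`.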

2. No stream-idle watchdog (distinct from total request timeout)

Even if a total request timeout were applied via fix (1), it would not detect the specific failure mode observed here: a connection that stays open but stops delivering SSE chunks mid-stream.

The LLM.stream() call at llm.ts:211 passes only the session's AbortSignal -- there is no streamIdleTimeout or chunk-activity watchdog. Individual tools each have their own abortAfter() timeouts (bash: 2min, webfetch: 30-120s, websearch: 25s via abort.ts), but the LLM stream itself has none.

The heartbeat code in server.ts:520-532 is for SSE connections to the TUI WebView, not for LLM API streams.

A stream-idle timeout should reset on each received chunk, while a total request timeout is a separate hard ceiling. Both are needed.
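
A minimal sketch of such a watchdog, assuming the chunk stream is exposed as an `AsyncIterable` (all names, including `StreamIdleTimeoutError`, are hypothetical):

```typescript
class StreamIdleTimeoutError extends Error {
  readonly isRetryable = true; // lets a retry layer decide how to handle it
  constructor(idleMs: number) {
    super(`no stream activity for ${idleMs}ms`);
  }
}

// Wrap a stream so that a gap longer than `idleMs` between chunks aborts it.
// A fresh timeout races each chunk, so the watchdog resets on every chunk.
async function* withIdleTimeout<T>(
  source: AsyncIterable<T>,
  idleMs: number,
): AsyncGenerator<T> {
  const it = source[Symbol.asyncIterator]();
  while (true) {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<never>((_, reject) => {
      timer = setTimeout(() => reject(new StreamIdleTimeoutError(idleMs)), idleMs);
    });
    try {
      const result = await Promise.race([it.next(), timeout]);
      if (result.done) return;
      yield result.value;
    } finally {
      clearTimeout(timer);
    }
  }
}
```

A healthy stream passes through untouched; a stream that stops delivering chunks mid-flight surfaces a retryable error instead of hanging forever.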

3. No max-retry cap on session-level retries

There are two retry layers in the codebase, and their interaction matters:

  • Provider-level retries (maxRetries: input.retries ?? 0 at llm.ts:228) -- capped at 0 for normal streaming calls. This layer is fine.
  • Session-level retries in processor.ts:350-378 -- attempt++ with no ceiling. As long as SessionRetry.retryable() returns a message, the loop continues forever with exponential backoff (2s base, 2x factor, capped at 30s per attempt).

Manual cancellation via SessionRetry.sleep(delay, input.abort) is the only escape. If a StreamIdleTimeoutError were introduced (fix 2) and marked isRetryable, this unbounded loop would retry forever rather than surfacing the error. See related #12234.
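
For contrast, a bounded version of such a retry loop might look like this (constants and names are illustrative, mirroring the 2s base / 2x factor / 30s cap described above, not OpenCode's actual code):

```typescript
const MAX_RETRY_ATTEMPTS = 8; // illustrative cap

// Exponential backoff: 2s base, 2x factor, capped at 30s per attempt.
function backoffDelay(attempt: number): number {
  return Math.min(2_000 * 2 ** attempt, 30_000);
}

async function withRetries<T>(
  fn: () => Promise<T>,
  isRetryable: (err: unknown) => boolean,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise<void>((r) => setTimeout(r, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Surface the error once the cap is reached instead of looping forever.
      if (!isRetryable(err) || attempt >= MAX_RETRY_ATTEMPTS - 1) throw err;
      await sleep(backoffDelay(attempt));
    }
  }
}
```

The key difference from the current loop is the `attempt >= MAX_RETRY_ATTEMPTS - 1` guard: the caller eventually sees the underlying error.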

4. Subagent has no independent timeout

The Task tool at task.ts:128 calls SessionPrompt.prompt() and awaits the entire subagent run with no timeout wrapper. The abort cascade (task.ts:121-124) only fires when the parent session is manually aborted by the user. The subagent does not inherit any tool-level timeout from the tool executor.
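
A sketch of what an abortAfterAny-style wrapper could look like, assuming Node/Bun's AbortSignal.timeout and AbortSignal.any are available (the signatures here are hypothetical, not the codebase's actual helpers):

```typescript
// An AbortSignal that fires on either a timeout or a parent abort,
// whichever comes first.
function abortAfterAny(timeoutMs: number, parent?: AbortSignal): AbortSignal {
  const signals = [AbortSignal.timeout(timeoutMs)];
  if (parent) signals.push(parent);
  return AbortSignal.any(signals);
}

// Usage sketch: wrapping a long-running subagent run (names hypothetical).
async function runSubagentWithTimeout<T>(
  run: (signal: AbortSignal) => Promise<T>,
  parentAbort?: AbortSignal,
  timeoutMs = 600_000, // 10 min default, configurable
): Promise<T> {
  return run(abortAfterAny(timeoutMs, parentAbort));
}
```

This preserves the existing manual-abort cascade (the parent signal still propagates) while adding an independent ceiling on subagent runtime.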

Suggested Fix Direction

  1. Apply the documented 300s default timeout at config normalization -- During config loading/merging, set options.timeout = 300_000 when the field is undefined (not when it is false). This ensures AbortSignal.timeout() is always applied unless explicitly disabled. Prefer normalizing defaults in config over patching the fetch wrapper guard.

  2. Add a stream-idle timeout (separate from total request timeout) -- Implement a chunk-activity watchdog in LLM.stream() or at the provider fetch wrapper level that resets on each SSE chunk. If no chunk arrives for e.g. 60-120s, throw a retryable error. This is distinct from the total request timeout in fix (1) and should be a separate configurable setting.

  3. Cap session-level retry attempts -- Add a MAX_RETRY_ATTEMPTS constant (e.g. 5-10) or a maxRetryMs elapsed-time budget in processor.ts. Document the interaction with provider-level maxRetries to avoid compounding. After the cap is reached, surface the error to the user instead of retrying.

  4. Add a subagent-level timeout -- Wrap the SessionPrompt.prompt() call in task.ts with abortAfterAny(timeout, ctx.abort) (e.g. 10 min default, configurable). This ensures a hung subagent cannot block the parent session indefinitely without requiring manual intervention.

  5. TUI stall indicator (UX improvement, not root-cause fix) -- Track the timestamp of the last received SSE chunk and change the spinner style or show a warning after prolonged inactivity (e.g. "> 60s since last data"). This is complementary to the above fixes, not a substitute.
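
The stall indicator in suggestion (5) reduces to tracking a single timestamp; a minimal sketch (names and threshold are hypothetical):

```typescript
// Record the time of the last received SSE chunk and classify the
// spinner state from the elapsed gap.
class StallTracker {
  private lastChunkAt = Date.now();
  constructor(private readonly warnAfterMs = 60_000) {}

  // Call on every received chunk.
  onChunk(): void {
    this.lastChunkAt = Date.now();
  }

  // Poll from the TUI render loop to pick a spinner style / warning.
  status(now = Date.now()): "active" | "stalled" {
    return now - this.lastChunkAt > this.warnAfterMs ? "stalled" : "active";
  }
}
```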
