
Explore subagent hangs indefinitely with Anthropic Claude Opus 4.6 -- no timeout or recovery #13841

@timvw

Bug Description

When OpenCode spawns an Explore Task subagent using Anthropic Claude Opus 4.6, the subagent frequently stalls after completing tool calls. The spinner keeps animating in the TUI but no new tool calls are made and no LLM response arrives. The session hangs forever with no timeout, retry, or error surfaced to the user.

Environment

  • OpenCode 1.2.0
  • macOS darwin arm64
  • Provider: Anthropic Claude Opus 4.6

Reproduction Steps

  1. Use OpenCode 1.2.0 with Anthropic Claude Opus 4.6 as provider.
  2. Send a prompt that triggers an Explore Task subagent (e.g., "check some-repo for how X works").
  3. The subagent starts, makes some tool calls successfully.
  4. After a step-finish, a new step-start fires but the LLM API call never returns.
  5. The TUI shows the spinner animating indefinitely; the only escapes are the esc interrupt or the p action menu -- no error, no timeout, no automatic recovery.

Observed Behavior (from session database)

  • Subagent received user prompt at 10:05:48.
  • Completed 2 steps with 8 tool calls by 10:06:00.
  • Third step-start fired at 10:06:02 with an empty text part.
  • No step-finish ever arrived -- the subagent hung for over 1 hour.
  • Spinner kept animating but no progress was made (tool-call count frozen at 8, then 0 after restart).

  • After Ctrl+C and session resume, the same pattern repeated -- a new Explore Task spawned but stalled again at 0 tool calls.

Expected Behavior

  • Subagent API calls should have a timeout (e.g. 2-5 minutes). On timeout, OpenCode should retry the API call or surface an error to the user.
  • The TUI should distinguish "spinner animating but no progress" from "actively receiving data."

Root Cause Analysis

Investigation of the codebase reveals four compounding gaps that allow a stalled LLM stream to hang the session forever.

1. No default fetch-level timeout on LLM API requests

The provider-level fetch wrapper at provider.ts:1063-1101 only applies AbortSignal.timeout() when options["timeout"] is explicitly set:

```typescript
// provider.ts:1068
if (options["timeout"] !== undefined && options["timeout"] !== null) {
  // ... apply AbortSignal.timeout()
}
```

The config schema at config.ts:981-995 documents a default of 300000ms (5 min), but this is documentation only -- the field is .optional() with no .default() call, and no default is applied during config normalization. When omitted, options["timeout"] is undefined and the guard skips entirely.

Critically, Bun's own socket-level timeout is also explicitly disabled (timeout: false at provider.ts:1099), so there is no fallback timeout of any kind when the user has not configured one.
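
The gap can be sketched in isolation (a minimal sketch with hypothetical names and shapes, not OpenCode's actual code): the guard attaches no abort signal at all when `timeout` is undefined, and a normalization step that fills in the documented default would close it.

```typescript
type ProviderOptions = { timeout?: number | false };

const DEFAULT_TIMEOUT_MS = 300_000; // the documented 5-minute default

// Mirrors the provider.ts guard: no timeout configured => no signal attached.
function buildFetchInit(options: ProviderOptions): { signal?: AbortSignal } {
  const init: { signal?: AbortSignal } = {};
  if (options.timeout !== undefined && options.timeout !== false) {
    init.signal = AbortSignal.timeout(options.timeout);
  }
  return init;
}

// Possible fix: apply the default at config normalization time,
// preserving `false` as an explicit opt-out.
function normalizeOptions(options: ProviderOptions): ProviderOptions {
  return {
    ...options,
    timeout: options.timeout === undefined ? DEFAULT_TIMEOUT_MS : options.timeout,
  };
}
```

With normalization in place, `buildFetchInit(normalizeOptions({}))` always carries a signal unless the user explicitly sets `timeout: false`.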

2. No stream-idle watchdog (distinct from total request timeout)

Even if a total request timeout were applied via fix (1), it would not detect the specific failure mode observed here: a connection that stays open but stops delivering SSE chunks mid-stream.

The LLM.stream() call at llm.ts:211 passes only the session's AbortSignal -- there is no streamIdleTimeout or chunk-activity watchdog. Individual tools each have their own abortAfter() timeouts (bash: 2min, webfetch: 30-120s, websearch: 25s via abort.ts), but the LLM stream itself has none.

The heartbeat code in server.ts:520-532 is for SSE connections to the TUI WebView, not for LLM API streams.

A stream-idle timeout should reset on each received chunk, while a total request timeout is a separate hard ceiling. Both are needed.
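
A minimal sketch of such a watchdog, assuming the chunk stream is exposed as an `AsyncIterable` (all names, including `StreamIdleTimeoutError`, are hypothetical):

```typescript
class StreamIdleTimeoutError extends Error {
  readonly isRetryable = true; // lets a retry layer decide how to handle it
  constructor(idleMs: number) {
    super(`no stream activity for ${idleMs}ms`);
  }
}

// Wrap a stream so that a gap longer than `idleMs` between chunks aborts it.
// A fresh timeout races each chunk, so the watchdog resets on every chunk.
async function* withIdleTimeout<T>(
  source: AsyncIterable<T>,
  idleMs: number,
): AsyncGenerator<T> {
  const it = source[Symbol.asyncIterator]();
  while (true) {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<never>((_, reject) => {
      timer = setTimeout(() => reject(new StreamIdleTimeoutError(idleMs)), idleMs);
    });
    try {
      const result = await Promise.race([it.next(), timeout]);
      if (result.done) return;
      yield result.value;
    } finally {
      clearTimeout(timer);
    }
  }
}
```

A healthy stream passes through untouched; a stream that stops delivering chunks mid-flight surfaces a retryable error instead of hanging forever.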

3. No max-retry cap on session-level retries

There are two retry layers in the codebase, and their interaction matters:

  • Provider-level retries (maxRetries: input.retries ?? 0 at llm.ts:228) -- capped at 0 for normal streaming calls. This layer is fine.
  • Session-level retries in processor.ts:350-378 -- attempt++ with no ceiling. As long as SessionRetry.retryable() returns a message, the loop continues forever with exponential backoff (2s base, 2x factor, capped at 30s per attempt).

Manual cancellation via SessionRetry.sleep(delay, input.abort) is the only escape. If a StreamIdleTimeoutError were introduced (fix 2) and marked isRetryable, this unbounded loop would retry forever rather than surfacing the error. See related #12234.
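
For contrast, a bounded version of such a retry loop might look like this (constants and names are illustrative, mirroring the 2s base / 2x factor / 30s cap described above, not OpenCode's actual code):

```typescript
const MAX_RETRY_ATTEMPTS = 8; // illustrative cap

// Exponential backoff: 2s base, 2x factor, capped at 30s per attempt.
function backoffDelay(attempt: number): number {
  return Math.min(2_000 * 2 ** attempt, 30_000);
}

async function withRetries<T>(
  fn: () => Promise<T>,
  isRetryable: (err: unknown) => boolean,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise<void>((r) => setTimeout(r, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Surface the error once the cap is reached instead of looping forever.
      if (!isRetryable(err) || attempt >= MAX_RETRY_ATTEMPTS - 1) throw err;
      await sleep(backoffDelay(attempt));
    }
  }
}
```

The key difference from the current loop is the `attempt >= MAX_RETRY_ATTEMPTS - 1` guard: the caller eventually sees the underlying error.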

4. Subagent has no independent timeout

The Task tool at task.ts:128 calls SessionPrompt.prompt() and awaits the entire subagent run with no timeout wrapper. The abort cascade (task.ts:121-124) only fires when the parent session is manually aborted by the user. The subagent does not inherit any tool-level timeout from the tool executor.
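
A sketch of what an abortAfterAny-style wrapper could look like, assuming Node/Bun's AbortSignal.timeout and AbortSignal.any are available (the signatures here are hypothetical, not the codebase's actual helpers):

```typescript
// An AbortSignal that fires on either a timeout or a parent abort,
// whichever comes first.
function abortAfterAny(timeoutMs: number, parent?: AbortSignal): AbortSignal {
  const signals = [AbortSignal.timeout(timeoutMs)];
  if (parent) signals.push(parent);
  return AbortSignal.any(signals);
}

// Usage sketch: wrapping a long-running subagent run (names hypothetical).
async function runSubagentWithTimeout<T>(
  run: (signal: AbortSignal) => Promise<T>,
  parentAbort?: AbortSignal,
  timeoutMs = 600_000, // 10 min default, configurable
): Promise<T> {
  return run(abortAfterAny(timeoutMs, parentAbort));
}
```

This preserves the existing manual-abort cascade (the parent signal still propagates) while adding an independent ceiling on subagent runtime.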

Suggested Fix Direction

  1. Apply the documented 300s default timeout at config normalization -- During config loading/merging, set options.timeout = 300_000 when the field is undefined (not when it is false). This ensures AbortSignal.timeout() is always applied unless explicitly disabled. Prefer normalizing defaults in config over patching the fetch wrapper guard.

  2. Add a stream-idle timeout (separate from total request timeout) -- Implement a chunk-activity watchdog in LLM.stream() or at the provider fetch wrapper level that resets on each SSE chunk. If no chunk arrives for e.g. 60-120s, throw a retryable error. This is distinct from the total request timeout in fix (1) and should be a separate configurable setting.

  3. Cap session-level retry attempts -- Add a MAX_RETRY_ATTEMPTS constant (e.g. 5-10) or a maxRetryMs elapsed-time budget in processor.ts. Document the interaction with provider-level maxRetries to avoid compounding. After the cap is reached, surface the error to the user instead of retrying.

  4. Add a subagent-level timeout -- Wrap the SessionPrompt.prompt() call in task.ts with abortAfterAny(timeout, ctx.abort) (e.g. 10 min default, configurable). This ensures a hung subagent cannot block the parent session indefinitely without requiring manual intervention.

  5. TUI stall indicator (UX improvement, not root-cause fix) -- Track the timestamp of the last received SSE chunk and change the spinner style or show a warning after prolonged inactivity (e.g. "> 60s since last data"). This is complementary to the above fixes, not a substitute.
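
The stall indicator in suggestion (5) reduces to tracking a single timestamp; a minimal sketch (names and threshold are hypothetical):

```typescript
// Record the time of the last received SSE chunk and classify the
// spinner state from the elapsed gap.
class StallTracker {
  private lastChunkAt = Date.now();
  constructor(private readonly warnAfterMs = 60_000) {}

  // Call on every received chunk.
  onChunk(): void {
    this.lastChunkAt = Date.now();
  }

  // Poll from the TUI render loop to pick a spinner style / warning.
  status(now = Date.now()): "active" | "stalled" {
    return now - this.lastChunkAt > this.warnAfterMs ? "stalled" : "active";
  }
}
```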
