Session lane can deadlock indefinitely after LLM timeout, blocking all subsequent messages #7630

@13004545

Bug Description

After an LLM request times out (180 s), the session lane can get stuck indefinitely. Subsequent messages are enqueued but never dequeued or processed. Recovery requires a gateway restart.

Steps to Reproduce

  1. Send a message that triggers a long-running LLM request
  2. Wait for the request to time out (180 s) with FailoverError: LLM request timed out.
  3. Immediately after the timeout, if certain hooks are running (e.g., /new command with slug generation), the next task in the session lane starts but never completes
  4. All subsequent messages pile up in the queue indefinitely

Log Evidence

12:31:14 lane task error: lane=session:qq:dm:... error="FailoverError: LLM request timed out."
12:31:14 lane enqueue: lane=session:qq:dm:... queueSize=1
12:31:14 lane dequeue: lane=session:qq:dm:... queueSize=0
# ^^^ This dequeued task NEVER completes - no "lane task done" or "lane task error"

12:37:14 lane enqueue: lane=session:qq:dm:... queueSize=2
12:38:21 lane enqueue: lane=session:qq:dm:... queueSize=3
12:38:32 lane enqueue: lane=session:qq:dm:... queueSize=4
12:42:32 lane enqueue: lane=session:qq:dm:... queueSize=5
# Queue keeps growing, no dequeue ever happens again

Meanwhile, other lanes (cron, etc.) continue working normally, showing this is a per-session deadlock, not a global hang.

Root Cause Analysis

The nested queue pattern in run.js:

return enqueueSession(() => enqueueGlobal(async () => { ... }));

If the inner task (after session lane dequeue) encounters an unhandled exception or a Promise that never resolves, the session lane's active count is never decremented, blocking all subsequent messages.
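The blocking behavior can be reproduced in isolation with a minimal sketch. This is not the project's actual command-queue.js: createLane, state, pump, and enqueue are hypothetical names, and the lane is assumed to run one task at a time. In this sketch a thrown error still releases the lane, so it only demonstrates the never-settling Promise case.

```javascript
// Hypothetical minimal lane (an assumption, not the real command-queue.js):
// at most one task is active at a time.
function createLane() {
  const state = { active: 0, queue: [] };

  function pump() {
    if (state.active > 0 || state.queue.length === 0) return;
    const { task, resolve, reject } = state.queue.shift();
    state.active += 1;
    Promise.resolve()
      .then(task)
      .then(resolve, reject)
      .finally(() => {
        // Only reached once the task settles. A task that never
        // settles leaves state.active at 1 forever.
        state.active -= 1;
        pump();
      });
  }

  function enqueue(task) {
    return new Promise((resolve, reject) => {
      state.queue.push({ task, resolve, reject });
      pump();
    });
  }

  return { state, enqueue };
}

const lane = createLane();
lane.enqueue(() => new Promise(() => {})); // hung task: never settles
lane.enqueue(async () => "second");        // queued behind it indefinitely
console.log(lane.state.active, lane.state.queue.length); // prints "1 1"
```

Every later enqueue only grows the queue, which matches the enqueue-only pattern in the log above.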

Environment

  • Version: 2026.1.24-3
  • Channel: QQ (custom plugin)
  • OS: macOS

Workaround

Added a 5-minute timeout to command-queue.js that forcibly releases the lane if a task doesn't complete:

const LANE_TASK_TIMEOUT_MS = 5 * 60 * 1000;
// ... timeout logic that calls state.active -= 1 and pump() after timeout
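One way such a timeout could be structured is sketched below. This is an illustration under assumptions, not the actual patch: withTimeout and createTimedLane are hypothetical helpers, and only LANE_TASK_TIMEOUT_MS, state.active, and pump() are names taken from the snippet above.

```javascript
const LANE_TASK_TIMEOUT_MS = 5 * 60 * 1000;

// Hypothetical helper: settles with the task's outcome, or rejects
// after timeoutMs if the task has not settled by then.
function withTimeout(task, timeoutMs = LANE_TASK_TIMEOUT_MS) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`lane task timed out after ${timeoutMs}ms`)),
      timeoutMs,
    );
  });
  return Promise.race([Promise.resolve().then(task), timeout])
    .finally(() => clearTimeout(timer)); // do not leave the timer pending
}

// Hypothetical lane that releases itself when a task exceeds the timeout.
function createTimedLane(timeoutMs = LANE_TASK_TIMEOUT_MS) {
  const state = { active: 0, queue: [] };

  function pump() {
    if (state.active > 0 || state.queue.length === 0) return;
    const { task, resolve, reject } = state.queue.shift();
    state.active += 1;
    withTimeout(task, timeoutMs)
      .then(resolve, reject)
      .finally(() => {
        state.active -= 1; // runs on success, error, or timeout
        pump();
      });
  }

  function enqueue(task) {
    return new Promise((resolve, reject) => {
      state.queue.push({ task, resolve, reject });
      pump();
    });
  }

  return { state, enqueue };
}
```

Note that a timeout of this kind releases the lane but does not cancel the hung task. If that task later resumes, it may overlap with the next task in the same lane.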

Suggested Fix

  1. Add a built-in timeout for lane tasks (configurable)
  2. Investigate why certain message processing chains (especially /new command hooks) can leave tasks in a hung state
  3. Consider adding a lane health check that detects and recovers from stuck lanes
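The health check in item 3 could start as a pure detection function, sketched here under assumptions. The lane record shape (name, active, queue, activeSince) and the function findStuckLanes are hypothetical. In particular, activeSince (the time the current task was dequeued, in milliseconds) is a field the lane would need to start recording.

```javascript
// Hypothetical lane record: { name, active, queue, activeSince }.
// activeSince is an assumed field: ms timestamp of the last dequeue,
// or null when the lane is idle.
function findStuckLanes(lanes, now, maxTaskAgeMs) {
  return lanes
    .filter(
      (lane) =>
        lane.active > 0 &&
        lane.activeSince !== null &&
        now - lane.activeSince > maxTaskAgeMs,
    )
    .map((lane) => ({
      name: lane.name,
      queueSize: lane.queue.length,
      stuckForMs: now - lane.activeSince,
    }));
}

// Example: one lane has held a task for 678 s, another for 78 s.
const report = findStuckLanes(
  [
    { name: "session:example", active: 1, queue: ["m1", "m2"], activeSince: 0 },
    { name: "cron", active: 1, queue: [], activeSince: 600000 },
  ],
  678000, // "now" in ms
  300000, // 5-minute threshold
);
console.log(report.map((lane) => lane.name)); // prints [ 'session:example' ]
```

A periodic timer could call such a function and then log or forcibly release the reported lanes. What to do with the hung task itself is a separate design decision.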

Labels: bug, reliability

