
[Enhancement]: Lane backpressure management + tool call validation for multi-agent webchat sessions #11286

@ukr-coder


Summary

Implement lane backpressure management for multi-agent webchat sessions to prevent UI freezes during heavy agent-to-agent exchanges. Currently, when nested agents exchange large payloads (18k+ tokens), the session lane queue blocks new messages for 2-4+ minutes with no user feedback, causing browser "wait or leave page" dialogs.

Related bug report: #11284

Problem

In multi-agent setups (e.g., a coordinator agent + a Claude Code subagent, both on Opus 4.5), heavy exchanges create a cascade:

  1. Agent A sends a large request (~18k tokens input)
  2. Agent B generates a large response (~6k tokens output, streamed over 30-60+ seconds)
  3. During streaming, any new messages queue in the session lane (queueAhead=1)
  4. The queue wait exceeds 4 minutes (waitedMs=248826)
  5. The webchat UI freezes — no spinner, no progress indicator, no "agents are collaborating" message
  6. Browser shows "wait or leave page" dialog
  7. User has no way to know if the system is working or stuck

This is distinct from #7725 (Ollama timeout causing zombie state) — here the system is functioning correctly but the UX is broken.

Proposed solution

1. Webchat UI: streaming activity indicator during lane wait

When queueAhead > 0 and a run is active on the lane, the webchat UI should show a live indicator:

🔄 Agents collaborating... (2 of 3 messages processing)
   CC is generating a response (~45s elapsed)
   [Cancel] [View partial output]

Implementation hint: The gateway already emits lane wait exceeded diagnostics. Expose this as a WebSocket event (e.g., lane.status) that the Control UI subscribes to. The UI can then show a non-blocking overlay instead of letting the browser think the page is hung.
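A minimal sketch of what such an event and its rendering could look like. Only `queueAhead` and `waitedMs` come from the diagnostics quoted above; the `LaneStatusEvent` shape, the other field names, and the helper functions are hypothetical and would need to match whatever the gateway actually emits.

```typescript
// Hypothetical payload for a "lane.status" WebSocket event.
// queueAhead and waitedMs mirror the existing lane diagnostics;
// all other fields are assumptions made for illustration.
interface LaneStatusEvent {
  type: "lane.status";
  sessionId: string;
  runActive: boolean;     // whether a run currently holds the lane
  queueAhead: number;     // messages waiting behind the active run
  waitedMs: number;       // how long the queued message has waited so far
  activeAgent?: string;   // label of the agent currently generating, if known
  runElapsedMs?: number;  // elapsed time of the active run, if known
}

// Show the indicator only when something is queued behind an active run.
function shouldShowIndicator(ev: LaneStatusEvent): boolean {
  return ev.runActive && ev.queueAhead > 0;
}

// Produce the text lines for a non-blocking overlay (empty array = hide it).
function formatLaneStatus(ev: LaneStatusEvent): string[] {
  if (!shouldShowIndicator(ev)) return [];
  const lines = [`Agents collaborating... (${ev.queueAhead} message(s) waiting)`];
  if (ev.activeAgent !== undefined && ev.runElapsedMs !== undefined) {
    const secs = Math.round(ev.runElapsedMs / 1000);
    lines.push(`${ev.activeAgent} is generating a response (~${secs}s elapsed)`);
  }
  return lines;
}
```

The Control UI would call `formatLaneStatus` on each incoming event and hide the overlay when it returns an empty array.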

2. Lane: configurable per-session concurrency for nested agents

Currently, nested agent sessions share a single lane, serializing all messages. For multi-agent setups, allow configurable concurrency:

{
  "agents": {
    "defaults": {
      "lane": {
        "maxConcurrent": 1,
        "warnThresholdMs": 30000,
        "nestedAgentParallel": false
      }
    }
  }
}
  • warnThresholdMs: Emit a user-visible warning when lane wait exceeds this (default: 30s)
  • nestedAgentParallel: When true, allow nested agent runs to execute in parallel rather than serialized (experimental, opt-in)
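As a rough sketch, the gateway could resolve this block against defaults and check the warning threshold as follows. The `LaneConfig` type and both helper names are hypothetical, and the default values are simply those shown in the example above.

```typescript
// Shape mirrors the proposed "lane" config block.
interface LaneConfig {
  maxConcurrent: number;
  warnThresholdMs: number;
  nestedAgentParallel: boolean;
}

// Defaults taken from the example config in this proposal.
const LANE_DEFAULTS: LaneConfig = {
  maxConcurrent: 1,
  warnThresholdMs: 30_000,
  nestedAgentParallel: false,
};

// Merge a partial user config over the defaults and reject invalid values.
function resolveLaneConfig(user: Partial<LaneConfig> = {}): LaneConfig {
  const cfg: LaneConfig = { ...LANE_DEFAULTS, ...user };
  if (!Number.isInteger(cfg.maxConcurrent) || cfg.maxConcurrent < 1) {
    throw new RangeError("lane.maxConcurrent must be an integer >= 1");
  }
  if (!(cfg.warnThresholdMs >= 0)) {
    throw new RangeError("lane.warnThresholdMs must be >= 0");
  }
  return cfg;
}

// True once a queued message has waited longer than the warning threshold.
function laneWaitExceeded(waitedMs: number, cfg: LaneConfig): boolean {
  return waitedMs > cfg.warnThresholdMs;
}
```

With the defaults, the observed wait of `waitedMs=248826` would have crossed the threshold after the first 30 seconds, which is the point at which a user-visible warning could be emitted.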

3. Tool call validation: fail-fast on missing required parameters

The read tool requires a path parameter. When an agent issues read with argsType=object but no path, the gateway should:

  1. Return a tool_result error to the calling agent (not just log a warning)
  2. Include a helpful message: "Error: read tool requires 'path' parameter. Received: {}"
  3. Allow the agent to retry rather than silently continuing without the data

This is a separate but related issue — malformed tool calls during heavy exchanges suggest the model may be under context pressure and needs the error signal to self-correct.
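The fail-fast behavior could look roughly like the sketch below. The `ToolCall` and `ToolResultError` shapes and the `REQUIRED_PARAMS` table are hypothetical stand-ins for the gateway's real types; only the `read`/`path` requirement and the error wording come from this issue.

```typescript
// Hypothetical shapes; the gateway's real tool-call and tool_result types may differ.
interface ToolCall {
  id: string;
  name: string;
  args: Record<string, unknown>;
}

interface ToolResultError {
  type: "tool_result";
  toolCallId: string;
  isError: true;
  content: string;
}

// Required parameters per tool. Only the read -> path entry is taken from the issue.
const REQUIRED_PARAMS: Record<string, readonly string[]> = {
  read: ["path"],
};

// Returns an error tool_result when a required parameter is missing or empty,
// or null when the call may proceed.
function validateToolCall(call: ToolCall): ToolResultError | null {
  const required = REQUIRED_PARAMS[call.name] ?? [];
  const missing = required.filter((key) => {
    const value = call.args[key];
    return value === undefined || value === null || value === "";
  });
  if (missing.length === 0) return null;
  const names = missing.map((key) => `'${key}'`).join(", ");
  return {
    type: "tool_result",
    toolCallId: call.id,
    isError: true,
    content: `Error: ${call.name} tool requires ${names} parameter. Received: ${JSON.stringify(call.args)}`,
  };
}
```

Returning the error as a `tool_result` rather than throwing keeps the failure inside the normal agent loop, so the calling agent sees it and can retry with corrected arguments.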

4. WebSocket keepalive during long runs

The webchat WebSocket connection should send periodic heartbeat frames during long-running lane tasks to prevent the browser from treating the connection as stale:

[Every 15s during active run]
→ ws ping (or custom "heartbeat" event with elapsed time)
← ws pong

This keeps the connection visibly active during long runs and is intended to prevent the browser-level "page unresponsive" dialogs observed when there is no WebSocket activity.
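A minimal server-side sketch of the custom "heartbeat" variant, written against a tiny `HeartbeatSocket` interface so it does not assume any particular WebSocket library. The event name, payload fields, and function names are all hypothetical.

```typescript
// Minimal socket surface; any WebSocket implementation with send() fits.
interface HeartbeatSocket {
  send(data: string): void;
}

// Build the heartbeat payload (event name and fields are assumptions).
function buildHeartbeat(runStartedAt: number, now: number): string {
  return JSON.stringify({
    type: "heartbeat",
    elapsedMs: Math.max(0, now - runStartedAt),
  });
}

// Send a heartbeat every intervalMs while a run is active.
// Call the returned function when the run finishes to stop the timer.
function startHeartbeat(
  socket: HeartbeatSocket,
  runStartedAt: number,
  intervalMs: number = 15_000,
): () => void {
  const timer = setInterval(() => {
    socket.send(buildHeartbeat(runStartedAt, Date.now()));
  }, intervalMs);
  return () => clearInterval(timer);
}
```

An application-level message is used here because the browser WebSocket API does not expose protocol-level ping/pong frames to page scripts; a custom event also lets the UI display the elapsed time it carries.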

Alternatives considered

| Approach | Pros | Cons |
| --- | --- | --- |
| Do nothing | Zero effort | Multi-agent webchat is broken for heavy workloads |
| UI-only fix (options 1 + 4) | Low effort, high UX impact | Doesn't solve underlying serialization |
| Full lane refactor (option 2) | Solves root cause | Complex, risk of race conditions |
| This proposal (options 1 + 3 + 4) | Moderate effort, addresses both UX and data integrity | Option 2 deferred to later |

Recommended implementation order

  1. Tool call validation (option 3) — smallest change, highest correctness impact
  2. WebSocket keepalive (option 4) — small change, prevents browser dialogs
  3. UI activity indicator (option 1) — medium change, biggest UX improvement
  4. Lane concurrency (option 2) — future enhancement, needs design discussion

Environment where observed

  • OpenClaw 2026.2.3-1, Windows 11, Node.js v24.13.0
  • Two nested agents on Opus 4.5 via Claude Max
  • Webchat in Brave browser
  • Lane wait times: 123-248 seconds
  • Gateway RAM: ~470 MB (single node process running 6+ hours)
