Message desync after long agent output (responses shifted by one) #52982

@alex-blocklab

Description

Bug Description

After a long agent output, subsequent messages receive responses intended for the previous message. Sending another message then returns the response that should have been delivered to the first one. Responses are effectively shifted by one. Reproducible on both Discord and Telegram channels.

Version: openclaw 2026.3.13

Steps to Reproduce

  1. Send a message that triggers a long agent output (e.g., research task, code review)
  2. Wait for the response to be fully delivered
  3. Send a new message immediately after
  4. Observe: the response received is stale/old (from a previous context)
  5. Send another message — now receive the response that should have been delivered in step 4

Root Cause Analysis

Three interacting defects in the message processing pipeline create a race condition where response N gets delivered to message N+1:

Defect 1 (Primary): Telegram debouncer breaks sequentialize serialization

Location: discord-CcCLMjHw.js lines ~125364-125394, ~154182

The grammY sequentialize middleware serializes all updates per chat by holding a lock until the handler returns. However, the inbound debouncer's enqueue() method returns immediately when it decides to buffer a message, releasing the sequentialize lock before actual processing begins. Real processing happens later via setTimeout.

Result: Two messages for the same chat process concurrently, destroying ordering guarantees.

Sequence:

  1. Message A arrives → sequentialize acquires lock
  2. Message A enters debouncer → debouncer returns immediately (buffering) → lock released
  3. Message B arrives → sequentialize acquires lock (it's free now) → B enters processing
  4. Debouncer fires for A → A processes concurrently with B
  5. Responses may be swapped depending on completion order
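The fix suggested later (make enqueue() hold the caller until processing finishes) can be sketched as follows. This is a hypothetical minimal debouncer, not the actual openclaw/grammY internals; InboundDebouncer and the process callback are illustrative names:

```javascript
"use strict";

// Minimal sketch: enqueue() returns a promise that resolves only after
// the buffered batch is processed, so a sequentialize-style per-chat
// lock held by the caller is not released until handling has finished.
class InboundDebouncer {
  constructor(delayMs) {
    this.delayMs = delayMs;
    this.buffer = [];
    this.waiters = [];
    this.timer = null;
  }

  enqueue(message, process) {
    this.buffer.push(message);
    // Hold the caller until the batch is actually processed.
    const done = new Promise((resolve) => this.waiters.push(resolve));
    if (this.timer) clearTimeout(this.timer);
    this.timer = setTimeout(async () => {
      const batch = this.buffer.splice(0);
      const waiters = this.waiters.splice(0);
      await process(batch);                     // real handling happens here
      for (const release of waiters) release(); // only now release the callers
    }, this.delayMs);
    return done;
  }
}
```

Because the returned promise stays pending until process() completes, a sequentialize middleware awaiting the handler would keep the chat lock held across the debounce window, and step 3 of the sequence above could not start early.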

Defect 2: Stale FOLLOWUP_RUN_CALLBACKS cause cross-message delivery

Location: lines ~78452-78460 (kickFollowupDrainIfIdle), ~78483 (scheduleFollowupDrain)

FOLLOWUP_RUN_CALLBACKS is a global Map keyed by session key. When message A's run finishes, finalizeWithFollowup stores A's runFollowupTurn callback — which closes over A's opts including opts.onBlockReply (A's reply dispatcher).

When a later message triggers kickFollowupDrainIfIdle, it retrieves the stale callback from A's context and uses it to drain the queue, routing responses through the wrong delivery pipeline.
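One way to remove the stale-lookup hazard is to carry each run's callback on the queue item itself, as the suggested fixes propose. The sketch below uses illustrative names (followupQueues, enqueueFollowup, drainFollowups), not the actual openclaw internals:

```javascript
"use strict";

// Sketch: instead of a global map keyed by session key, the followup
// callback travels with the queued item, so a later drain can never
// pick up a previous message's reply dispatcher.
const followupQueues = new Map(); // sessionKey -> [{ payload, runFollowupTurn }]

function enqueueFollowup(sessionKey, payload, runFollowupTurn) {
  if (!followupQueues.has(sessionKey)) followupQueues.set(sessionKey, []);
  followupQueues.get(sessionKey).push({ payload, runFollowupTurn });
}

async function drainFollowups(sessionKey) {
  const queue = followupQueues.get(sessionKey) || [];
  while (queue.length > 0) {
    const item = queue.shift();
    // The callback closes over the right opts/onBlockReply for this
    // item; there is no stale global lookup to go wrong.
    await item.runFollowupTurn(item.payload);
  }
}
```

Each drained item is delivered through the dispatcher that enqueued it, regardless of which message's run triggers the drain.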

Defect 3: finalizeWithFollowup starts drain before delivery completes

Location: lines ~120315-120317

const finalizeWithFollowup = (value, queueKey, runFollowupTurn) => {
    // Drain is scheduled while delivery of `value` is still in flight.
    scheduleFollowupDrain(queueKey, runFollowupTurn);
    return value;
};

The drain is scheduled simultaneously with returning the payload. The next run begins before withReplyDispatcher has finished flushing the current run's delivery chain, creating a race between current delivery and followup processing.
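The reordering described in the fixes can be sketched like this. The drain stub and the deliveryDone promise are simplified stand-ins for the real delivery machinery, assumed here for illustration:

```javascript
"use strict";

// Sketch: finalizeWithFollowup awaits the current run's delivery chain
// before the followup drain is scheduled, removing the race between
// current delivery and followup processing.
const events = [];

// Stub that records ordering; the real scheduleFollowupDrain is async.
function scheduleFollowupDrain(queueKey, runFollowupTurn) {
  events.push(`drain:${queueKey}`);
  runFollowupTurn();
}

async function finalizeWithFollowup(value, queueKey, runFollowupTurn, deliveryDone) {
  await deliveryDone; // wait until reply-dispatcher flushing has finished
  scheduleFollowupDrain(queueKey, runFollowupTurn);
  return value;
}
```

With this ordering, the drain event can only ever be observed after the delivery promise has settled.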

Contributing: Command lane pump() ordering

Location: lines ~49546-49548

The next queued task starts executing before the current task's promise resolves, meaning the followup PI run can begin before the original message's delivery pipeline has been notified of completion.
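The defensive fix (settle the current entry before starting the next) can be sketched with a hypothetical serial lane. createLane, push, and pump are illustrative names, not the actual internals:

```javascript
"use strict";

// Sketch: pump() settles the current entry's promise, and yields one
// microtask so its awaiters observe completion, before the next queued
// task is started.
function createLane() {
  const queue = [];
  let running = false;

  async function pump() {
    if (running) return;
    running = true;
    while (queue.length > 0) {
      const { task, resolve, reject } = queue.shift();
      try {
        resolve(await task()); // settle the current entry first
      } catch (err) {
        reject(err);
      }
      await Promise.resolve(); // let its awaiters run before the next task
    }
    running = false;
  }

  return {
    push(task) {
      return new Promise((resolve, reject) => {
        queue.push({ task, resolve, reject });
        pump();
      });
    },
  };
}
```

The single microtask yield after resolve() is what guarantees that code awaiting task N runs before task N+1 begins.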

Why Long Outputs Trigger It

  • Longer active-run window = higher probability the next user message arrives during response delivery
  • More block reply chunks in flight = more interleaving opportunities when followup drain starts concurrently
  • Debounce timing alignment = 1-second debounce window + serialization bypass makes the race near-certain after long outputs
  • Stale callback window grows = more time between setting and using FOLLOWUP_RUN_CALLBACKS

Suggested Fixes

  1. Fix debouncer serialization (critical): Make enqueue() return a promise that resolves after processing completes, not after buffering. This restores the sequentialize guarantee. Alternatively, move the debouncer inside the sequentialize-protected handler.

  2. Fix stale callbacks (critical): Store the runFollowupTurn callback on the followup queue item itself rather than in a separate global map. Or update FOLLOWUP_RUN_CALLBACKS at the start of each new run.

  3. Fix drain timing (important): Move scheduleFollowupDrain to execute after withReplyDispatcher completes all pending deliveries, not inside finalizeWithFollowup.

  4. Fix command lane ordering (defensive): Resolve the current entry's promise before calling pump() to start the next task.

All fixes are localized to the queuing and delivery infrastructure. Fix 1 alone would likely eliminate the bug for Telegram. Fix 2 addresses remaining edge cases on both platforms.

Environment

  • OpenClaw version: 2026.3.13
  • Platforms affected: Discord, Telegram
  • Node.js: v22.22.1
  • OS: Linux 6.8.0-100-generic (x64)
