Message desync after long agent output (responses shifted by one) #52982
Bug Description
After a long agent output, each subsequent message receives the response intended for the previous one: sending another message returns the response that should have been delivered to the first. Responses are effectively shifted by one. Reproducible on both the Discord and Telegram channels.
Version: openclaw 2026.3.13
Steps to Reproduce
1. Send a message that triggers a long agent output (e.g., research task, code review)
2. Wait for the response to be fully delivered
3. Send a new message immediately after
4. Observe: the response received is stale/old (from a previous message's context)
5. Send another message; it now receives the response that should have been delivered in step 4
Root Cause Analysis
Three interacting defects in the message processing pipeline create a race condition where response N gets delivered to message N+1:
Defect 1 (Primary): Telegram debouncer breaks sequentialize serialization
Location: discord-CcCLMjHw.js lines ~125364-125394, ~154182
The grammY sequentialize middleware serializes all updates per chat by holding a lock until the handler returns. However, the inbound debouncer's enqueue() method returns immediately when it decides to buffer a message, releasing the sequentialize lock before actual processing begins. Real processing happens later via setTimeout.
Result: Two messages for the same chat process concurrently, destroying ordering guarantees.
Sequence:
- Message A arrives → sequentialize acquires lock
- Message A enters debouncer → debouncer returns immediately (buffering) → lock released
- Message B arrives → sequentialize acquires lock (it's free now) → B enters processing
- Debouncer fires for A → A processes concurrently with B
- Responses may be swapped depending on completion order
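The sequence above can be reduced to a deterministic sketch. The names below (`makeDebouncer`, `schedule`) are hypothetical stand-ins, not the actual bundle's API; the point is only that an `enqueue()` that returns a resolved promise before processing runs lets a later message overtake an earlier one:

```javascript
// Deterministic stand-in for setTimeout: collect deferred callbacks manually.
const deferred = [];
const schedule = (fn) => deferred.push(fn);

function makeDebouncer(processFn) {
  let buffer = [];
  return {
    enqueue(msg) {
      buffer.push(msg);
      // BUG (as described in the report): processing is deferred, but the
      // caller gets a resolved promise now, so the sequentialize lock that
      // awaits enqueue() is released before processing actually happens.
      schedule(() => {
        const batch = buffer.slice();
        buffer = [];
        processFn(batch);
      });
      return Promise.resolve();
    },
  };
}

const order = [];
const debouncer = makeDebouncer((batch) => order.push("processed:" + batch.join(",")));

// Simulated per-chat serialization: awaiting enqueue() is supposed to hold
// the lock until message A is handled, but the early return lets B run first.
debouncer.enqueue("A");          // lock acquired and released immediately
order.push("processed:B");       // message B handled while A is still buffered
deferred.forEach((fn) => fn());  // "debounce timer" fires for A afterwards

console.log(order.join(" -> ")); // processed:B -> processed:A
```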
Defect 2: Stale FOLLOWUP_RUN_CALLBACKS cause cross-message delivery
Location: lines ~78452-78460 (kickFollowupDrainIfIdle), ~78483 (scheduleFollowupDrain)
FOLLOWUP_RUN_CALLBACKS is a global Map keyed by session key. When message A's run finishes, finalizeWithFollowup stores A's runFollowupTurn callback — which closes over A's opts including opts.onBlockReply (A's reply dispatcher).
When a later message triggers kickFollowupDrainIfIdle, it retrieves the stale callback from A's context and uses it to drain the queue, routing responses through the wrong delivery pipeline.
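A minimal sketch of this hazard, with hypothetical names (the map below stands in for `FOLLOWUP_RUN_CALLBACKS`; the callbacks stand in for the closures over `opts.onBlockReply`), shows how a drain triggered by a later message can route output through the earlier message's delivery pipeline:

```javascript
// Global map keyed by session, as described in the report.
const FOLLOWUP_CALLBACKS = new Map();
const delivered = [];

function finishRun(sessionKey, onBlockReply) {
  // The stored closure captures THIS run's reply dispatcher.
  FOLLOWUP_CALLBACKS.set(sessionKey, (text) => onBlockReply(text));
}

function drainFollowups(sessionKey, queuedText) {
  // Retrieves whatever callback was stored last -- possibly a stale one.
  const cb = FOLLOWUP_CALLBACKS.get(sessionKey);
  if (cb) cb(queuedText);
}

// Message A finishes; its callback (closing over A's dispatcher) is stored.
finishRun("chat-1", (text) => delivered.push(["A-pipeline", text]));

// A later message triggers a drain for the same session: its queued
// response is delivered through A's pipeline, not the current one.
drainFollowups("chat-1", "response meant for B");
console.log(delivered); // [["A-pipeline", "response meant for B"]]
```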
Defect 3: finalizeWithFollowup starts drain before delivery completes
Location: lines ~120315-120317
```js
const finalizeWithFollowup = (value, queueKey, runFollowupTurn) => {
  scheduleFollowupDrain(queueKey, runFollowupTurn);
  return value;
};
```

The drain is scheduled at the same moment the payload is returned. The next run can begin before withReplyDispatcher has finished flushing the current run's delivery chain, creating a race between current-run delivery and followup processing.
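The ordering difference can be illustrated with a small contrast; the function names below are stand-ins for the identifiers in the report, not the actual implementation:

```javascript
function runFinalize(drainAfterDelivery) {
  const log = [];
  const flushDeliveries = () => log.push("delivered");   // stands in for withReplyDispatcher
  const scheduleDrain = () => log.push("drain-started"); // stands in for scheduleFollowupDrain

  if (drainAfterDelivery) {
    flushDeliveries();  // suggested: flush the delivery chain first
    scheduleDrain();
  } else {
    scheduleDrain();    // current: drain starts while delivery is still pending
    flushDeliveries();
  }
  return log;
}

console.log(runFinalize(false)); // [ 'drain-started', 'delivered' ]  <- race window
console.log(runFinalize(true));  // [ 'delivered', 'drain-started' ]
```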
Contributing: Command lane pump() ordering
Location: lines ~49546-49548
The next queued task starts executing before the current task's promise resolves, meaning the followup PI run can begin before the original message's delivery pipeline has been notified of completion.
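A sketch of the defensive ordering (resolve-before-pump), using hypothetical queue entries rather than the real command-lane structures:

```javascript
const events = [];
const queue = [
  { run: () => events.push("task1"), resolve: () => events.push("task1-resolved") },
  { run: () => events.push("task2"), resolve: () => events.push("task2-resolved") },
];

function pump() {
  const entry = queue.shift();
  if (!entry) return;
  entry.run();
  entry.resolve(); // notify waiters (e.g., the delivery pipeline) first...
  pump();          // ...and only then start the next queued task
}

pump();
console.log(events); // [ 'task1', 'task1-resolved', 'task2', 'task2-resolved' ]
```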
Why Long Outputs Trigger It
- Longer active-run window: higher probability that the next user message arrives while the previous response is still being delivered
- More block-reply chunks in flight: more interleaving opportunities when the followup drain starts concurrently
- Debounce timing alignment: the 1-second debounce window combined with the serialization bypass makes the race near-certain after long outputs
- Larger stale-callback window: more time elapses between setting and using FOLLOWUP_RUN_CALLBACKS
Suggested Fixes
- Fix debouncer serialization (critical): Make `enqueue()` return a promise that resolves after processing completes, not after buffering. This restores the `sequentialize` guarantee. Alternatively, move the debouncer inside the sequentialize-protected handler.
- Fix stale callbacks (critical): Store the `runFollowupTurn` callback on the followup queue item itself rather than in a separate global map, or update `FOLLOWUP_RUN_CALLBACKS` at the start of each new run.
- Fix drain timing (important): Move `scheduleFollowupDrain` to execute after `withReplyDispatcher` completes all pending deliveries, not inside `finalizeWithFollowup`.
- Fix command lane ordering (defensive): Resolve the current entry's promise before calling `pump()` to start the next task.
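A minimal sketch of the first fix, under stated assumptions (hypothetical factory name, simplified shape with no timer reset on subsequent messages, not the actual codebase API): `enqueue()` hands back a promise that settles only after the buffered batch is processed, so an awaiting sequentialize lock stays held for the full handling window.

```javascript
function makeSerializedDebouncer(processFn, delayMs) {
  let buffer = [];
  let pending = null;
  return {
    enqueue(msg) {
      buffer.push(msg);
      if (!pending) {
        pending = new Promise((resolve) => {
          setTimeout(() => {
            const batch = buffer;
            buffer = [];
            pending = null;
            resolve(processFn(batch)); // settle only after real processing
          }, delayMs);
        });
      }
      return pending; // resolves after processing, not after buffering
    },
  };
}

// Usage sketch: messages arriving within the window share one batch, and
// every caller's await completes only once that batch has been processed.
const d = makeSerializedDebouncer((batch) => batch.join("+"), 5);
d.enqueue("A");
d.enqueue("B").then((out) => console.log(out)); // A+B
```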
All fixes are localized to the queuing and delivery infrastructure. Fix 1 alone would likely eliminate the bug for Telegram. Fix 2 addresses remaining edge cases on both platforms.
Environment
- OpenClaw version: 2026.3.13
- Platforms affected: Discord, Telegram
- Node.js: v22.22.1
- OS: Linux 6.8.0-100-generic (x64)