Commit 5a3e871

fix: decouple Discord inbound worker timeout from listener timeout (#36602) (thanks @dutifulbob)
1 parent b9a20dc commit 5a3e871

17 files changed: +1047 -253 lines changed

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -144,6 +144,7 @@ Docs: https://docs.openclaw.ai
 - Agents/failover service-unavailable handling: stop treating bare proxy/CDN `service unavailable` errors as provider overload while keeping them retryable via the timeout/failover path, so transient outages no longer show false rate-limit warnings or block fallback. (#36646) thanks @jnMetaCode.
 - Agents/current-time UTC anchor: append a machine-readable UTC suffix alongside local `Current time:` lines in shared cron-style prompt contexts so agents can compare UTC-stamped workspace timestamps without doing timezone math. (#32423) thanks @jriff.
 - TUI/webchat command-owner scope alignment: treat internal-channel gateway sessions with `operator.admin` as owner-authorized in command auth, restoring cron/gateway/connector tool access for affected TUI/webchat sessions while keeping external channels on identity-based owner checks. (from #35666, #35673, #35704) Thanks @Naylenv, @Octane0411, and @Sid-Qin.
+- Discord/inbound timeout isolation: separate inbound worker timeout tracking from listener timeout budgets so queued Discord replies are no longer dropped when listener watchdog windows expire mid-run. (#36602) Thanks @dutifulbob.

 ## 2026.3.2

docs/channels/discord.md

Lines changed: 15 additions & 3 deletions
@@ -1102,12 +1102,19 @@ openclaw logs --follow

 - `Listener DiscordMessageListener timed out after 30000ms for event MESSAGE_CREATE`
 - `Slow listener detected ...`
+- `discord inbound worker timed out after ...`

-Canonical knob:
+Listener budget knob:

 - single-account: `channels.discord.eventQueue.listenerTimeout`
 - multi-account: `channels.discord.accounts.<accountId>.eventQueue.listenerTimeout`

+Worker run timeout knob:
+
+- single-account: `channels.discord.inboundWorker.runTimeoutMs`
+- multi-account: `channels.discord.accounts.<accountId>.inboundWorker.runTimeoutMs`
+- default: `1800000` (30 minutes); set `0` to disable
+
 Recommended baseline:

 ```json5
@@ -1119,14 +1126,18 @@ openclaw logs --follow
         eventQueue: {
           listenerTimeout: 120000,
         },
+        inboundWorker: {
+          runTimeoutMs: 1800000,
+        },
       },
     },
   },
 },
 }
 ```

-Tune this first before adding alternate timeout controls elsewhere.
+Use `eventQueue.listenerTimeout` for slow listener setup and `inboundWorker.runTimeoutMs`
+only if you want a separate safety valve for queued agent turns.

 </Accordion>

@@ -1177,7 +1188,8 @@ High-signal Discord fields:
 - startup/auth: `enabled`, `token`, `accounts.*`, `allowBots`
 - policy: `groupPolicy`, `dm.*`, `guilds.*`, `guilds.*.channels.*`
 - command: `commands.native`, `commands.useAccessGroups`, `configWrites`, `slashCommand.*`
-- event queue: `eventQueue.listenerTimeout` (canonical), `eventQueue.maxQueueSize`, `eventQueue.maxConcurrency`
+- event queue: `eventQueue.listenerTimeout` (listener budget), `eventQueue.maxQueueSize`, `eventQueue.maxConcurrency`
+- inbound worker: `inboundWorker.runTimeoutMs`
 - reply/history: `replyToMode`, `historyLimit`, `dmHistoryLimit`, `dms.*.historyLimit`
 - delivery: `textChunkLimit`, `chunkMode`, `maxLinesPerMessage`
 - streaming: `streaming` (legacy alias: `streamMode`), `draftChunk`, `blockStreaming`, `blockStreamingCoalesce`

Lines changed: 337 additions & 0 deletions
@@ -0,0 +1,337 @@
---
summary: "Status and next steps for decoupling Discord gateway listeners from long-running agent turns with a Discord-specific inbound worker"
owner: "openclaw"
status: "in_progress"
last_updated: "2026-03-05"
title: "Discord Async Inbound Worker Plan"
---

# Discord Async Inbound Worker Plan

## Objective

Remove Discord listener timeout as a user-facing failure mode by making inbound Discord turns asynchronous:

1. Gateway listener accepts and normalizes inbound events quickly.
2. A Discord run queue stores serialized jobs keyed by the same ordering boundary we use today.
3. A worker executes the actual agent turn outside the Carbon listener lifetime.
4. Replies are delivered back to the originating channel or thread after the run completes.

This is the long-term fix for queued Discord runs timing out at `channels.discord.eventQueue.listenerTimeout` while the agent run itself is still making progress.

## Current status

This plan is partially implemented.

Already done:

- Discord listener timeout and Discord run timeout are now separate settings.
- Accepted inbound Discord turns are enqueued into `src/discord/monitor/inbound-worker.ts`.
- The worker now owns the long-running turn instead of the Carbon listener.
- Existing per-route ordering is preserved by queue key.
- Timeout regression coverage exists for the Discord worker path.

What this means in plain language:

- the production timeout bug is fixed
- the long-running turn no longer dies just because the Discord listener budget expires
- the worker architecture is not finished yet

What is still missing:

- `DiscordInboundJob` is still only partially normalized and still carries live runtime references
- command semantics (`stop`, `new`, `reset`, future session controls) are not yet fully worker-native
- worker observability and operator status are still minimal
- there is still no restart durability

## Why this exists

Current behavior ties the full agent turn to the listener lifetime:

- `src/discord/monitor/listeners.ts` applies the timeout and abort boundary.
- `src/discord/monitor/message-handler.ts` keeps the queued run inside that boundary.
- `src/discord/monitor/message-handler.process.ts` performs media loading, routing, dispatch, typing, draft streaming, and final reply delivery inline.

That architecture has two bad properties:

- long but healthy turns can be aborted by the listener watchdog
- users can see no reply even when the downstream runtime would have produced one

Raising the timeout makes the failure less frequent but does not change the failure mode: a turn that outlives the larger budget is still aborted.

## Non-goals

- Do not redesign non-Discord channels in this pass.
- Do not broaden this into a generic all-channel worker framework in the first implementation.
- Do not extract a shared cross-channel inbound worker abstraction yet; only share low-level primitives when duplication is obvious.
- Do not add durable crash recovery in the first pass unless needed to land safely.
- Do not change route selection, binding semantics, or ACP policy in this plan.

## Current constraints

The current Discord processing path still depends on some live runtime objects that should not stay inside the long-term job payload:

- Carbon `Client`
- raw Discord event shapes
- in-memory guild history map
- thread binding manager callbacks
- live typing and draft stream state

We already moved execution onto a worker queue, but the normalization boundary is still incomplete. Right now the worker is "run later in the same process with some of the same live objects," not a fully data-only job boundary.

## Target architecture

### 1. Listener stage

`DiscordMessageListener` remains the ingress point, but its job becomes:

- run preflight and policy checks
- normalize accepted input into a serializable `DiscordInboundJob`
- enqueue the job into a per-session or per-channel async queue
- return immediately to Carbon once the enqueue succeeds

The listener should no longer own the end-to-end LLM turn lifetime.
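
The listener responsibilities above can be sketched as a single function that validates, normalizes, enqueues, and returns. This is a minimal illustration, not the code in `src/discord/monitor/listeners.ts`: the `InboundEventSketch`, `QueuedJobSketch`, and `ListenerDepsSketch` shapes and the `discord:<channelId>` key format are assumptions made for the example.

```typescript
// Hypothetical shapes for illustration; the real types in src/discord/monitor/ may differ.
type InboundEventSketch = { channelId: string; messageId: string; content: string };
type QueuedJobSketch = {
  queueKey: string;
  channelId: string;
  replyToMessageId: string;
  text: string;
};

interface ListenerDepsSketch {
  preflight(event: InboundEventSketch): boolean; // policy and gating checks
  enqueue(job: QueuedJobSketch): void; // must return without waiting for the agent turn
}

// The listener only validates, normalizes, and enqueues. It never awaits the turn.
function handleMessageCreate(
  event: InboundEventSketch,
  deps: ListenerDepsSketch,
): "enqueued" | "dropped" {
  if (!deps.preflight(event)) {
    return "dropped";
  }
  const job: QueuedJobSketch = {
    queueKey: `discord:${event.channelId}`, // assumed key format
    channelId: event.channelId,
    replyToMessageId: event.messageId,
    text: event.content.trim(),
  };
  deps.enqueue(job);
  return "enqueued";
}

// Usage: an in-memory array stands in for the worker queue.
const pendingJobs: QueuedJobSketch[] = [];
const listenerDeps: ListenerDepsSketch = {
  preflight: (event) => event.content.trim().length > 0,
  enqueue: (job) => {
    pendingJobs.push(job);
  },
};
const listenerOutcome = handleMessageCreate(
  { channelId: "c1", messageId: "m1", content: "  hello  " },
  listenerDeps,
);
```

Because `handleMessageCreate` is synchronous and never touches the agent runtime, the listener budget only has to cover preflight and normalization.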
94+
95+
### 2. Normalized job payload
96+
97+
Introduce a serializable job descriptor that contains only the data needed to run the turn later.
98+
99+
Minimum shape:
100+
101+
- route identity
102+
- `agentId`
103+
- `sessionKey`
104+
- `accountId`
105+
- `channel`
106+
- delivery identity
107+
- destination channel id
108+
- reply target message id
109+
- thread id if present
110+
- sender identity
111+
- sender id, label, username, tag
112+
- channel context
113+
- guild id
114+
- channel name or slug
115+
- thread metadata
116+
- resolved system prompt override
117+
- normalized message body
118+
- base text
119+
- effective message text
120+
- attachment descriptors or resolved media references
121+
- gating decisions
122+
- mention requirement outcome
123+
- command authorization outcome
124+
- bound session or agent metadata if applicable
125+
126+
The job payload must not contain live Carbon objects or mutable closures.
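
One way to picture the minimum shape is as a TypeScript interface plus a guard that rejects anything that is not plain data. The field names below are illustrative guesses based on the list above, not the actual `DiscordInboundJob` definition in `src/discord/monitor/inbound-job.ts`, and `isPlainData` is a hypothetical helper.

```typescript
// Illustrative field names only; the real DiscordInboundJob may be shaped differently.
interface DiscordInboundJobSketch {
  route: { agentId: string; sessionKey: string; accountId: string; channel: "discord" };
  delivery: { channelId: string; replyToMessageId?: string; threadId?: string };
  sender: { id: string; label: string; username: string; tag?: string };
  channelContext: { guildId?: string; channelName?: string; systemPromptOverride?: string };
  message: {
    baseText: string;
    effectiveText: string;
    attachments: { url: string; contentType?: string }[];
  };
  gating: { mentionRequired: boolean; commandAuthorized: boolean };
}

// Returns true only for JSON-style data: no functions, class instances, Maps, or clients.
function isPlainData(value: unknown): boolean {
  if (value === null) return true;
  if (typeof value === "string" || typeof value === "boolean") return true;
  if (typeof value === "number") return Number.isFinite(value);
  if (Array.isArray(value)) return value.every((item) => isPlainData(item));
  if (typeof value === "object") {
    const proto: unknown = Object.getPrototypeOf(value);
    if (proto !== Object.prototype && proto !== null) return false;
    const record = value as Record<string, unknown>;
    return Object.keys(record).every((key) => isPlainData(record[key]));
  }
  return false; // functions, undefined, symbols, bigint
}

const sampleJob: DiscordInboundJobSketch = {
  route: { agentId: "main", sessionKey: "discord:c1", accountId: "default", channel: "discord" },
  delivery: { channelId: "c1", replyToMessageId: "m1" },
  sender: { id: "u1", label: "Alice", username: "alice" },
  channelContext: { guildId: "g1", channelName: "general" },
  message: { baseText: "hello", effectiveText: "hello", attachments: [] },
  gating: { mentionRequired: false, commandAuthorized: true },
};
```

A guard like this could run in a test or at enqueue time to catch a live client or closure slipping back into the payload.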

Current implementation status:

- partially done
- `src/discord/monitor/inbound-job.ts` exists and defines the worker handoff
- the payload still contains live Discord runtime context and should be reduced further

### 3. Worker stage

Add a Discord-specific worker runner responsible for:

- reconstructing the turn context from `DiscordInboundJob`
- loading media and any additional channel metadata needed for the run
- dispatching the agent turn
- delivering final reply payloads
- updating status and diagnostics

Recommended location:

- `src/discord/monitor/inbound-worker.ts`
- `src/discord/monitor/inbound-job.ts`

### 4. Ordering model

Ordering must remain equivalent to today for a given route boundary.

Recommended key:

- use the same queue key logic as `resolveDiscordRunQueueKey(...)`

This preserves existing behavior:

- one bound agent conversation does not interleave with itself
- different Discord channels can still progress independently
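
The ordering guarantee can be illustrated with a small per-key promise chain: jobs that share a key run one after another, while jobs under different keys proceed independently. `KeyedRunQueueSketch` is a hypothetical stand-in, not the real `DiscordInboundWorkerQueue`, and the string keys here only imitate what `resolveDiscordRunQueueKey(...)` might return.

```typescript
// Hypothetical per-key serial queue; not the repository's DiscordInboundWorkerQueue.
class KeyedRunQueueSketch {
  private tails = new Map<string, Promise<void>>();

  enqueue(key: string, task: () => Promise<void>): Promise<void> {
    const previous = this.tails.get(key) ?? Promise.resolve();
    const run = previous.then(() => task());
    // Swallow failures on the stored tail so one failed job cannot block later jobs.
    const tail: Promise<void> = run.catch(() => undefined);
    this.tails.set(key, tail);
    void tail.then(() => {
      // Drop idle keys so the map does not grow without bound.
      if (this.tails.get(key) === tail) this.tails.delete(key);
    });
    return run; // callers still observe the job's own success or failure
  }
}

async function demoOrdering(): Promise<string[]> {
  const log: string[] = [];
  const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));
  const queue = new KeyedRunQueueSketch();
  await Promise.all([
    queue.enqueue("channel-a", async () => {
      await sleep(20);
      log.push("a1");
    }),
    queue.enqueue("channel-a", async () => {
      log.push("a2");
    }),
    queue.enqueue("channel-b", async () => {
      log.push("b1");
    }),
  ]);
  return log;
}
```

In `demoOrdering`, `a2` waits for the slower `a1` because both share `channel-a`, while `b1` finishes first because `channel-b` has its own chain, so the log ends up as `b1`, `a1`, `a2`.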

### 5. Timeout model

After cutover, there are two separate timeout classes:

- listener timeout
  - only covers normalization and enqueue
  - should be short
- run timeout
  - optional, worker-owned, explicit, and user-visible
  - should not be inherited accidentally from Carbon listener settings

This removes the current accidental coupling between "Discord gateway listener stayed alive" and "agent run is healthy."
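
A worker-owned run timeout could look roughly like the sketch below, which produces an explicit outcome instead of relying on a listener watchdog. `runWithWorkerTimeout` and `RunOutcomeSketch` are hypothetical names; treating `0` as "disabled" follows the documented `runTimeoutMs` semantics, while treating negative values the same way is an assumption of this example.

```typescript
// Hypothetical helper; the real worker in inbound-worker.ts may be structured differently.
type RunOutcomeSketch<T> =
  | { status: "completed"; value: T }
  | { status: "timed_out"; afterMs: number };

function runWithWorkerTimeout<T>(
  run: (signal: AbortSignal) => Promise<T>,
  runTimeoutMs: number,
): Promise<RunOutcomeSketch<T>> {
  const controller = new AbortController();
  // Start via a resolved promise so a synchronous throw becomes a rejection.
  const started = Promise.resolve().then(() => run(controller.signal));
  if (runTimeoutMs <= 0) {
    // 0 disables the worker timeout (negative values are treated the same, by assumption).
    return started.then((value): RunOutcomeSketch<T> => ({ status: "completed", value }));
  }
  return new Promise<RunOutcomeSketch<T>>((resolve, reject) => {
    const timer = setTimeout(() => {
      controller.abort(); // ask the run to stop; the outcome is already decided
      resolve({ status: "timed_out", afterMs: runTimeoutMs });
    }, runTimeoutMs);
    started.then(
      (value) => {
        clearTimeout(timer);
        resolve({ status: "completed", value });
      },
      (error: unknown) => {
        clearTimeout(timer);
        reject(error); // no effect if the timeout already resolved the promise
      },
    );
  });
}
```

The timeout is an explicit, typed result that the worker can log or surface to the user, and it is configured independently of any listener setting.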

## Recommended implementation phases

### Phase 1: normalization boundary

- Status: partially implemented
- Done:
  - extracted `buildDiscordInboundJob(...)`
  - added worker handoff tests
- Remaining:
  - make `DiscordInboundJob` plain data only
  - move live runtime dependencies to worker-owned services instead of per-job payload
  - stop rebuilding process context by stitching live listener refs back into the job

### Phase 2: in-memory worker queue

- Status: implemented
- Done:
  - added `DiscordInboundWorkerQueue` keyed by resolved run queue key
  - listener enqueues jobs instead of directly awaiting `processDiscordMessage(...)`
  - worker executes jobs in-process, in memory only

This is the first functional cutover.

### Phase 3: process split

- Status: not started
- Move delivery, typing, and draft streaming ownership behind worker-facing adapters.
- Replace direct use of live preflight context with worker context reconstruction.
- Keep `processDiscordMessage(...)` temporarily as a facade if needed, then split it.

### Phase 4: command semantics

- Status: not started

Make sure native Discord commands still behave correctly when work is queued:

- `stop`
- `new`
- `reset`
- any future session-control commands

The worker queue must expose enough run state for commands to target the active or queued turn.
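
As a rough picture of "enough run state", the sketch below tracks each run as queued, active, cancelled, or finished, so that `stop` can reach both the active turn and anything still waiting behind it. `RunRegistrySketch` and its method names are hypothetical; how the real commands will be wired is not decided in this plan.

```typescript
// Hypothetical run-state registry; not part of the current worker implementation.
type RunStateSketch = "queued" | "active" | "cancelled" | "finished";

interface TrackedRunSketch {
  id: string;
  sessionKey: string;
  state: RunStateSketch;
  abort?: () => void;
}

class RunRegistrySketch {
  private runs = new Map<string, TrackedRunSketch>();

  track(id: string, sessionKey: string): void {
    this.runs.set(id, { id, sessionKey, state: "queued" });
  }

  // Called when the worker picks up a job. Returns false if it was cancelled while queued.
  start(id: string, abort: () => void): boolean {
    const run = this.runs.get(id);
    if (!run || run.state !== "queued") return false;
    run.state = "active";
    run.abort = abort;
    return true;
  }

  finish(id: string): void {
    const run = this.runs.get(id);
    if (run && run.state === "active") run.state = "finished";
  }

  // `stop` cancels every queued or active run of the session and returns the count.
  stop(sessionKey: string): number {
    let cancelled = 0;
    this.runs.forEach((run) => {
      if (run.sessionKey !== sessionKey) return;
      if (run.state === "active") run.abort?.();
      if (run.state === "active" || run.state === "queued") {
        run.state = "cancelled";
        cancelled += 1;
      }
    });
    return cancelled;
  }

  stateOf(id: string): RunStateSketch | undefined {
    return this.runs.get(id)?.state;
  }
}
```

Whether `stop` should also cancel queued turns, or only the active one, is a product decision this sketch does not settle; it cancels both to show that the registry makes either choice possible.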

### Phase 5: observability and operator UX

- Status: not started
- emit queue depth and active worker counts into monitor status
- record enqueue time, start time, finish time, and timeout or cancellation reason
- surface worker-owned timeout or delivery failures clearly in logs
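
The bullets above amount to keeping a small timeline per job and deriving counters from it. The following sketch shows one possible shape; `WorkerStatusSketch`, `JobTimelineSketch`, and the reason strings are assumptions, and nothing here reflects the current contents of `src/discord/monitor/status.ts`.

```typescript
// Hypothetical status tracker; field and method names are illustrative.
type FinishReasonSketch = "completed" | "timeout" | "cancelled" | "error";

interface JobTimelineSketch {
  jobId: string;
  enqueuedAt: number;
  startedAt?: number;
  finishedAt?: number;
  reason?: FinishReasonSketch;
}

class WorkerStatusSketch {
  private timelines = new Map<string, JobTimelineSketch>();
  private now: () => number;

  // The clock is injectable so tests can control time.
  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  enqueued(jobId: string): void {
    this.timelines.set(jobId, { jobId, enqueuedAt: this.now() });
  }

  started(jobId: string): void {
    const timeline = this.timelines.get(jobId);
    if (timeline) timeline.startedAt = this.now();
  }

  finished(jobId: string, reason: FinishReasonSketch): void {
    const timeline = this.timelines.get(jobId);
    if (timeline) {
      timeline.finishedAt = this.now();
      timeline.reason = reason;
    }
  }

  // Queue depth counts jobs not yet started; active workers counts started, unfinished jobs.
  snapshot(): { queueDepth: number; activeWorkers: number } {
    let queueDepth = 0;
    let activeWorkers = 0;
    this.timelines.forEach((timeline) => {
      if (timeline.finishedAt !== undefined) return;
      if (timeline.startedAt === undefined) queueDepth += 1;
      else activeWorkers += 1;
    });
    return { queueDepth, activeWorkers };
  }

  timeline(jobId: string): JobTimelineSketch | undefined {
    return this.timelines.get(jobId);
  }
}
```

With timestamps recorded at each transition, queue wait time and run duration fall out as simple differences, and a `timeout` or `cancelled` reason can be logged explicitly.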

### Phase 6: optional durability follow-up

- Status: not started

Only after the in-memory version is stable:

- decide whether queued Discord jobs should survive gateway restart
- if yes, persist job descriptors and delivery checkpoints
- if no, document the explicit in-memory boundary

This should be a separate follow-up unless restart recovery is required to land.

## File impact

Current primary files:

- `src/discord/monitor/listeners.ts`
- `src/discord/monitor/message-handler.ts`
- `src/discord/monitor/message-handler.preflight.ts`
- `src/discord/monitor/message-handler.process.ts`
- `src/discord/monitor/status.ts`

Current worker files:

- `src/discord/monitor/inbound-job.ts`
- `src/discord/monitor/inbound-worker.ts`
- `src/discord/monitor/inbound-job.test.ts`
- `src/discord/monitor/message-handler.queue.test.ts`

Likely next touch points:

- `src/auto-reply/dispatch.ts`
- `src/discord/monitor/reply-delivery.ts`
- `src/discord/monitor/thread-bindings.ts`
- `src/discord/monitor/native-command.ts`

## Next step now

The next step is to make the worker boundary real instead of partial.

Do this next:

1. Move live runtime dependencies out of `DiscordInboundJob`
2. Keep those dependencies on the Discord worker instance instead
3. Reduce queued jobs to plain Discord-specific data:
   - route identity
   - delivery target
   - sender info
   - normalized message snapshot
   - gating and binding decisions
4. Reconstruct worker execution context from that plain data inside the worker

In practice, that means:

- `client`
- `threadBindings`
- `guildHistories`
- `discordRestFetch`
- other mutable runtime-only handles

should stop living on each queued job and instead live on the worker itself or behind worker-owned adapters.
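
A compact way to see the intended split: the job is data that could be serialized, and the worker instance holds the live handles. In this sketch `WorkerRuntimeSketch` is a hypothetical adapter standing in for handles such as `client` and `guildHistories`; its method names are invented for the example.

```typescript
// Plain job data: nothing here references a live client, callback, or in-memory map.
type PlainJobSketch = { channelId: string; replyToMessageId?: string; text: string };

// Hypothetical worker-owned adapter standing in for `client`, `guildHistories`, and similar.
interface WorkerRuntimeSketch {
  sendReply(channelId: string, text: string, replyToMessageId?: string): void;
  historyFor(channelId: string): string[];
}

class DiscordWorkerSketch {
  private runtime: WorkerRuntimeSketch;

  constructor(runtime: WorkerRuntimeSketch) {
    this.runtime = runtime;
  }

  // Rebuilds the execution context from the job's data plus worker-owned handles.
  run(job: PlainJobSketch, agentTurn: (text: string, history: string[]) => string): void {
    const history = this.runtime.historyFor(job.channelId);
    const reply = agentTurn(job.text, history);
    this.runtime.sendReply(job.channelId, reply, job.replyToMessageId);
  }
}
```

Because the job no longer carries runtime references, it can pass through `JSON.stringify` and `JSON.parse` unchanged, which is the property a later durability phase would depend on.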

After that lands, the next follow-up should be command-state cleanup for `stop`, `new`, and `reset`.

## Testing plan

Keep the existing timeout repro coverage in:

- `src/discord/monitor/message-handler.queue.test.ts`

Add new tests for:

1. listener returns after enqueue without awaiting full turn
2. per-route ordering is preserved
3. different channels still run concurrently
4. replies are delivered to the original message destination
5. `stop` cancels the active worker-owned run
6. worker failure produces visible diagnostics without blocking later jobs
7. ACP-bound Discord channels still route correctly under worker execution

## Risks and mitigations

- Risk: command semantics drift from current synchronous behavior
  Mitigation: land command-state plumbing in the same cutover, not later

- Risk: reply delivery loses thread or reply-to context
  Mitigation: make delivery identity first-class in `DiscordInboundJob`

- Risk: duplicate sends during retries or queue restarts
  Mitigation: keep first pass in-memory only, or add explicit delivery idempotency before persistence

- Risk: `message-handler.process.ts` becomes harder to reason about during migration
  Mitigation: split into normalization, execution, and delivery helpers before or during worker cutover
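
For the duplicate-send risk, "explicit delivery idempotency" could be as simple as keying each outgoing chunk by job id and chunk index and refusing to send the same key twice. The class below is a hypothetical in-memory sketch; a durable version would need to persist the delivered keys, which is exactly the follow-up this plan defers.

```typescript
// Hypothetical idempotent delivery guard; in-memory only, so it does not survive restarts.
class IdempotentDeliverySketch {
  private delivered = new Set<string>();
  private send: (channelId: string, text: string) => void;

  constructor(send: (channelId: string, text: string) => void) {
    this.send = send;
  }

  // Returns true if the chunk was sent now, false if it had already been delivered.
  deliver(jobId: string, chunkIndex: number, channelId: string, text: string): boolean {
    const key = `${jobId}:${chunkIndex}`;
    if (this.delivered.has(key)) return false;
    this.send(channelId, text);
    this.delivered.add(key); // recorded only after the send did not throw
    return true;
  }
}
```

Recording the key after the send means a failed send can be retried, at the cost of a possible duplicate if the process dies between the send and the record.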

## Acceptance criteria

The plan is complete when:

1. Discord listener timeout no longer aborts healthy long-running turns.
2. Listener lifetime and agent-turn lifetime are separate concepts in code.
3. Existing per-session ordering is preserved.
4. ACP-bound Discord channels work through the same worker path.
5. `stop` targets the worker-owned run instead of the old listener-owned call stack.
6. Timeout and delivery failures become explicit worker outcomes, not silent listener drops.

## Remaining landing strategy

Finish this in follow-up PRs:

1. make `DiscordInboundJob` plain-data only and move live runtime refs onto the worker
2. clean up command-state ownership for `stop`, `new`, and `reset`
3. add worker observability and operator status
4. decide whether durability is needed or explicitly document the in-memory boundary

This is still a bounded follow-up if kept Discord-only and if we continue to avoid a premature cross-channel worker abstraction.
