| 1 | +--- |
| 2 | +summary: "Status and next steps for decoupling Discord gateway listeners from long-running agent turns with a Discord-specific inbound worker" |
| 3 | +owner: "openclaw" |
| 4 | +status: "in_progress" |
| 5 | +last_updated: "2026-03-05" |
| 6 | +title: "Discord Async Inbound Worker Plan" |
| 7 | +--- |
| 8 | + |
| 9 | +# Discord Async Inbound Worker Plan |
| 10 | + |
| 11 | +## Objective |
| 12 | + |
| 13 | +Remove Discord listener timeout as a user-facing failure mode by making inbound Discord turns asynchronous: |
| 14 | + |
| 15 | +1. Gateway listener accepts and normalizes inbound events quickly. |
| 16 | +2. A Discord run queue stores serialized jobs keyed by the same ordering boundary we use today. |
| 17 | +3. A worker executes the actual agent turn outside the Carbon listener lifetime. |
| 18 | +4. Replies are delivered back to the originating channel or thread after the run completes. |
| 19 | + |
| 20 | +This is the long-term fix for queued Discord runs timing out at `channels.discord.eventQueue.listenerTimeout` while the agent run itself is still making progress. |
| 21 | + |
| 22 | +## Current status |
| 23 | + |
| 24 | +This plan is partially implemented. |
| 25 | + |
| 26 | +Already done: |
| 27 | + |
| 28 | +- Discord listener timeout and Discord run timeout are now separate settings. |
| 29 | +- Accepted inbound Discord turns are enqueued into `src/discord/monitor/inbound-worker.ts`. |
| 30 | +- The worker now owns the long-running turn instead of the Carbon listener. |
| 31 | +- Existing per-route ordering is preserved by queue key. |
| 32 | +- Timeout regression coverage exists for the Discord worker path. |
| 33 | + |
| 34 | +What this means in plain language: |
| 35 | + |
| 36 | +- the production timeout bug is fixed |
| 37 | +- the long-running turn no longer dies just because the Discord listener budget expires |
| 38 | +- the worker architecture is not finished yet |
| 39 | + |
| 40 | +What is still missing: |
| 41 | + |
| 42 | +- `DiscordInboundJob` is still only partially normalized and still carries live runtime references |
| 43 | +- command semantics (`stop`, `new`, `reset`, future session controls) are not yet fully worker-native |
| 44 | +- worker observability and operator status are still minimal |
| 45 | +- there is still no restart durability |
| 46 | + |
| 47 | +## Why this exists |
| 48 | + |
| 49 | +Current behavior ties the full agent turn to the listener lifetime: |
| 50 | + |
| 51 | +- `src/discord/monitor/listeners.ts` applies the timeout and abort boundary. |
| 52 | +- `src/discord/monitor/message-handler.ts` keeps the queued run inside that boundary. |
| 53 | +- `src/discord/monitor/message-handler.process.ts` performs media loading, routing, dispatch, typing, draft streaming, and final reply delivery inline. |
| 54 | + |
| 55 | +That architecture has two bad properties: |
| 56 | + |
| 57 | +- long but healthy turns can be aborted by the listener watchdog |
| 58 | +- users can see no reply even when the downstream runtime would have produced one |
| 59 | + |
| 60 | +Raising the timeout makes aborts rarer but does not change the failure mode: any turn that outlasts the new limit is still dropped.
| 61 | + |
| 62 | +## Non-goals |
| 63 | + |
| 64 | +- Do not redesign non-Discord channels in this pass. |
| 65 | +- Do not broaden this into a generic all-channel worker framework in the first implementation. |
| 66 | +- Do not extract a shared cross-channel inbound worker abstraction yet; only share low-level primitives when duplication is obvious. |
| 67 | +- Do not add durable crash recovery in the first pass unless needed to land safely. |
| 68 | +- Do not change route selection, binding semantics, or ACP policy in this plan. |
| 69 | + |
| 70 | +## Current constraints |
| 71 | + |
| 72 | +The current Discord processing path still depends on some live runtime objects that should not stay inside the long-term job payload: |
| 73 | + |
| 74 | +- Carbon `Client` |
| 75 | +- raw Discord event shapes |
| 76 | +- in-memory guild history map |
| 77 | +- thread binding manager callbacks |
| 78 | +- live typing and draft stream state |
| 79 | + |
| 80 | +We already moved execution onto a worker queue, but the normalization boundary is still incomplete. Right now the worker is "run later in the same process with some of the same live objects," not a fully data-only job boundary. |
| 81 | + |
| 82 | +## Target architecture |
| 83 | + |
| 84 | +### 1. Listener stage |
| 85 | + |
| 86 | +`DiscordMessageListener` remains the ingress point, but its job becomes: |
| 87 | + |
| 88 | +- run preflight and policy checks |
| 89 | +- normalize accepted input into a serializable `DiscordInboundJob` |
| 90 | +- enqueue the job into a per-session or per-channel async queue |
| 91 | +- return immediately to Carbon once the enqueue succeeds |
| 92 | + |
| 93 | +The listener should no longer own the end-to-end LLM turn lifetime. |
| 94 | + |
| 95 | +### 2. Normalized job payload |
| 96 | + |
| 97 | +Introduce a serializable job descriptor that contains only the data needed to run the turn later. |
| 98 | + |
| 99 | +Minimum shape: |
| 100 | + |
| 101 | +- route identity |
| 102 | + - `agentId` |
| 103 | + - `sessionKey` |
| 104 | + - `accountId` |
| 105 | + - `channel` |
| 106 | +- delivery identity |
| 107 | + - destination channel id |
| 108 | + - reply target message id |
| 109 | + - thread id if present |
| 110 | +- sender identity |
| 111 | + - sender id, label, username, tag |
| 112 | +- channel context |
| 113 | + - guild id |
| 114 | + - channel name or slug |
| 115 | + - thread metadata |
| 116 | + - resolved system prompt override |
| 117 | +- normalized message body |
| 118 | + - base text |
| 119 | + - effective message text |
| 120 | + - attachment descriptors or resolved media references |
| 121 | +- gating decisions |
| 122 | + - mention requirement outcome |
| 123 | + - command authorization outcome |
| 124 | + - bound session or agent metadata if applicable |
| 125 | + |
| 126 | +The job payload must not contain live Carbon objects or mutable closures. |
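The "plain data only" rule above can be checked mechanically. The following is a minimal sketch, not the actual contents of `src/discord/monitor/inbound-job.ts`: the `DiscordInboundJobSketch` shape, its field names, and the `isPlainData` helper are all hypothetical and only mirror the minimum shape listed above.

```typescript
// Hypothetical plain-data job shape. Field names are illustrative assumptions,
// not the real DiscordInboundJob definition.
interface DiscordInboundJobSketch {
  route: { agentId: string; sessionKey: string; accountId: string; channel: string };
  delivery: { channelId: string; replyToMessageId?: string; threadId?: string };
  sender: { id: string; label: string; username: string; tag: string };
  context: { guildId?: string; channelName?: string; systemPromptOverride?: string };
  message: { baseText: string; effectiveText: string; attachmentUrls: string[] };
  gating: { mentionSatisfied: boolean; commandAuthorized: boolean };
}

// True when a value is built only from JSON-safe primitives, arrays, and plain
// objects. Functions, class instances, undefined values, and cycles are rejected.
function isPlainData(value: unknown, path: Set<object> = new Set()): boolean {
  if (value === null) return true;
  if (typeof value === "string" || typeof value === "boolean") return true;
  if (typeof value === "number") return Number.isFinite(value);
  if (typeof value !== "object") return false; // function, undefined, symbol, bigint
  const obj = value as object;
  if (path.has(obj)) return false; // a cycle cannot be serialized
  if (!Array.isArray(obj)) {
    const proto = Object.getPrototypeOf(obj);
    if (proto !== Object.prototype && proto !== null) return false; // class instance
  }
  path.add(obj);
  const children: unknown[] = Array.isArray(obj) ? obj : Object.values(obj);
  const ok = children.every((child) => isPlainData(child, path));
  path.delete(obj);
  return ok;
}

// Example job: optional fields are omitted rather than set to undefined.
const sampleJob: DiscordInboundJobSketch = {
  route: { agentId: "main", sessionKey: "discord:123", accountId: "default", channel: "discord" },
  delivery: { channelId: "123", replyToMessageId: "456" },
  sender: { id: "789", label: "Ada", username: "ada", tag: "ada#0001" },
  context: { guildId: "42", channelName: "general" },
  message: { baseText: "hello", effectiveText: "hello", attachmentUrls: [] },
  gating: { mentionSatisfied: true, commandAuthorized: false },
};
```

A check like this could run in a handoff test so that a live handle accidentally attached to the job fails fast instead of surfacing later when persistence is attempted.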
| 127 | + |
| 128 | +Current implementation status: |
| 129 | + |
| 130 | +- partially done |
| 131 | +- `src/discord/monitor/inbound-job.ts` exists and defines the worker handoff |
| 132 | +- the payload still contains live Discord runtime context and should be reduced further |
| 133 | + |
| 134 | +### 3. Worker stage |
| 135 | + |
| 136 | +Add a Discord-specific worker runner responsible for: |
| 137 | + |
| 138 | +- reconstructing the turn context from `DiscordInboundJob` |
| 139 | +- loading media and any additional channel metadata needed for the run |
| 140 | +- dispatching the agent turn |
| 141 | +- delivering final reply payloads |
| 142 | +- updating status and diagnostics |
| 143 | + |
| 144 | +Location (both files already exist):
| 145 | + |
| 146 | +- `src/discord/monitor/inbound-worker.ts` |
| 147 | +- `src/discord/monitor/inbound-job.ts` |
| 148 | + |
| 149 | +### 4. Ordering model |
| 150 | + |
| 151 | +Ordering must remain equivalent to today for a given route boundary. |
| 152 | + |
| 153 | +Recommended key: |
| 154 | + |
| 155 | +- use the same queue key logic as `resolveDiscordRunQueueKey(...)` |
| 156 | + |
| 157 | +This preserves existing behavior: |
| 158 | + |
| 159 | +- one bound agent conversation does not interleave with itself |
| 160 | +- different Discord channels can still progress independently |
| 161 | + |
| 162 | +### 5. Timeout model |
| 163 | + |
| 164 | +After cutover, there are two separate timeout classes: |
| 165 | + |
| 166 | +- listener timeout |
| 167 | + - only covers normalization and enqueue |
| 168 | + - should be short |
| 169 | +- run timeout |
| 170 | + - optional, worker-owned, explicit, and user-visible |
| 171 | + - should not be inherited accidentally from Carbon listener settings |
| 172 | + |
| 173 | +This removes the current accidental coupling between "Discord gateway listener stayed alive" and "agent run is healthy." |
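One way to keep the two timeout classes from leaking into each other is to resolve them in a single place with no fallback between them. The sketch below assumes hypothetical names throughout: `listenerTimeoutMs` stands in for `channels.discord.eventQueue.listenerTimeout`, while `runTimeoutMs` and the 30-second default are invented for illustration.

```typescript
// Hypothetical settings shape; the real config keys may differ.
interface DiscordTimeoutSettingsSketch {
  listenerTimeoutMs?: number; // stands in for channels.discord.eventQueue.listenerTimeout
  runTimeoutMs?: number; // hypothetical worker-owned setting
}

interface ResolvedDiscordTimeouts {
  listenerTimeoutMs: number;
  runTimeoutMs: number | undefined; // undefined means "no run timeout"
}

// Assumed default, chosen only for this example.
const ASSUMED_DEFAULT_LISTENER_TIMEOUT_MS = 30000;

function resolveDiscordTimeouts(settings: DiscordTimeoutSettingsSketch): ResolvedDiscordTimeouts {
  // Treat missing, non-finite, or non-positive values as "not configured".
  const positive = (value: number | undefined): number | undefined =>
    typeof value === "number" && Number.isFinite(value) && value > 0 ? value : undefined;

  return {
    listenerTimeoutMs: positive(settings.listenerTimeoutMs) ?? ASSUMED_DEFAULT_LISTENER_TIMEOUT_MS,
    // Deliberately no fallback to the listener value: the run timeout is
    // either configured explicitly or absent.
    runTimeoutMs: positive(settings.runTimeoutMs),
  };
}
```

With this shape, shortening the listener budget cannot silently shorten agent turns, and a missing run timeout is an explicit state rather than an inherited one.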
| 174 | + |
| 175 | +## Recommended implementation phases |
| 176 | + |
| 177 | +### Phase 1: normalization boundary |
| 178 | + |
| 179 | +- Status: partially implemented |
| 180 | +- Done: |
| 181 | + - extracted `buildDiscordInboundJob(...)` |
| 182 | + - added worker handoff tests |
| 183 | +- Remaining: |
| 184 | + - make `DiscordInboundJob` plain data only |
| 185 | + - move live runtime dependencies to worker-owned services instead of per-job payload |
| 186 | + - stop rebuilding process context by stitching live listener refs back into the job |
| 187 | + |
| 188 | +### Phase 2: in-memory worker queue |
| 189 | + |
| 190 | +- Status: implemented |
| 191 | +- Done: |
| 192 | + - added `DiscordInboundWorkerQueue` keyed by resolved run queue key |
| 193 | + - listener enqueues jobs instead of directly awaiting `processDiscordMessage(...)` |
| 194 | + - worker executes jobs in-process, in memory only |
| 195 | + |
| 196 | +This is the first functional cutover. |
| 197 | + |
| 198 | +### Phase 3: process split |
| 199 | + |
| 200 | +- Status: not started |
| 201 | +- Move delivery, typing, and draft streaming ownership behind worker-facing adapters. |
| 202 | +- Replace direct use of live preflight context with worker context reconstruction. |
| 203 | +- Keep `processDiscordMessage(...)` temporarily as a facade if needed, then split it. |
| 204 | + |
| 205 | +### Phase 4: command semantics |
| 206 | + |
| 207 | +- Status: not started |
| 208 | +
| 209 | +Make sure native Discord commands still behave correctly when work is queued:
| 210 | +- `stop` |
| 211 | +- `new` |
| 212 | +- `reset` |
| 213 | +- any future session-control commands |
| 214 | + |
| 215 | +The worker queue must expose enough run state for commands to target the active or queued turn. |
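As a rough illustration of "enough run state," the worker could keep a small registry of queued and active runs per queue key. Everything here is hypothetical, including the `DiscordRunRegistrySketch` name and the choice that `stop` also drops queued turns; the plan does not specify that behavior.

```typescript
type TrackedRunState = "queued" | "active";

interface TrackedRun {
  jobId: string;
  state: TrackedRunState;
  cancel: () => void; // e.g. aborts the agent turn or skips the queued job
}

// Hypothetical registry the worker would update as jobs move through the queue.
class DiscordRunRegistrySketch {
  private readonly runsByKey = new Map<string, TrackedRun[]>();

  track(queueKey: string, jobId: string, cancel: () => void): void {
    const runs = this.runsByKey.get(queueKey) ?? [];
    runs.push({ jobId, state: "queued", cancel });
    this.runsByKey.set(queueKey, runs);
  }

  markActive(queueKey: string, jobId: string): void {
    const run = (this.runsByKey.get(queueKey) ?? []).find((r) => r.jobId === jobId);
    if (run) run.state = "active";
  }

  finish(queueKey: string, jobId: string): void {
    const remaining = (this.runsByKey.get(queueKey) ?? []).filter((r) => r.jobId !== jobId);
    if (remaining.length > 0) this.runsByKey.set(queueKey, remaining);
    else this.runsByKey.delete(queueKey);
  }

  // Assumed `stop` semantics: cancel the active run and drop queued ones.
  stop(queueKey: string): { cancelledActive: string[]; droppedQueued: string[] } {
    const runs = this.runsByKey.get(queueKey) ?? [];
    this.runsByKey.delete(queueKey);
    for (const run of runs) run.cancel();
    return {
      cancelledActive: runs.filter((r) => r.state === "active").map((r) => r.jobId),
      droppedQueued: runs.filter((r) => r.state === "queued").map((r) => r.jobId),
    };
  }
}

// Usage: one active and one queued turn on the same key, then `stop`.
const cancelledJobs: string[] = [];
const registry = new DiscordRunRegistrySketch();
registry.track("channel-a", "job-1", () => cancelledJobs.push("job-1"));
registry.track("channel-a", "job-2", () => cancelledJobs.push("job-2"));
registry.markActive("channel-a", "job-1");
const stopResult = registry.stop("channel-a");
```

The point of the sketch is that a command handler targets a queue key, not a call stack, so it works the same whether the turn is running or still waiting.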
| 216 | + |
| 217 | +### Phase 5: observability and operator UX |
| 218 | + |
| 219 | +- Status: not started |
| 220 | +- emit queue depth and active worker counts into monitor status |
| 221 | +- record enqueue time, start time, finish time, and timeout or cancellation reason |
| 222 | +- surface worker-owned timeout or delivery failures clearly in logs |
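The timestamps and counters listed above fit in a small per-job record. The sketch below is one possible shape under stated assumptions: `WorkerStatusTrackerSketch`, the outcome names, and the explicit `now` parameter (passed in so tests stay deterministic) are all hypothetical and not part of `src/discord/monitor/status.ts`.

```typescript
type WorkerJobOutcome = "completed" | "timed_out" | "cancelled" | "failed";

interface WorkerJobRecord {
  jobId: string;
  queueKey: string;
  enqueuedAt: number;
  startedAt?: number;
  finishedAt?: number;
  outcome?: WorkerJobOutcome;
  reason?: string; // timeout or cancellation reason, if any
}

// Hypothetical tracker: holds in-flight jobs and hands back the finished
// record so the caller can log it.
class WorkerStatusTrackerSketch {
  private readonly inFlight = new Map<string, WorkerJobRecord>();

  enqueued(jobId: string, queueKey: string, now: number): void {
    this.inFlight.set(jobId, { jobId, queueKey, enqueuedAt: now });
  }

  started(jobId: string, now: number): void {
    const record = this.inFlight.get(jobId);
    if (record) record.startedAt = now;
  }

  finished(jobId: string, outcome: WorkerJobOutcome, now: number, reason?: string): WorkerJobRecord | undefined {
    const record = this.inFlight.get(jobId);
    if (!record) return undefined;
    this.inFlight.delete(jobId);
    return { ...record, finishedAt: now, outcome, reason };
  }

  // Queue depth counts jobs not yet started; active counts started jobs.
  snapshot(): { queueDepth: number; activeWorkers: number } {
    let queueDepth = 0;
    let activeWorkers = 0;
    this.inFlight.forEach((record) => {
      if (record.startedAt === undefined) queueDepth += 1;
      else activeWorkers += 1;
    });
    return { queueDepth, activeWorkers };
  }
}

// Usage with fixed timestamps in milliseconds.
const tracker = new WorkerStatusTrackerSketch();
tracker.enqueued("job-1", "channel-a", 1000);
tracker.enqueued("job-2", "channel-a", 1010);
tracker.started("job-1", 1005);
const duringRun = tracker.snapshot();
const finishedRecord = tracker.finished("job-1", "timed_out", 9005, "run timeout exceeded");
const afterRun = tracker.snapshot();
```

The snapshot would feed monitor status, while the returned record carries what a log line needs: wait time, run time, and the reason for a timeout or cancellation.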
| 223 | + |
| 224 | +### Phase 6: optional durability follow-up |
| 225 | + |
| 226 | +- Status: not started |
| 227 | +
| 228 | +Only after the in-memory version is stable:
| 229 | +- decide whether queued Discord jobs should survive gateway restart |
| 230 | +- if yes, persist job descriptors and delivery checkpoints |
| 231 | +- if no, document the explicit in-memory boundary |
| 232 | + |
| 233 | +This should be a separate follow-up unless restart recovery is required to land. |
| 234 | + |
| 235 | +## File impact |
| 236 | + |
| 237 | +Current primary files: |
| 238 | + |
| 239 | +- `src/discord/monitor/listeners.ts` |
| 240 | +- `src/discord/monitor/message-handler.ts` |
| 241 | +- `src/discord/monitor/message-handler.preflight.ts` |
| 242 | +- `src/discord/monitor/message-handler.process.ts` |
| 243 | +- `src/discord/monitor/status.ts` |
| 244 | + |
| 245 | +Current worker files: |
| 246 | + |
| 247 | +- `src/discord/monitor/inbound-job.ts` |
| 248 | +- `src/discord/monitor/inbound-worker.ts` |
| 249 | +- `src/discord/monitor/inbound-job.test.ts` |
| 250 | +- `src/discord/monitor/message-handler.queue.test.ts` |
| 251 | + |
| 252 | +Likely next touch points: |
| 253 | + |
| 254 | +- `src/auto-reply/dispatch.ts` |
| 255 | +- `src/discord/monitor/reply-delivery.ts` |
| 256 | +- `src/discord/monitor/thread-bindings.ts` |
| 257 | +- `src/discord/monitor/native-command.ts` |
| 258 | + |
| 259 | +## Next step now |
| 260 | + |
| 261 | +The next step is to make the worker boundary real instead of partial. |
| 262 | + |
| 263 | +Do this next: |
| 264 | + |
| 265 | +1. Move live runtime dependencies out of `DiscordInboundJob` |
| 266 | +2. Keep those dependencies on the Discord worker instance instead |
| 267 | +3. Reduce queued jobs to plain Discord-specific data: |
| 268 | + - route identity |
| 269 | + - delivery target |
| 270 | + - sender info |
| 271 | + - normalized message snapshot |
| 272 | + - gating and binding decisions |
| 273 | +4. Reconstruct worker execution context from that plain data inside the worker |
| 274 | + |
| 275 | +In practice, that means: |
| 276 | + |
| 277 | +- `client` |
| 278 | +- `threadBindings` |
| 279 | +- `guildHistories` |
| 280 | +- `discordRestFetch` |
| 281 | +- other mutable runtime-only handles |
| 282 | + |
| 283 | +should stop living on each queued job and instead live on the worker itself or behind worker-owned adapters. |
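A minimal sketch of that split, assuming hypothetical adapter names: the worker instance is constructed once with its runtime dependencies, and each queued job carries only data. `runAgentTurn` and `sendReply` are stand-ins for whatever adapters eventually wrap `client`, `threadBindings`, `guildHistories`, and `discordRestFetch`; none of these signatures come from the codebase.

```typescript
interface DeliveryTargetSketch {
  channelId: string;
  replyToMessageId?: string;
  threadId?: string;
}

// Plain data only: nothing here references a live runtime object.
interface QueuedTurnSketch {
  sessionKey: string;
  effectiveText: string;
  delivery: DeliveryTargetSketch;
}

// Hypothetical worker-owned adapters. In the real code these would sit in
// front of the live Discord handles instead of being carried on each job.
interface DiscordWorkerDepsSketch {
  runAgentTurn(sessionKey: string, text: string): Promise<string>;
  sendReply(target: DeliveryTargetSketch, text: string): Promise<void>;
}

class DiscordInboundWorkerSketch {
  private readonly deps: DiscordWorkerDepsSketch;

  constructor(deps: DiscordWorkerDepsSketch) {
    this.deps = deps;
  }

  // Rebuilds everything it needs from the job data plus worker-owned deps.
  async run(job: QueuedTurnSketch): Promise<void> {
    const reply = await this.deps.runAgentTurn(job.sessionKey, job.effectiveText);
    await this.deps.sendReply(job.delivery, reply);
  }
}

// Usage with in-memory fakes in place of real Discord handles.
const sentReplies: Array<{ target: DeliveryTargetSketch; text: string }> = [];
const sketchWorker = new DiscordInboundWorkerSketch({
  runAgentTurn: async (sessionKey, text) => `[${sessionKey}] echo: ${text}`,
  sendReply: async (target, text) => {
    sentReplies.push({ target, text });
  },
});
const sketchRun = sketchWorker.run({
  sessionKey: "discord:123",
  effectiveText: "hello",
  delivery: { channelId: "123", replyToMessageId: "456" },
});
```

Keeping the dependencies on the worker also makes the job trivially testable: a test can swap in fakes, as above, without constructing a Carbon `Client`.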
| 284 | + |
| 285 | +After that lands, the next follow-up should be command-state cleanup for `stop`, `new`, and `reset`. |
| 286 | + |
| 287 | +## Testing plan |
| 288 | + |
| 289 | +Keep the existing timeout repro coverage in: |
| 290 | + |
| 291 | +- `src/discord/monitor/message-handler.queue.test.ts` |
| 292 | + |
| 293 | +Add new tests for: |
| 294 | + |
| 295 | +1. listener returns after enqueue without awaiting full turn |
| 296 | +2. per-route ordering is preserved |
| 297 | +3. different channels still run concurrently |
| 298 | +4. replies are delivered to the original message destination |
| 299 | +5. `stop` cancels the active worker-owned run |
| 300 | +6. worker failure produces visible diagnostics without blocking later jobs |
| 301 | +7. ACP-bound Discord channels still route correctly under worker execution |
| 302 | + |
| 303 | +## Risks and mitigations |
| 304 | + |
| 305 | +- Risk: command semantics drift from current synchronous behavior |
| 306 | + Mitigation: land command-state plumbing as the first follow-up after the plain-data job boundary, not as a late cleanup
| 307 | + |
| 308 | +- Risk: reply delivery loses thread or reply-to context |
| 309 | + Mitigation: make delivery identity first-class in `DiscordInboundJob` |
| 310 | + |
| 311 | +- Risk: duplicate sends during retries or queue restarts |
| 312 | + Mitigation: keep first pass in-memory only, or add explicit delivery idempotency before persistence |
| 313 | + |
| 314 | +- Risk: `message-handler.process.ts` becomes harder to reason about during migration |
| 315 | + Mitigation: split into normalization, execution, and delivery helpers before or during worker cutover |
| 316 | + |
| 317 | +## Acceptance criteria |
| 318 | + |
| 319 | +The plan is complete when: |
| 320 | + |
| 321 | +1. Discord listener timeout no longer aborts healthy long-running turns. |
| 322 | +2. Listener lifetime and agent-turn lifetime are separate concepts in code. |
| 323 | +3. Existing per-route ordering (by run queue key) is preserved.
| 324 | +4. ACP-bound Discord channels work through the same worker path. |
| 325 | +5. `stop` targets the worker-owned run instead of the old listener-owned call stack. |
| 326 | +6. Timeout and delivery failures become explicit worker outcomes, not silent listener drops. |
| 327 | + |
| 328 | +## Remaining landing strategy |
| 329 | + |
| 330 | +Finish this in follow-up PRs: |
| 331 | + |
| 332 | +1. make `DiscordInboundJob` plain-data only and move live runtime refs onto the worker |
| 333 | +2. clean up command-state ownership for `stop`, `new`, and `reset` |
| 334 | +3. add worker observability and operator status |
| 335 | +4. decide whether durability is needed or explicitly document the in-memory boundary |
| 336 | + |
| 337 | +This is still a bounded follow-up if kept Discord-only and if we continue to avoid a premature cross-channel worker abstraction. |