Track Codex app-server terminal notification hardening #75205

@steipete

Summary

OpenClaw hit a stuck Discord reply lane after Codex app-server accepted a turn/start request and then failed to deliver the terminal turn/completed or abort notification OpenClaw was waiting for. The active embedded-run handle stayed registered, so diagnostics correctly treated the session as an active run and did not release the lane.

Observed stuck session key: agent:main:discord:channel:1456744319972282449
Observed run/session id: b5e075cc-bf19-4f91-83e3-79e32f338bb5
OpenClaw workaround commit: 54e6e3d7daf5d0d857edf756b35628a29d11c7f5

What OpenClaw did

  • Added a Codex app-server terminal-progress watchdog after turn/start returns an in-progress turn.
  • The watchdog resets on Codex app-server notifications and request/response activity.
  • If a Codex turn stays silent past the watchdog window without reaching a terminal event, OpenClaw marks the attempt as timed out, sends a best-effort turn/interrupt, resolves the attempt, clears the active embedded-run handle, and releases the session lane.
  • Added regression coverage for the accepted-but-silent turn case in extensions/codex/src/app-server/run-attempt.test.ts.
  • Documented the behavior in the agent-loop and queue docs.
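
The watchdog behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in extensions/codex/src/app-server/run-attempt.ts: the class name, the hook names, and the explicit nowMs arguments (used instead of real timers so the logic stays deterministic) are all assumptions made for the example.

```typescript
// Hypothetical sketch; names and shapes are assumptions, not OpenClaw's API.
type AttemptOutcome = "completed" | "aborted" | "timed_out";

interface AttemptHooks {
  sendInterrupt: () => void; // best-effort turn/interrupt
  clearActiveRun: () => void; // drop the active embedded-run handle
  releaseLane: () => void; // free the session lane
}

class TerminalProgressWatchdog {
  private readonly silenceLimitMs: number;
  private readonly hooks: AttemptHooks;
  private lastActivityMs: number;
  private outcome: AttemptOutcome | null = null;

  constructor(silenceLimitMs: number, hooks: AttemptHooks, startMs: number) {
    this.silenceLimitMs = silenceLimitMs;
    this.hooks = hooks;
    this.lastActivityMs = startMs;
  }

  // Any notification or request/response traffic resets the silence window.
  noteActivity(nowMs: number): void {
    if (this.outcome === null) this.lastActivityMs = nowMs;
  }

  // A terminal notification settles the attempt normally.
  noteTerminal(kind: "completed" | "aborted"): void {
    this.settle(kind);
  }

  // Called periodically by the owner; returns the outcome once settled.
  tick(nowMs: number): AttemptOutcome | null {
    const silentForMs = nowMs - this.lastActivityMs;
    if (this.outcome === null && silentForMs >= this.silenceLimitMs) {
      try {
        this.hooks.sendInterrupt();
      } catch {
        // Best effort only: a failed interrupt must not block cleanup.
      }
      this.settle("timed_out");
    }
    return this.outcome;
  }

  // Settles at most once, so cleanup hooks never run twice.
  private settle(outcome: AttemptOutcome): void {
    if (this.outcome !== null) return;
    this.outcome = outcome;
    this.hooks.clearActiveRun();
    this.hooks.releaseLane();
  }
}
```

The point of the single settle path is that a late terminal notification arriving after a timeout cannot clear the handle or release the lane a second time.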

What Codex should fix

  • turn/start should not accept work unless the app-server listener/subscription path is healthy for that conversation.
  • App-server should guarantee a terminal notification (turn/completed, turn/aborted, or an explicit terminal error event) for every accepted turn/start, even when the underlying Responses/SSE stream idles or fails.
  • App-server should expose enough read-back state for clients to reconcile an accepted turn after listener failure, for example via thread/read turn status or a dedicated active-turn status endpoint.
  • Listener failure and SSE idle timeout paths should be surfaced as terminal app-server events, not just internal logs.
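
One way to picture the requested guarantee is a total mapping from every way a turn can end to exactly one terminal notification. The sketch below is written in TypeScript for illustration only. Codex app-server is Rust, the StreamOutcome shape is invented for this example, and "turn/error" is a placeholder name for the "explicit terminal error event" mentioned above, not an existing method.

```typescript
// Hypothetical sketch; outcome kinds and "turn/error" are assumptions.
type StreamOutcome =
  | { kind: "completed" }
  | { kind: "aborted"; reason: string }
  | { kind: "sse_idle_timeout"; idleMs: number }
  | { kind: "listener_failed"; message: string };

interface TerminalNotification {
  method: "turn/completed" | "turn/aborted" | "turn/error";
  turnId: string;
  detail?: string;
}

// Every outcome, including internal failures, yields a terminal notification.
function toTerminalNotification(
  turnId: string,
  outcome: StreamOutcome,
): TerminalNotification {
  switch (outcome.kind) {
    case "completed":
      return { method: "turn/completed", turnId };
    case "aborted":
      return { method: "turn/aborted", turnId, detail: outcome.reason };
    case "sse_idle_timeout":
      return {
        method: "turn/error",
        turnId,
        detail: `SSE stream idle for ${outcome.idleMs} ms`,
      };
    case "listener_failed":
      return {
        method: "turn/error",
        turnId,
        detail: `listener failed: ${outcome.message}`,
      };
  }
}
```

Because the switch is exhaustive over StreamOutcome, adding a new failure mode without deciding its terminal notification becomes a compile-time error rather than a silent log line.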

Evidence from code read

Codex app-server currently treats turn/start as accepted once it submits Op::UserInput. The turn lifecycle notifications, however, depend on a separate listener path that reads conversation.next_event and translates TurnStarted/TurnComplete/TurnAborted into app-server events. Because acceptance does not depend on that listener being healthy, there is a gap where OpenClaw can receive turn/start success but never receive the terminal notification it needs to release the channel lane.
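
The gap can be illustrated with a toy model. This is not Codex's code: ToyAppServer and its members are hypothetical, and the model only captures the decoupling between accepting a turn and translating core events into notifications.

```typescript
// Toy model only; names are hypothetical and do not mirror Codex's Rust code.
type CoreEvent = "TurnComplete" | "TurnAborted";

const NOTIFICATION_FOR: Record<CoreEvent, string> = {
  TurnComplete: "turn/completed",
  TurnAborted: "turn/aborted",
};

class ToyAppServer {
  listenerAlive = true;
  readonly delivered: string[] = [];
  private pending: CoreEvent[] = [];

  // Acceptance depends only on submitting the input, not on listener health.
  turnStart(): { accepted: boolean } {
    this.pending.push("TurnComplete");
    return { accepted: true };
  }

  // The separate listener path drains core events into client notifications.
  runListenerOnce(): void {
    if (!this.listenerAlive) return; // pending events are never translated
    for (const event of this.pending.splice(0)) {
      this.delivered.push(NOTIFICATION_FOR[event]);
    }
  }
}
```

With listenerAlive set to false, turnStart() still reports acceptance while delivered stays empty, which is the state a client waiting on a terminal notification cannot distinguish from a slow turn.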

Relevant Codex paths inspected locally:

  • codex-rs/app-server/src/codex_message_processor.rs: turn/start, thread/start, thread/resume, listener loop.
  • codex-rs/app-server/src/bespoke_event_handling.rs: mapping core turn lifecycle events into app-server notifications.
  • codex-rs/core/src/tasks/regular.rs and codex-rs/core/src/tasks/mod.rs: core TurnStarted, TurnComplete, and TurnAborted emission.

Verification

  • pnpm test extensions/codex/src/app-server/run-attempt.test.ts passed locally: 40 tests.
  • pnpm exec oxfmt --check --threads=1 extensions/codex/src/app-server/run-attempt.ts extensions/codex/src/app-server/run-attempt.test.ts docs/concepts/agent-loop.md docs/concepts/queue.md passed locally.
  • git diff --check origin/main...HEAD passed after rebase.
  • Testbox pnpm check:changed passed for lanes extensions, extensionTests, and docs.

Labels

maintainer: Maintainer-authored PR
