Harden forced client disconnect cleanup #3885
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2a105b8b91
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
```python
if force:
    await asyncio.shield(session_task)
```
Prevent reconnect from reusing a shutting-down session task
Using await asyncio.shield(session_task) here means a cancelled Client.close() no longer cancels the background session task; _disconnect(force=True) can then exit with stop_event already set while session_task is still running. In that window, _connect() treats the task as active (session_task is not None and not done) and returns without starting a fresh session, so a concurrent reconnect can attach to a session that is in teardown and soon resets self._session_state.session to None, causing follow-up client calls to fail. This regression is specific to forced-close cancellation races introduced by shielding.
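The window described above can be reproduced with a minimal, self-contained asyncio sketch. Everything here (`session_runner`, `forced_disconnect`, `demo`) is an illustrative stand-in, not the client's actual code; it only shows the shield semantics the comment relies on:

```python
import asyncio

async def session_runner(started: asyncio.Event) -> str:
    # Stand-in for the background session task with a slow teardown.
    started.set()
    await asyncio.sleep(0.2)
    return "cleaned up"

async def forced_disconnect(task: asyncio.Task) -> None:
    try:
        # shield() protects `task` from our cancellation, but re-raises
        # CancelledError to *us* immediately when we are cancelled.
        await asyncio.shield(task)
    except asyncio.CancelledError:
        if not task.cancelled():
            return  # force path: swallow caller cancellation
        raise

async def demo() -> tuple[bool, bool]:
    started = asyncio.Event()
    session = asyncio.create_task(session_runner(started))
    disconnect = asyncio.create_task(forced_disconnect(session))
    await started.wait()
    disconnect.cancel()  # caller cancelled mid-await
    await asyncio.wait({disconnect})
    # The race window: disconnect has returned (a lock would be
    # released here), yet the session task is still running.
    window = (disconnect.done(), session.done())
    await session  # it only finishes on its own, later
    return window

print(asyncio.run(demo()))  # (True, False)
```

The `(True, False)` result is exactly the hazardous window: the disconnect path has returned while the shielded session task is not yet done.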
jlowin left a comment
Thanks @vonbai — the diagnosis here is right. The unshielded await self._session_state.session_task does propagate caller-cancellation into the session task itself, cutting short _session_runner's ability to process stop_event, unwind its AsyncExitStack, and close the transport cleanly. That's worth fixing.
On the asyncio-vs-anyio question: session_task being an asyncio.Task is intentional — there's an existing comment at client.py:561-563 explaining that it has to outlive individual context manager scopes, which anyio's structured concurrency won't allow. So asyncio.shield + asyncio.CancelledError on this specific task is the right family of primitives, not an oversight.
The blocker: Codex's P1 race is real
The shield prevents session_task from being cancelled when the caller is cancelled mid-await. Combined with catching CancelledError for the force path, that opens a window where:
- Caller is cancelled during `await asyncio.shield(session_task)`
- `CancelledError` is caught and swallowed (force + task not cancelled)
- `finally:` runs; `session_task.done()` is False (still unwinding in the background), so the field isn't cleared
- `_disconnect` returns, lock releases
- Concurrent `_connect()` acquires the lock, sees `session_task is not None and not session_task.done()`, treats the dying session as live, increments `nesting_counter`, and doesn't start a fresh runner
- Original session task finishes, cleanup runs, transport resets
- The "reused" session is stale; subsequent calls fail
The pre-fix behavior avoided this because an unshielded cancelled await would cancel session_task too, making .done() True before the lock released. The shield breaks that invariant.
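The pre-fix invariant is easy to confirm in isolation (again a hedged toy, not the client code): cancelling a task that is blocked on a plain `await task` also cancels the awaited task, so `.done()` is already True by the time the waiter unwinds:

```python
import asyncio

async def session_runner() -> None:
    await asyncio.sleep(10)  # long teardown; cancelled before finishing

async def unshielded_disconnect(task: asyncio.Task) -> None:
    await task  # no shield: cancelling us cancels `task` too

async def demo() -> tuple[bool, bool]:
    session = asyncio.create_task(session_runner())
    disconnect = asyncio.create_task(unshielded_disconnect(session))
    await asyncio.sleep(0.01)  # let both tasks start waiting
    disconnect.cancel()
    await asyncio.wait({disconnect, session})
    # Pre-fix invariant: once the disconnect is finished, the session
    # task is done (cancelled), so a concurrent reconnect would
    # correctly start a fresh runner instead of reusing this one.
    return session.done(), session.cancelled()

print(asyncio.run(demo()))  # (True, True)
```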
The cleaner fix: match the existing pattern in this same function
There's already a working shielded-wait-with-timeout pattern a few lines up at client.py:570-579 for the connection-startup cancellation case:
```python
with anyio.CancelScope(shield=True):
    with anyio.move_on_after(3):
        try:
            await session_task
        except asyncio.CancelledError:
            pass
```

Unlike `asyncio.shield`, `anyio.CancelScope(shield=True)` actually waits for the inner await to complete rather than re-raising cancellation immediately. By the time you release the lock, `session_task.done()` is True and the concurrent-reconnect race is closed. It also keeps this function to a single cancellation pattern instead of two.
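To see the difference the shielded wait makes without pulling in anyio, here is a pure-asyncio emulation of "shield, but actually keep waiting, with a timeout". `shielded_wait` is an illustrative helper sketching the same semantics, not anyio's implementation:

```python
import asyncio

async def shielded_wait(task: asyncio.Task, timeout: float) -> None:
    # Emulates anyio.CancelScope(shield=True) + move_on_after(timeout):
    # keep waiting for `task` even if we are cancelled, but give up
    # after `timeout` seconds.
    waiter = asyncio.ensure_future(asyncio.wait({task}, timeout=timeout))
    while not waiter.done():
        try:
            await asyncio.shield(waiter)
        except asyncio.CancelledError:
            pass  # swallow caller cancellation and keep waiting

async def session_runner() -> None:
    await asyncio.sleep(0.05)  # brief teardown that we must wait out

async def demo() -> bool:
    session = asyncio.create_task(session_runner())
    disconnect = asyncio.create_task(shielded_wait(session, 3.0))
    await asyncio.sleep(0.01)
    disconnect.cancel()  # caller cancellation mid-wait
    await asyncio.wait({disconnect})
    # Unlike a bare asyncio.shield, the wait completed first, so the
    # session task is done before any lock would be released.
    return session.done()

print(asyncio.run(demo()))  # True
```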
Applied to the disconnect path:
```python
session_task = self._session_state.session_task
self._session_state.stop_event.set()
try:
    if force:
        with anyio.CancelScope(shield=True):
            with anyio.move_on_after(self._disconnect_timeout):
                try:
                    await session_task
                except asyncio.CancelledError:
                    pass
    else:
        await session_task
finally:
    self._session_state.session_task = None
```

Please also add a regression test covering the race directly: spawn a force-close, cancel it mid-await, then immediately attempt a reconnect, and assert the new session actually works (i.e., `_connect` started a fresh runner rather than reusing the dying one). The existing `test_multi_client_force_close` doesn't exercise the cancellation path.
Other notes
- Your Validation section mentions the focused test "still emitted the existing subprocess/pytest-timeout warning shape as a warning, not a failure." Could you characterize that warning? If it's pre-existing noise unrelated to this change, say so explicitly; if it's new or related, the fix should address it.
- CI isn't running yet (only label triage is green). I'll approve workflows once the above is addressed.
Generated with Codex.
Thanks, this was exactly the missing edge case. I pushed an update. On the warning: I rechecked it in a separate Validation run on the updated branch.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8fcc26eb4b
```python
else:
    await session_task
finally:
    self._session_state.session_task = None
```
Keep session task tracked until shutdown actually finishes
This finally clears self._session_state.session_task even when the force-disconnect wait exits via move_on_after(...) timeout, so _connect() can start a new runner before the previous runner has fully unwound. If the old runner continues cleanup (for example, a transport that shields or retries cancellation during connect_session teardown), it can still execute _context_manager's finally and call _reset_session_state(), which can null out the newly established session. That creates an intermittent reconnect failure where later client calls see Client is not connected despite a successful reconnect.
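This hazard is the classic shared-slot teardown race, and it can be shown with a toy model. The names below are hypothetical stand-ins (the real paths go through `_session_runner` and `_reset_session_state`): an old runner's unconditional `finally` reset nulls state that a newer runner already owns:

```python
import asyncio

class SessionState:
    session = None  # shared slot both runners write to

async def runner(state: SessionState, name: str, release: asyncio.Event) -> None:
    state.session = name
    try:
        await release.wait()
    finally:
        state.session = None  # unconditional reset clobbers successors

async def demo():
    state = SessionState()
    release_old = asyncio.Event()
    old = asyncio.create_task(runner(state, "old", release_old))
    await asyncio.sleep(0)
    # Timeout path: stop tracking the old runner without waiting for it
    # to finish, then start a new runner immediately.
    new = asyncio.create_task(runner(state, "new", asyncio.Event()))
    await asyncio.sleep(0)
    release_old.set()  # old runner now finishes its teardown...
    await old
    result = state.session  # ...and has nulled the *new* session
    new.cancel()
    await asyncio.wait({new})
    return result

print(asyncio.run(demo()))  # None
```

The new runner is still alive, yet `state.session` is `None`: exactly the intermittent "Client is not connected after a successful reconnect" failure described above.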
* Harden forced client disconnect cleanup
* Handle cancelled force-close waits

Generated with Codex.

---------

Co-authored-by: Jeremiah Lowin <[email protected]>
This fixes forced `Client.close()` cleanup when caller cancellation arrives while the client is waiting for the background session task to unwind. The forced disconnect path now uses the same `anyio.CancelScope(shield=True)` + `move_on_after()` pattern used during connection-startup cancellation, then clears `session_task` before releasing the session lock so a concurrent reconnect cannot reuse a dying runner.

The regression test pauses session teardown, cancels a force-close mid-wait, immediately attempts a reconnect, and asserts the reconnect gets a fresh session task and can call `list_tools()`.

Validation:

- `uv run pytest tests/client/client/test_client.py::test_force_close_cancelled_wait_starts_fresh_session -q`
- focused neighboring cancellation tests
- `uv run pytest tests/client/client/test_client.py -q`
- `uv run pytest tests/test_mcp_config.py::test_multi_client_force_close -q`
- `uv run pytest -n auto tests/test_mcp_config.py`
- `uv run pytest -n auto`
- `uv run prek run --all-files`

Warning note: the full `tests/test_mcp_config.py` run under xdist still emits one `PytestUnraisableExceptionWarning` from `BaseSubprocessTransport.__del__` in `test_multi_client_lifespan`. I checked an `upstream/main` baseline in a separate worktree: the same subprocess/event-loop-closed warning class is already present there, and the baseline still fails `test_multi_client_force_close` with the original cancellation bug. The focused force-close test now passes without that warning.