Harden forced client disconnect cleanup #3885
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2a105b8b91
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
```python
if force:
    await asyncio.shield(session_task)
```
Prevent reconnect from reusing a shutting-down session task
Using await asyncio.shield(session_task) here means a cancelled Client.close() no longer cancels the background session task; _disconnect(force=True) can then exit with stop_event already set while session_task is still running. In that window, _connect() treats the task as active (session_task is not None and not done) and returns without starting a fresh session, so a concurrent reconnect can attach to a session that is in teardown and soon resets self._session_state.session to None, causing follow-up client calls to fail. This regression is specific to forced-close cancellation races introduced by shielding.
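The window described above can be reproduced with a minimal, self-contained asyncio sketch. Everything here (`session_runner`, `forced_disconnect`, `demo`) is an illustrative stand-in, not the client's actual code; it only shows the shield semantics the comment relies on:

```python
import asyncio

async def session_runner(started: asyncio.Event) -> str:
    # Stand-in for the background session task with a slow teardown.
    started.set()
    await asyncio.sleep(0.2)
    return "cleaned up"

async def forced_disconnect(task: asyncio.Task) -> None:
    try:
        # shield() protects `task` from our cancellation, but re-raises
        # CancelledError to *us* immediately when we are cancelled.
        await asyncio.shield(task)
    except asyncio.CancelledError:
        if not task.cancelled():
            return  # force path: swallow caller cancellation
        raise

async def demo() -> tuple[bool, bool]:
    started = asyncio.Event()
    session = asyncio.create_task(session_runner(started))
    disconnect = asyncio.create_task(forced_disconnect(session))
    await started.wait()
    disconnect.cancel()  # caller cancelled mid-await
    await asyncio.wait({disconnect})
    # The race window: disconnect has returned (a lock would be
    # released here), yet the session task is still running.
    window = (disconnect.done(), session.done())
    await session  # it only finishes on its own, later
    return window

print(asyncio.run(demo()))  # (True, False)
```

The `(True, False)` result is exactly the hazardous window: the disconnect path has returned while the shielded session task is not yet done.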
jlowin left a comment
Thanks @vonbai — the diagnosis here is right. The unshielded await self._session_state.session_task does propagate caller-cancellation into the session task itself, cutting short _session_runner's ability to process stop_event, unwind its AsyncExitStack, and close the transport cleanly. That's worth fixing.
On the asyncio-vs-anyio question: session_task being an asyncio.Task is intentional — there's an existing comment at client.py:561-563 explaining that it has to outlive individual context manager scopes, which anyio's structured concurrency won't allow. So asyncio.shield + asyncio.CancelledError on this specific task is the right family of primitives, not an oversight.
The blocker: Codex's P1 race is real
The shield prevents session_task from being cancelled when the caller is cancelled mid-await. Combined with catching CancelledError for the force path, that opens a window where:
- Caller is cancelled during `await asyncio.shield(session_task)`
- `CancelledError` is caught and swallowed (force + task not cancelled)
- `finally:` runs; `session_task.done()` is False (still unwinding in the background), so the field isn't cleared
- `_disconnect` returns, lock releases
- Concurrent `_connect()` acquires the lock, sees `session_task is not None and not session_task.done()`, treats the dying session as live, increments `nesting_counter`, and doesn't start a fresh runner
- Original session task finishes, cleanup runs, transport resets
- The "reused" session is stale; subsequent calls fail
The pre-fix behavior avoided this because an unshielded cancelled await would cancel session_task too, making .done() True before the lock released. The shield breaks that invariant.
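The pre-fix invariant is easy to confirm in isolation (again a hedged toy, not the client code): cancelling a task that is blocked on a plain `await task` also cancels the awaited task, so `.done()` is already True by the time the waiter unwinds:

```python
import asyncio

async def session_runner() -> None:
    await asyncio.sleep(10)  # long teardown; cancelled before finishing

async def unshielded_disconnect(task: asyncio.Task) -> None:
    await task  # no shield: cancelling us cancels `task` too

async def demo() -> tuple[bool, bool]:
    session = asyncio.create_task(session_runner())
    disconnect = asyncio.create_task(unshielded_disconnect(session))
    await asyncio.sleep(0.01)  # let both tasks start waiting
    disconnect.cancel()
    await asyncio.wait({disconnect, session})
    # Pre-fix invariant: once the disconnect is finished, the session
    # task is done (cancelled), so a concurrent reconnect would
    # correctly start a fresh runner instead of reusing this one.
    return session.done(), session.cancelled()

print(asyncio.run(demo()))  # (True, True)
```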
The cleaner fix: match the existing pattern in this same function
There's already a working shielded-wait-with-timeout pattern a few lines up at client.py:570-579 for the connection-startup cancellation case:
```python
with anyio.CancelScope(shield=True):
    with anyio.move_on_after(3):
        try:
            await session_task
        except asyncio.CancelledError:
            pass
```

Unlike `asyncio.shield`, `anyio.CancelScope(shield=True)` actually waits for the inner await to complete rather than re-raising cancellation immediately. By the time you release the lock, `session_task.done()` is True and the concurrent-reconnect race is closed. It also keeps this function to a single cancellation pattern instead of two.
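To see the difference the shielded wait makes without pulling in anyio, here is a pure-asyncio emulation of "shield, but actually keep waiting, with a timeout". `shielded_wait` is an illustrative helper sketching the same semantics, not anyio's implementation:

```python
import asyncio

async def shielded_wait(task: asyncio.Task, timeout: float) -> None:
    # Emulates anyio.CancelScope(shield=True) + move_on_after(timeout):
    # keep waiting for `task` even if we are cancelled, but give up
    # after `timeout` seconds.
    waiter = asyncio.ensure_future(asyncio.wait({task}, timeout=timeout))
    while not waiter.done():
        try:
            await asyncio.shield(waiter)
        except asyncio.CancelledError:
            pass  # swallow caller cancellation and keep waiting

async def session_runner() -> None:
    await asyncio.sleep(0.05)  # brief teardown that we must wait out

async def demo() -> bool:
    session = asyncio.create_task(session_runner())
    disconnect = asyncio.create_task(shielded_wait(session, 3.0))
    await asyncio.sleep(0.01)
    disconnect.cancel()  # caller cancellation mid-wait
    await asyncio.wait({disconnect})
    # Unlike a bare asyncio.shield, the wait completed first, so the
    # session task is done before any lock would be released.
    return session.done()

print(asyncio.run(demo()))  # True
```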
Applied to the disconnect path:
```python
session_task = self._session_state.session_task
self._session_state.stop_event.set()
try:
    if force:
        with anyio.CancelScope(shield=True):
            with anyio.move_on_after(self._disconnect_timeout):
                try:
                    await session_task
                except asyncio.CancelledError:
                    pass
    else:
        await session_task
finally:
    self._session_state.session_task = None
```

Please also add a regression test covering the race directly: spawn a force-close, cancel it mid-await, then immediately attempt a reconnect, and assert the new session actually works (i.e., `_connect` started a fresh runner rather than reusing the dying one). The existing `test_multi_client_force_close` doesn't exercise the cancellation path.
Other notes
- Your Validation section mentions the focused test "still emitted the existing subprocess/pytest-timeout warning shape as a warning, not a failure." Could you characterize that warning? If it's pre-existing noise unrelated to this change, say so explicitly; if it's new or related, the fix should address it.
- CI isn't running yet (only label triage is green). I'll approve workflows once the above is addressed.
Generated with Codex.
Thanks, this was exactly the missing edge case. I pushed an update. On the warning: I rechecked it in a separate Validation run on the updated branch.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8fcc26eb4b
```python
else:
    await session_task
finally:
    self._session_state.session_task = None
```
Keep session task tracked until shutdown actually finishes
This finally clears self._session_state.session_task even when the force-disconnect wait exits via move_on_after(...) timeout, so _connect() can start a new runner before the previous runner has fully unwound. If the old runner continues cleanup (for example, a transport that shields or retries cancellation during connect_session teardown), it can still execute _context_manager's finally and call _reset_session_state(), which can null out the newly established session. That creates an intermittent reconnect failure where later client calls see Client is not connected despite a successful reconnect.
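This hazard is the classic shared-slot teardown race, and it can be shown with a toy model. The names below are hypothetical stand-ins (the real paths go through `_session_runner` and `_reset_session_state`): an old runner's unconditional `finally` reset nulls state that a newer runner already owns:

```python
import asyncio

class SessionState:
    session = None  # shared slot both runners write to

async def runner(state: SessionState, name: str, release: asyncio.Event) -> None:
    state.session = name
    try:
        await release.wait()
    finally:
        state.session = None  # unconditional reset clobbers successors

async def demo():
    state = SessionState()
    release_old = asyncio.Event()
    old = asyncio.create_task(runner(state, "old", release_old))
    await asyncio.sleep(0)
    # Timeout path: stop tracking the old runner without waiting for it
    # to finish, then start a new runner immediately.
    new = asyncio.create_task(runner(state, "new", asyncio.Event()))
    await asyncio.sleep(0)
    release_old.set()  # old runner now finishes its teardown...
    await old
    result = state.session  # ...and has nulled the *new* session
    new.cancel()
    await asyncio.wait({new})
    return result

print(asyncio.run(demo()))  # None
```

The new runner is still alive, yet `state.session` is `None`: exactly the intermittent "Client is not connected after a successful reconnect" failure described above.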
* Harden forced client disconnect cleanup
* Handle cancelled force-close waits

Generated with Codex.

---------

Co-authored-by: Jeremiah Lowin <[email protected]>
This fixes forced `Client.close()` cleanup when caller cancellation arrives while the client is waiting for the background session task to unwind. The forced disconnect path now uses the same `anyio.CancelScope(shield=True)` + `move_on_after()` pattern used during connection-startup cancellation, then clears `session_task` before releasing the session lock so a concurrent reconnect cannot reuse a dying runner.

The regression test pauses session teardown, cancels a force-close mid-wait, immediately attempts a reconnect, and asserts the reconnect gets a fresh session task and can call `list_tools()`.

Validation:

- `uv run pytest tests/client/client/test_client.py::test_force_close_cancelled_wait_starts_fresh_session -q`
- focused neighboring cancellation tests
- `uv run pytest tests/client/client/test_client.py -q`
- `uv run pytest tests/test_mcp_config.py::test_multi_client_force_close -q`
- `uv run pytest -n auto tests/test_mcp_config.py`
- `uv run pytest -n auto`
- `uv run prek run --all-files`

Warning note: the full `tests/test_mcp_config.py` run under xdist still emits one `PytestUnraisableExceptionWarning` from `BaseSubprocessTransport.__del__` in `test_multi_client_lifespan`. I checked an `upstream/main` baseline in a separate worktree: the same subprocess/event-loop-closed warning class is already present there, and the baseline still fails `test_multi_client_force_close` with the original cancellation bug. The focused force-close test now passes without that warning.