Harden forced client disconnect cleanup#3885

Merged
jlowin merged 3 commits into PrefectHQ:main from vonbai:goalx/forced-close-cleanup
Apr 13, 2026
Conversation


@vonbai vonbai commented Apr 13, 2026

This fixes forced Client.close() cleanup when caller cancellation arrives while the client is waiting for the background session task to unwind. The forced disconnect path now uses the same anyio.CancelScope(shield=True) + move_on_after() pattern used during connection-startup cancellation, then clears session_task before releasing the session lock so a concurrent reconnect cannot reuse a dying runner.

The regression test pauses session teardown, cancels a force-close mid-wait, immediately attempts a reconnect, and asserts the reconnect gets a fresh session task and can call list_tools().

Validation:

  • uv run pytest tests/client/client/test_client.py::test_force_close_cancelled_wait_starts_fresh_session -q
  • focused neighboring cancellation tests
  • uv run pytest tests/client/client/test_client.py -q
  • uv run pytest tests/test_mcp_config.py::test_multi_client_force_close -q
  • uv run pytest -n auto tests/test_mcp_config.py
  • uv run pytest -n auto
  • uv run prek run --all-files

Warning note: the full tests/test_mcp_config.py run under xdist still emits one PytestUnraisableExceptionWarning from BaseSubprocessTransport.__del__ in test_multi_client_lifespan. I checked an upstream/main baseline in a separate worktree: the same subprocess/event-loop-closed warning class is already present there, and the baseline still fails test_multi_client_force_close with the original cancellation bug. The focused force-close test now passes without that warning.

@marvin-context-protocol marvin-context-protocol Bot added bug Something isn't working. Reports of errors, unexpected behavior, or broken functionality. client Related to the FastMCP client SDK or client-side functionality. labels Apr 13, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2a105b8b91

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread: src/fastmcp/client/client.py, lines +649 to +650 (outdated)
if force:
    await asyncio.shield(session_task)


P1: Prevent reconnect from reusing a shutting-down session task

Using await asyncio.shield(session_task) here means a cancelled Client.close() no longer cancels the background session task; _disconnect(force=True) can then exit with stop_event already set while session_task is still running. In that window, _connect() treats the task as active (session_task is not None and not done) and returns without starting a fresh session. A concurrent reconnect can therefore attach to a session that is mid-teardown and about to reset self._session_state.session to None, causing follow-up client calls to fail. This regression is specific to the forced-close cancellation races introduced by the shielding.



@jlowin jlowin left a comment


Thanks @vonbai — the diagnosis here is right. The unshielded await self._session_state.session_task does propagate caller-cancellation into the session task itself, cutting short _session_runner's ability to process stop_event, unwind its AsyncExitStack, and close the transport cleanly. That's worth fixing.

On the asyncio-vs-anyio question: session_task being an asyncio.Task is intentional — there's an existing comment at client.py:561-563 explaining that it has to outlive individual context manager scopes, which anyio's structured concurrency won't allow. So asyncio.shield + asyncio.CancelledError on this specific task is the right family of primitives, not an oversight.

The blocker: Codex's P1 race is real

The shield prevents session_task from being cancelled when the caller is cancelled mid-await. Combined with catching CancelledError for the force path, that opens a window where:

  1. Caller is cancelled during await asyncio.shield(session_task)
  2. CancelledError is caught and swallowed (force + task not cancelled)
  3. finally: runs — session_task.done() is False (still unwinding in the background), so the field isn't cleared
  4. _disconnect returns, lock releases
  5. Concurrent _connect() acquires the lock, sees session_task is not None and not session_task.done(), treats the dying session as live, increments nesting_counter, and doesn't start a fresh runner
  6. Original session task finishes, cleanup runs, transport resets
  7. The "reused" session is stale — subsequent calls fail

The pre-fix behavior avoided this because an unshielded cancelled await would cancel session_task too, making .done() True before the lock released. The shield breaks that invariant.
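The difference is easy to reproduce in isolation. Here is a minimal sketch (plain asyncio, none of the fastmcp internals) showing that cancelling an unshielded await cancels the awaited task, while asyncio.shield leaves it running past the cancellation:

```python
import asyncio

async def worker() -> None:
    # Stands in for the background session task unwinding its stack.
    await asyncio.sleep(10)

async def demo(shielded: bool) -> bool:
    task = asyncio.create_task(worker())

    async def close() -> None:
        if shielded:
            await asyncio.shield(task)
        else:
            await task

    closer = asyncio.create_task(close())
    await asyncio.sleep(0)  # let close() reach its await
    closer.cancel()         # caller cancellation arrives mid-await
    try:
        await closer
    except asyncio.CancelledError:
        pass
    done = task.done()      # did cancellation propagate into the task?
    if not done:
        task.cancel()
        try:
            await task
        except asyncio.CancelledError:
            pass
    return done

unshielded_done = asyncio.run(demo(shielded=False))
shielded_done = asyncio.run(demo(shielded=True))
print(unshielded_done, shielded_done)  # True False
```

The unshielded case ends with task.done() True before the waiter's lock would release; the shielded case ends with the task still running in the background, which is exactly the window the P1 comment describes.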

The cleaner fix: match the existing pattern in this same function

There's already a working shielded-wait-with-timeout pattern a few lines up at client.py:570-579 for the connection-startup cancellation case:

with anyio.CancelScope(shield=True):
    with anyio.move_on_after(3):
        try:
            await session_task
        except asyncio.CancelledError:
            pass

Unlike asyncio.shield, anyio.CancelScope(shield=True) actually waits for the inner await to complete rather than re-raising cancellation immediately. By the time you release the lock, session_task.done() is True and the concurrent-reconnect race is closed. It also keeps this function to a single cancellation pattern instead of two.

Applied to the disconnect path:

session_task = self._session_state.session_task
self._session_state.stop_event.set()
try:
    if force:
        with anyio.CancelScope(shield=True):
            with anyio.move_on_after(self._disconnect_timeout):
                try:
                    await session_task
                except asyncio.CancelledError:
                    pass
    else:
        await session_task
finally:
    self._session_state.session_task = None

Please also add a regression test covering the race directly: spawn a force-close, cancel it mid-await, then immediately attempt a reconnect, and assert the new session actually works (i.e., _connect started a fresh runner rather than reusing the dying one). The existing test_multi_client_force_close doesn't exercise the cancellation path.

Other notes

  • Your Validation section mentions the focused test "still emitted the existing subprocess/pytest-timeout warning shape as a warning, not a failure." Could you characterize that warning? If it's pre-existing noise unrelated to this change, say so explicitly; if it's new or related, the fix should address it.
  • CI isn't running yet (only label triage is green). I'll approve workflows once the above is addressed.

Generated with Codex.

vonbai commented Apr 13, 2026

Thanks, this was exactly the missing edge case.

I pushed 48fc3b56, which changes the forced disconnect path to use the existing anyio.CancelScope(shield=True) + move_on_after() pattern and clears session_task before releasing the session lock. I also added a direct regression test that pauses session teardown, cancels client.close() mid-wait, immediately reconnects, and asserts the reconnect gets a fresh session task and can call list_tools().

On the warning: I rechecked it in a separate upstream/main worktree. The focused force-close test passes without the warning on the updated branch, while the full tests/test_mcp_config.py xdist run still has a BaseSubprocessTransport.__del__ / event-loop-closed warning in test_multi_client_lifespan. The same warning class is present on upstream main, where test_multi_client_force_close still fails with the original cancellation bug, so I’ve left that subprocess cleanup warning out of this PR’s scope.

Validation run on the updated branch:

  • uv run pytest tests/client/client/test_client.py::test_force_close_cancelled_wait_starts_fresh_session -q
  • neighboring client cancellation tests
  • uv run pytest tests/client/client/test_client.py -q
  • uv run pytest tests/test_mcp_config.py::test_multi_client_force_close -q
  • uv run pytest -n auto tests/test_mcp_config.py
  • uv run pytest -n auto (5400 passed, 3 skipped, 14 xfailed, 1 pre-existing warning)
  • uv run prek run --all-files


@jlowin jlowin left a comment


Thanks!

@jlowin jlowin merged commit 279b601 into PrefectHQ:main Apr 13, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8fcc26eb4b


else:
    await session_task
finally:
    self._session_state.session_task = None


P2: Keep session task tracked until shutdown actually finishes

This finally clears self._session_state.session_task even when the force-disconnect wait exits via move_on_after(...) timeout, so _connect() can start a new runner before the previous runner has fully unwound. If the old runner continues cleanup (for example, a transport that shields or retries cancellation during connect_session teardown), it can still execute _context_manager's finally and call _reset_session_state(), which can null out the newly established session. That creates an intermittent reconnect failure where later client calls see Client is not connected despite a successful reconnect.


strawgate pushed a commit that referenced this pull request Apr 14, 2026
* Harden forced client disconnect cleanup

* Handle cancelled force-close waits

Generated with Codex.

---------

Co-authored-by: Jeremiah Lowin <[email protected]>
