fix(server): tear down startup resources on lifespan failure (closes #404)#406
Merged
fix(server): tear down startup resources on lifespan failure (closes #404)#406
Conversation
…404) Problem (follow-up to #400): ``app_lifespan`` allocates ``webhook_mgr``, ``ctx/components``, ``watcher``, ``scheduler``, ``policy_scheduler``, and ``watchdog`` before ``yield``. The ``finally`` block behind ``yield`` only runs when ``yield`` was reached — so if any step between the first alloc and ``yield`` raised, everything already allocated leaked. ``ensure_initialized`` has its own internal cleanup (PR #400), which covers the DB handles it opened, but the leak still hits for ``webhook_mgr``, ``watcher``, ``scheduler``, ``policy_scheduler``, and ``watchdog`` (the last one just-constructed before ``start()`` raises). Fix: Factor the teardown sequence into ``_teardown_startup_resources`` and call it from both paths: the normal shutdown ``finally`` after ``yield``, and a new startup-level ``try/except BaseException`` that runs teardown before re-raising. Teardown order matches the inverse of allocation. Each step swallows its own exception so a late failure doesn't skip earlier steps — shutdown must always complete whatever it can. Resolves the Phase 2/3 blocker called out in #404 for #399's lazy-init refactor: once the DB open moves into request handlers, a late allocation failure (e.g. scheduler.start) is more likely than in eager-init, and this leak becomes user-visible (stale flock on ``~/.memtomem/memtomem.db``). Tests: - 4 unit tests for ``_teardown_startup_resources`` (all-None noop, reverse-order, error tolerance, ctx-last invariant). - 1 integration test that injects a failing watchdog ``start()`` and asserts all prior allocations get torn down in reverse order and the exception re-raises. Co-Authored-By: Claude <[email protected]>
Review found two issues with the #404 teardown helper: 1. Docstring said "reverse-allocation order" but the actual sequence stops ``webhook_mgr`` at position 4, before ``watcher`` and ``ctx``. This was inherited from the pre-#404 shutdown block and is correct on its merits (webhook has no storage dependency, closing it early drops retries during the slower component teardown) — but the docstring obscured the reasoning. Now documented explicitly. 2. ``except Exception`` doesn't catch ``asyncio.CancelledError`` on Python 3.8+ (it's a ``BaseException`` subclass now). Under task cancellation, teardown would unwind without a log entry — and worse, a cancellation mid-teardown inside an ``except BaseException`` caller could shadow the original startup exception. Changed to ``except BaseException`` with a selective re-raise for ``CancelledError`` that first logs the stage. Also: - Log messages unified on "Shutdown step '<stage>' failed" / "Shutdown step '<stage>' cancelled" so ops greps return the whole teardown surface instead of mixing "Failed to stop X" with "Shutdown step 'Y' failed". - New test ``test_teardown_reraises_cancelled_error`` pins the cancel re-raise + skip-later-steps behavior. - Existing ordering test renamed to ``_matches_documented_sequence`` so the name doesn't claim "reverse allocation" the docstring just clarified isn't the case. Co-Authored-By: Claude <[email protected]>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #404 (follow-up to #400 / phase 2 blocker for #399).
Problem
app_lifespanallocates six resources beforeyield:webhook_mgr(if enabled)ctx(AppContext)comp(viactx.ensure_initialized()— opens DB/embedder)watcher(FileWatcher.start())scheduler+policy_scheduler(if enabled)watchdog(if enabled)The old shape only guarded the body behind
yieldwithtry/finally. If any step between the first allocation andyieldraised, thefinallyblock never ran and everything already allocated leaked.ensure_initialized()has its own internaltry/except close_components(PR #400), so the DB handles it opened are covered. But the rest —webhook_mgr,watcher,scheduler,policy_scheduler, and the just-constructedwatchdog— all leaked if startup failed after them.This becomes a Phase 2/3 blocker for #399's lazy-init refactor: once the DB open moves into request handlers, a late allocation failure (e.g.
scheduler.start()racing against network state) becomes more likely than it is in eager-init, and a staleflockon~/.memtomem/memtomem.dbbecomes user-visible.Fix
Factor the teardown sequence into
_teardown_startup_resourcesand call it from both paths:try/finallyafteryield, same as before, now routed through the helper.try/except BaseExceptionaround the entire allocation block that runs teardown before re-raising.Teardown order matches the inverse of allocation. Each step swallows its own exception so a late failure doesn't skip earlier steps — shutdown must always complete whatever it can.
Note on
ctx.close()in the teardown path:AppContext.close()is idempotent (_components is None→ no-op) andensure_initialized()'s internal rollback leaves_componentsunset on failure, so callingctx.close()from_teardown_startup_resourcesafter anensure_initializedfailure is a safe no-op. No double-close.Tests
_teardown_startup_resources):watchdog→ … →ctx).ctx.close()is always the final step.test_startup_failure_tears_down_prior_resources):HealthWatchdog.start().ensure_initialized,watcher,scheduler,policy_scheduler,webhook_mgr, andwatchdog.stop()all run in reverse order.RuntimeError("watchdog boom")re-raises.Test plan
uv run pytest packages/memtomem/tests/test_server_lifespan.py: 5 passed.uv run pytest -m "not ollama": 2168 passed (no regressions).uv run ruff check+ruff format --check: clean.uv run mypy packages/memtomem/src/memtomem/server/lifespan.py: clean.Not in scope
webhook_mgrconstruction is storage-free and currently cannot raise in practice; we still route it through the same teardown seam for consistency.