fix(server): tear down startup resources on lifespan failure (closes #404) by memtomem · Pull Request #406 · memtomem/memtomem

memtomem · 2026-04-23T03:34:37Z

Closes #404 (follow-up to #400 / phase 2 blocker for #399).

Problem

app_lifespan allocates six resources before yield:

webhook_mgr (if enabled)
ctx (AppContext)
comp (via ctx.ensure_initialized() — opens DB/embedder)
watcher (FileWatcher.start())
scheduler + policy_scheduler (if enabled)
watchdog (if enabled)

The old shape only guarded the body behind yield with try/finally. If any step between the first allocation and yield raised, the finally block never ran and everything already allocated leaked.

ensure_initialized() has its own internal try/except close_components (PR #400), so the DB handles it opened are covered. But the rest — webhook_mgr, watcher, scheduler, policy_scheduler, and the just-constructed watchdog — all leaked if startup failed after them.

This becomes a Phase 2/3 blocker for #399's lazy-init refactor: once the DB open moves into request handlers, a late allocation failure (e.g. scheduler.start() racing against network state) becomes more likely than it is in eager-init, and a stale flock on ~/.memtomem/memtomem.db becomes user-visible.

Fix

Factor the teardown sequence into _teardown_startup_resources and call it from both paths:

Normal shutdown — try/finally after yield, same as before, now routed through the helper.
Startup failure — new try/except BaseException around the entire allocation block that runs teardown before re-raising.

Teardown order matches the inverse of allocation. Each step swallows its own exception so a late failure doesn't skip earlier steps — shutdown must always complete whatever it can.

Note on ctx.close() in the teardown path: AppContext.close() is idempotent (_components is None → no-op) and ensure_initialized()'s internal rollback leaves _components unset on failure, so calling ctx.close() from _teardown_startup_resources after an ensure_initialized failure is a safe no-op. No double-close.

Tests

Unit (_teardown_startup_resources):
- All-None input → no-op, no error.
- All present → reverse-allocation order (watchdog → … → ctx).
- Intermediate failure → later steps still run.
- ctx.close() is always the final step.
Integration (test_startup_failure_tears_down_prior_resources):
- Inject a failing HealthWatchdog.start().
- Verify ensure_initialized, watcher, scheduler, policy_scheduler, webhook_mgr, and watchdog.stop() all run in reverse order.
- Verify the original RuntimeError("watchdog boom") re-raises.

Test plan

uv run pytest packages/memtomem/tests/test_server_lifespan.py: 5 passed.
uv run pytest -m "not ollama": 2168 passed (no regressions).
uv run ruff check + ruff format --check: clean.
uv run mypy packages/memtomem/src/memtomem/server/lifespan.py: clean.

Not in scope

Actually moving to lazy init (Phase 2/3 of Lazy DB init: defer ~/.memtomem/memtomem.db creation until first tool call #399) — this PR only closes the teardown gap so the subsequent phases can land safely.
webhook_mgr construction is storage-free and currently cannot raise in practice; we still route it through the same teardown seam for consistency.

…404) Problem (follow-up to #400): ``app_lifespan`` allocates ``webhook_mgr``, ``ctx/components``, ``watcher``, ``scheduler``, ``policy_scheduler``, and ``watchdog`` before ``yield``. The ``finally`` block behind ``yield`` only runs when ``yield`` was reached — so if any step between the first alloc and ``yield`` raised, everything already allocated leaked. ``ensure_initialized`` has its own internal cleanup (PR #400), which covers the DB handles it opened, but the leak still hits for ``webhook_mgr``, ``watcher``, ``scheduler``, ``policy_scheduler``, and ``watchdog`` (the last one just-constructed before ``start()`` raises). Fix: Factor the teardown sequence into ``_teardown_startup_resources`` and call it from both paths: the normal shutdown ``finally`` after ``yield``, and a new startup-level ``try/except BaseException`` that runs teardown before re-raising. Teardown order matches the inverse of allocation. Each step swallows its own exception so a late failure doesn't skip earlier steps — shutdown must always complete whatever it can. Resolves the Phase 2/3 blocker called out in #404 for #399's lazy-init refactor: once the DB open moves into request handlers, a late allocation failure (e.g. scheduler.start) is more likely than in eager-init, and this leak becomes user-visible (stale flock on ``~/.memtomem/memtomem.db``). Tests: - 4 unit tests for ``_teardown_startup_resources`` (all-None noop, reverse-order, error tolerance, ctx-last invariant). - 1 integration test that injects a failing watchdog ``start()`` and asserts all prior allocations get torn down in reverse order and the exception re-raises. Co-Authored-By: Claude <[email protected]>

Review found two issues with the #404 teardown helper: 1. Docstring said "reverse-allocation order" but the actual sequence stops ``webhook_mgr`` at position 4, before ``watcher`` and ``ctx``. This was inherited from the pre-#404 shutdown block and is correct on its merits (webhook has no storage dependency, closing it early drops retries during the slower component teardown) — but the docstring obscured the reasoning. Now documented explicitly. 2. ``except Exception`` doesn't catch ``asyncio.CancelledError`` on Python 3.8+ (it's a ``BaseException`` subclass now). Under task cancellation, teardown would unwind without a log entry — and worse, a cancellation mid-teardown inside an ``except BaseException`` caller could shadow the original startup exception. Changed to ``except BaseException`` with a selective re-raise for ``CancelledError`` that first logs the stage. Also: - Log messages unified on "Shutdown step '<stage>' failed" / "Shutdown step '<stage>' cancelled" so ops greps return the whole teardown surface instead of mixing "Failed to stop X" with "Shutdown step 'Y' failed". - New test ``test_teardown_reraises_cancelled_error`` pins the cancel re-raise + skip-later-steps behavior. - Existing ordering test renamed to ``_matches_documented_sequence`` so the name doesn't claim "reverse allocation" the docstring just clarified isn't the case. Co-Authored-By: Claude <[email protected]>

pandas-studio and others added 2 commits April 23, 2026 12:34

memtomem merged commit 9d03f53 into main Apr 23, 2026
7 checks passed

github-actions Bot locked and limited conversation to collaborators Apr 23, 2026

memtomem deleted the fix/lifespan-startup-teardown branch April 27, 2026 14:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(server): tear down startup resources on lifespan failure (closes #404)#406

fix(server): tear down startup resources on lifespan failure (closes #404)#406
memtomem merged 2 commits intomainfrom
fix/lifespan-startup-teardown

memtomem commented Apr 23, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

memtomem commented Apr 23, 2026

Problem

Fix

Tests

Test plan

Not in scope

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants