Skip to content

fix(server): tear down startup resources on lifespan failure (closes #404)#406

Merged
memtomem merged 2 commits intomainfrom
fix/lifespan-startup-teardown
Apr 23, 2026
Merged

fix(server): tear down startup resources on lifespan failure (closes #404)#406
memtomem merged 2 commits intomainfrom
fix/lifespan-startup-teardown

Conversation

@memtomem
Copy link
Copy Markdown
Owner

Closes #404 (follow-up to #400 / phase 2 blocker for #399).

Problem

app_lifespan allocates six resources before yield:

  1. webhook_mgr (if enabled)
  2. ctx (AppContext)
  3. comp (via ctx.ensure_initialized() — opens DB/embedder)
  4. watcher (FileWatcher.start())
  5. scheduler + policy_scheduler (if enabled)
  6. watchdog (if enabled)

The old shape only guarded the body behind yield with try/finally. If any step between the first allocation and yield raised, the finally block never ran and everything already allocated leaked.

ensure_initialized() has its own internal try/except close_components (PR #400), so the DB handles it opened are covered. But the rest — webhook_mgr, watcher, scheduler, policy_scheduler, and the just-constructed watchdog — all leaked if startup failed after them.

This becomes a Phase 2/3 blocker for #399's lazy-init refactor: once the DB open moves into request handlers, a late allocation failure (e.g. scheduler.start() racing against network state) becomes more likely than it is in eager-init, and a stale flock on ~/.memtomem/memtomem.db becomes user-visible.

Fix

Factor the teardown sequence into _teardown_startup_resources and call it from both paths:

  • Normal shutdowntry/finally after yield, same as before, now routed through the helper.
  • Startup failure — new try/except BaseException around the entire allocation block that runs teardown before re-raising.

Teardown order matches the inverse of allocation. Each step swallows its own exception so a late failure doesn't skip earlier steps — shutdown must always complete whatever it can.

Note on ctx.close() in the teardown path: AppContext.close() is idempotent (_components is None → no-op) and ensure_initialized()'s internal rollback leaves _components unset on failure, so calling ctx.close() from _teardown_startup_resources after an ensure_initialized failure is a safe no-op. No double-close.

Tests

  • Unit (_teardown_startup_resources):
    • All-None input → no-op, no error.
    • All present → reverse-allocation order (watchdog → … → ctx).
    • Intermediate failure → later steps still run.
    • ctx.close() is always the final step.
  • Integration (test_startup_failure_tears_down_prior_resources):
    • Inject a failing HealthWatchdog.start().
    • Verify ensure_initialized, watcher, scheduler, policy_scheduler, webhook_mgr, and watchdog.stop() all run in reverse order.
    • Verify the original RuntimeError("watchdog boom") re-raises.

Test plan

  • uv run pytest packages/memtomem/tests/test_server_lifespan.py: 5 passed.
  • uv run pytest -m "not ollama": 2168 passed (no regressions).
  • uv run ruff check + ruff format --check: clean.
  • uv run mypy packages/memtomem/src/memtomem/server/lifespan.py: clean.

Not in scope

pandas-studio and others added 2 commits April 23, 2026 12:34
…404)

Problem (follow-up to #400):

``app_lifespan`` allocates ``webhook_mgr``, ``ctx/components``,
``watcher``, ``scheduler``, ``policy_scheduler``, and ``watchdog``
before ``yield``. The ``finally`` block behind ``yield`` only runs
when ``yield`` was reached — so if any step between the first alloc
and ``yield`` raised, everything already allocated leaked.

``ensure_initialized`` has its own internal cleanup (PR #400), which
covers the DB handles it opened, but the leak still hits for
``webhook_mgr``, ``watcher``, ``scheduler``, ``policy_scheduler``, and
``watchdog`` (the last one just-constructed before ``start()`` raises).

Fix:

Factor the teardown sequence into ``_teardown_startup_resources`` and
call it from both paths: the normal shutdown ``finally`` after
``yield``, and a new startup-level ``try/except BaseException`` that
runs teardown before re-raising. Teardown order matches the inverse
of allocation. Each step swallows its own exception so a late failure
doesn't skip earlier steps — shutdown must always complete whatever
it can.

Resolves the Phase 2/3 blocker called out in #404 for #399's lazy-init
refactor: once the DB open moves into request handlers, a late
allocation failure (e.g. scheduler.start) is more likely than in
eager-init, and this leak becomes user-visible (stale flock on
``~/.memtomem/memtomem.db``).

Tests:

- 4 unit tests for ``_teardown_startup_resources`` (all-None noop,
  reverse-order, error tolerance, ctx-last invariant).
- 1 integration test that injects a failing watchdog ``start()`` and
  asserts all prior allocations get torn down in reverse order and
  the exception re-raises.

Co-Authored-By: Claude <[email protected]>
Review found two issues with the #404 teardown helper:

1. Docstring said "reverse-allocation order" but the actual sequence
   stops ``webhook_mgr`` at position 4, before ``watcher`` and ``ctx``.
   This was inherited from the pre-#404 shutdown block and is correct
   on its merits (webhook has no storage dependency, closing it early
   drops retries during the slower component teardown) — but the
   docstring obscured the reasoning. Now documented explicitly.

2. ``except Exception`` doesn't catch ``asyncio.CancelledError`` on
   Python 3.8+ (it's a ``BaseException`` subclass now). Under task
   cancellation, teardown would unwind without a log entry — and worse,
   a cancellation mid-teardown inside an ``except BaseException``
   caller could shadow the original startup exception. Changed to
   ``except BaseException`` with a selective re-raise for
   ``CancelledError`` that first logs the stage.

Also:

- Log messages unified on "Shutdown step '<stage>' failed" /
  "Shutdown step '<stage>' cancelled" so ops greps return the whole
  teardown surface instead of mixing "Failed to stop X" with
  "Shutdown step 'Y' failed".
- New test ``test_teardown_reraises_cancelled_error`` pins the cancel
  re-raise + skip-later-steps behavior.
- Existing ordering test renamed to ``_matches_documented_sequence``
  so the name doesn't claim "reverse allocation" the docstring just
  clarified isn't the case.

Co-Authored-By: Claude <[email protected]>
@memtomem memtomem merged commit 9d03f53 into main Apr 23, 2026
7 checks passed
@github-actions github-actions Bot locked and limited conversation to collaborators Apr 23, 2026
@memtomem memtomem deleted the fix/lifespan-startup-teardown branch April 27, 2026 14:56
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

server lifespan: resource leak if watcher/scheduler/policy_scheduler/watchdog startup raises before yield (#400 follow-up)

2 participants