fix(server): truncate pid file only after acquiring flock#476
Merged
Conversation
Collaborator
Author
Review
Correctness — 견고함
Code quality / style
Tests
Performance / Security
작은 개선 제안 (non-blocking)
Verdictflock-open-mode 메모리 규칙의 정확한 적용이고, regression 증명 + compatibility audit 모두 갖춤. 머지 적합. 위 두 nit은 follow-up commit 으로 묶거나 그대로 두어도 됨. |
memtomem
approved these changes
Apr 25, 2026
Owner
memtomem
left a comment
There was a problem hiding this comment.
Reviewed inline; LGTM per the analysis comment above.
`server/__init__.py:main` opened the runtime pid file with
`open(pid_file, "w")`, which truncates the file at *open* time —
before the `fcntl.flock(LOCK_EX | LOCK_NB)` probe decides whether
this process is the lock holder. When a second `memtomem-server`
starts while the first is still running (multiple Claude Code /
Codex / Gemini MCP clients spawning the binary in parallel, or any
restart race), the second process truncates the live server's
pid file to zero bytes, then bails on the flock probe and falls
through. The first server keeps running, but the recorded pid is
gone from disk.
The user-visible symptom: `mm uninstall` reports
Server still running (pid None). Refusing to delete state ...
The refusal itself is correct — flock IS held by the live writer
and removing the WAL would risk corruption — but `pid None` loses
the diagnostic value of the recorded process identity. `lsof
<pidfile>` and `ps -p <pid>` both stop working as a way to find
the holder. The legacy compatibility helper
`_try_hold_legacy_flock` already uses the safe `"a+b"` pattern, so
this is just the primary lock acquisition that never got the same
treatment; the open-mode race has been latent since the pid lock
was introduced (`0597de4`, predates the #412 path relocation).
Fix: open with `"a+"` (append, no truncate at open), acquire the
flock, then `seek(0) + truncate() + write(pid)`. The file content
is only mutated after the lock decision, so a contended start
leaves the live holder's pid intact.
Also tightens the `mm uninstall` refusal message: when the pid
field is None but the flock is held (the truncate-race fingerprint,
plus partial-write crashes during startup), surface a dedicated
"pid unknown" branch that points the user at `lsof <pidfile>` so
they can identify the writer without another diagnostic round-trip.
Regression tests pin both behaviors:
- `test_contended_server_start_preserves_pid_file_content` spawns
the server with an outside `LOCK_EX` already held on the pid
file and asserts the pre-seeded content survives. Verified to
fail on the pre-fix code with `Got: ''`.
- `test_refuses_with_unknown_pid_branch_when_pid_file_empty`
pins the new uninstall message branch via empty content + held
flock. Verified to fail on the pre-fix code with the exact
`Server still running (pid None)` string the user reported.
Co-authored-by: pandas-studio <[email protected]>
Co-authored-by: Claude <[email protected]>
`_probe_pid_file` only returns ``alive=True`` with the ``pid_file``
field populated; the ``pid_file=None`` return path is paired with
``alive=False``. So inside ``if server.alive: if server.pid is None:``
the ``server.pid_file is not None`` guard always passes and the
``ps aux | grep memtomem`` arm is dead code.
Per CLAUDE.md ("don't add fallbacks for scenarios that can't happen,
trust internal code guarantees"), drop the ternary and reference
``server.pid_file`` directly. Note the invariant in a comment so a
future ``_probe_pid_file`` change that relaxes it surfaces the
contract violation here.
Tests unchanged — the existing ``test_refuses_with_unknown_pid_branch_
when_pid_file_empty`` already pins the ``lsof`` substring, which is
exactly what the simplified branch emits.
Co-authored-by: pandas-studio <[email protected]>
Co-authored-by: Claude <[email protected]>
7a7de01 to
6ac3360
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
server/__init__.py:mainopened the runtime pid file withopen(pid_file, "w"), which truncates at open time — before thefcntl.flock(LOCK_EX | LOCK_NB)probe decides whether this process is the lock holder. When a secondmemtomem-serverstarts while the first is still running (multiple Claude Code / Codex / Gemini MCP clients spawning in parallel, or any restart race), the second process zeros out the live server's pid file, then bails on the flock probe and falls through. The first server keeps running, but the recorded pid is gone from disk.The user-visible symptom is
mm uninstallreporting:The refusal is correct (flock IS held, removing the WAL would risk corruption) — but
pid Noneloses the diagnostic value of the recorded process identity.lsof <pidfile>andps -p <pid>both stop working as a way to find the holder.The legacy compatibility helper
_try_hold_legacy_flockalready uses the safe"a+b"pattern, so this is just the primary lock acquisition that never got the same treatment. The open-mode race has been latent since the pid lock was introduced (0597de4, predates the #412 path relocation).Fix
server/__init__.py: open with"a+"(no truncate at open), acquire the flock, thenseek(0) + truncate() + write(pid). File content is only mutated after the lock decision.cli/uninstall_cmd.py: when the recorded pid isNonebut the flock IS held (truncate-race fingerprint, plus partial-write crashes during startup), surface a dedicated "pid unknown" branch that points the user atlsof <pidfile>instead of the misleadingpid None.Test plan
uv run pytest packages/memtomem/tests/ -m "not ollama"— 2443 passeduv run ruff check packages/memtomem/src packages/memtomem/tests— cleanuv run ruff format --check ...— cleanuv run mypy packages/memtomem/src/memtomem/server/__init__.py packages/memtomem/src/memtomem/cli/uninstall_cmd.py— no issuesGot: ''for the truncated content;Server still running (pid None)literal for the user-reported message). Tests pass after restoring the fix.Compatibility audit
Verified the fix doesn't conflict with the rationale of any prior PR touching this code path:
$XDG_RUNTIME_DIR)~/.memtomem/out of the MCP handshakelstat+ owner +0o700)openLOCK_SHfor 0.1.26+ coexistence)🤖 Generated with Claude Code