Skip to content

fix(server): truncate pid file only after acquiring flock#476

Merged
memtomem merged 2 commits intomainfrom
fix/pid-file-truncate-race
Apr 25, 2026
Merged

fix(server): truncate pid file only after acquiring flock#476
memtomem merged 2 commits intomainfrom
fix/pid-file-truncate-race

Conversation

@tsdata
Copy link
Copy Markdown
Collaborator

@tsdata tsdata commented Apr 25, 2026

Summary

server/__init__.py:main opened the runtime pid file with open(pid_file, "w"), which truncates at open time — before the fcntl.flock(LOCK_EX | LOCK_NB) probe decides whether this process is the lock holder. When a second memtomem-server starts while the first is still running (multiple Claude Code / Codex / Gemini MCP clients spawning in parallel, or any restart race), the second process zeros out the live server's pid file, then bails on the flock probe and falls through. The first server keeps running, but the recorded pid is gone from disk.

The user-visible symptom is mm uninstall reporting:

Server still running (pid None). Refusing to delete state — an active server holds the SQLite WAL and deleting it risks corruption.

The refusal is correct (flock IS held, removing the WAL would risk corruption) — but pid None loses the diagnostic value of the recorded process identity. lsof <pidfile> and ps -p <pid> both stop working as a way to find the holder.

The legacy compatibility helper _try_hold_legacy_flock already uses the safe "a+b" pattern, so this is just the primary lock acquisition that never got the same treatment. The open-mode race has been latent since the pid lock was introduced (0597de4, predates the #412 path relocation).

Fix

  • server/__init__.py: open with "a+" (no truncate at open), acquire the flock, then seek(0) + truncate() + write(pid). File content is only mutated after the lock decision.
  • cli/uninstall_cmd.py: when the recorded pid is None but the flock IS held (truncate-race fingerprint, plus partial-write crashes during startup), surface a dedicated "pid unknown" branch that points the user at lsof <pidfile> instead of the misleading pid None.

Test plan

  • uv run pytest packages/memtomem/tests/ -m "not ollama" — 2443 passed
  • uv run ruff check packages/memtomem/src packages/memtomem/tests — clean
  • uv run ruff format --check ... — clean
  • uv run mypy packages/memtomem/src/memtomem/server/__init__.py packages/memtomem/src/memtomem/cli/uninstall_cmd.py — no issues
  • Regression-test parity: stashed the fix and re-ran the two new tests against pre-fix source. Both fail with the exact symptoms (Got: '' for the truncated content; Server still running (pid None) literal for the user-reported message). Tests pass after restoring the fix.

Compatibility audit

Verified the fix doesn't conflict with the rationale of any prior PR touching this code path:

PR Intent Conflict with this fix
#412 (path relocation to $XDG_RUNTIME_DIR) Keep ~/.memtomem/ out of the MCP handshake None — same target path, no new opens
#413 (runtime dir security gates: lstat + owner + 0o700) Reject hostile pre-existing dirs None — gates run before open
#437 / #439 (atexit unlink ordering on legacy pid) Stale-file cleanup None — teardown unchanged
#444 / #445 (legacy LOCK_SH for 0.1.26+ coexistence) Multiple post-#412 servers can run together None — different file

🤖 Generated with Claude Code

@tsdata
Copy link
Copy Markdown
Collaborator Author

tsdata commented Apr 25, 2026

Review

server/__init__.py:main 의 pid file open 모드를 "w""a+" 로 변경하고 flock 획득 후에만 seek(0) + truncate() + write(pid) 시퀀스를 실행. 두 번째 서버가 시작 중 첫 서버의 live pid file을 zero-out 하던 race를 닫음. mm uninstallServer still running (pid None) 메시지를 "pid unknown + lsof 안내" 분기로 분리.

Correctness — 견고함

Code quality / style

  • seek(0) + truncate() 대신 _lock_fp.truncate(0) 쪽이 더 명시적이지만, 현재 코드도 정확. 굳이 바꿀 가치는 적음.
  • 12줄짜리 인라인 코멘트 (server/__init__.py:296-308) 는 다소 길지만, 규제 사유를 미래 reader에게 전달해야 하는 invariant라 정당화 가능. uninstall_cmd.py 의 12줄 코멘트도 같은 사유.
  • # noqa: SIM115 보존됨. atexit.register LIFO 주석도 보존.
  • f"...lsof {server.pid_file}" — pid_file 경로에 공백이 들어갈 가능성은 $XDG_RUNTIME_DIR/memtomem/server.pid 이라 거의 0. 별도 quoting 불필요.

Tests

  • test_contended_server_start_preserves_pid_file_content: 정확히 regression의 fingerprint 재현. PR body의 "stashed the fix and re-ran ... fail with Got: ''" 가 fail-if-false 원칙 충족.
  • test_refuses_with_unknown_pid_branch_when_pid_file_empty: 'pid unknown' literal + 'lsof' 키워드 두 가지 모두 assert. 메시지 drift 방지에 충분.
  • Sleep tax: time.sleep(1.5) 1회. cold-start python -m memtomem.server 변동성 고려하면 합리적. 상한 wait + early exit poll 패턴 (_wait_for_pid_file 처럼) 으로 더 줄일 여지는 있지만 racy regression이 ms-scale 이라 큰 차이 없음. 그대로 두어도 무방.

Performance / Security

  • 추가 syscall 1개 (truncate) — flock 보유 중 수행되므로 contention 없음. 무시해도 되는 수준.
  • 보안: file content는 own pid (정보 누설 X). race의 결과가 zero-out 였기에 오히려 fix는 정보 보존 쪽.

작은 개선 제안 (non-blocking)

  1. seek(0); truncate()truncate(0) 한 줄. 실제 동작 동일.
  2. uninstall_cmd.py:692 의 메시지에 lsof 가 없는 환경 (예: 일부 컨테이너) 안내로 ps aux | grep memtomem 도 함께 노출. 같은 파일의 db_lock 분기 (line 704) 는 둘 다 보여줌 — symmetry 차원에서.

Verdict

flock-open-mode 메모리 규칙의 정확한 적용이고, regression 증명 + compatibility audit 모두 갖춤. 머지 적합. 위 두 nit은 follow-up commit 으로 묶거나 그대로 두어도 됨.

Copy link
Copy Markdown
Owner

@memtomem memtomem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed inline; LGTM per the analysis comment above.

pandas-studio and others added 2 commits April 26, 2026 07:40
`server/__init__.py:main` opened the runtime pid file with
`open(pid_file, "w")`, which truncates the file at *open* time —
before the `fcntl.flock(LOCK_EX | LOCK_NB)` probe decides whether
this process is the lock holder. When a second `memtomem-server`
starts while the first is still running (multiple Claude Code /
Codex / Gemini MCP clients spawning the binary in parallel, or any
restart race), the second process truncates the live server's
pid file to zero bytes, then bails on the flock probe and falls
through. The first server keeps running, but the recorded pid is
gone from disk.

The user-visible symptom: `mm uninstall` reports

    Server still running (pid None). Refusing to delete state ...

The refusal itself is correct — flock IS held by the live writer
and removing the WAL would risk corruption — but `pid None` loses
the diagnostic value of the recorded process identity. `lsof
<pidfile>` and `ps -p <pid>` both stop working as a way to find
the holder. The legacy compatibility helper
`_try_hold_legacy_flock` already uses the safe `"a+b"` pattern, so
this is just the primary lock acquisition that never got the same
treatment; the open-mode race has been latent since the pid lock
was introduced (`0597de4`, predates the #412 path relocation).

Fix: open with `"a+"` (append, no truncate at open), acquire the
flock, then `seek(0) + truncate() + write(pid)`. The file content
is only mutated after the lock decision, so a contended start
leaves the live holder's pid intact.

Also tightens the `mm uninstall` refusal message: when the pid
field is None but the flock is held (the truncate-race fingerprint,
plus partial-write crashes during startup), surface a dedicated
"pid unknown" branch that points the user at `lsof <pidfile>` so
they can identify the writer without another diagnostic round-trip.

Regression tests pin both behaviors:

- `test_contended_server_start_preserves_pid_file_content` spawns
  the server with an outside `LOCK_EX` already held on the pid
  file and asserts the pre-seeded content survives. Verified to
  fail on the pre-fix code with `Got: ''`.
- `test_refuses_with_unknown_pid_branch_when_pid_file_empty`
  pins the new uninstall message branch via empty content + held
  flock. Verified to fail on the pre-fix code with the exact
  `Server still running (pid None)` string the user reported.

Co-authored-by: pandas-studio <[email protected]>
Co-authored-by: Claude <[email protected]>
`_probe_pid_file` only returns ``alive=True`` with the ``pid_file``
field populated; the ``pid_file=None`` return path is paired with
``alive=False``. So inside ``if server.alive: if server.pid is None:``
the ``server.pid_file is not None`` guard always passes and the
``ps aux | grep memtomem`` arm is dead code.

Per CLAUDE.md ("don't add fallbacks for scenarios that can't happen,
trust internal code guarantees"), drop the ternary and reference
``server.pid_file`` directly. Note the invariant in a comment so a
future ``_probe_pid_file`` change that relaxes it surfaces the
contract violation here.

Tests unchanged — the existing ``test_refuses_with_unknown_pid_branch_
when_pid_file_empty`` already pins the ``lsof`` substring, which is
exactly what the simplified branch emits.

Co-authored-by: pandas-studio <[email protected]>
Co-authored-by: Claude <[email protected]>
@memtomem memtomem force-pushed the fix/pid-file-truncate-race branch from 7a7de01 to 6ac3360 Compare April 25, 2026 22:40
@memtomem memtomem merged commit 26f31db into main Apr 25, 2026
7 checks passed
@memtomem memtomem deleted the fix/pid-file-truncate-race branch April 25, 2026 22:42
@github-actions github-actions Bot locked and limited conversation to collaborators Apr 25, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants