Skip to content

fix(entity_registry): atomic write to prevent partial corruption on crash#1215

Merged
igorls merged 2 commits intoMemPalace:developfrom
arnoldwender:fix/entity-registry-atomic-write
May 6, 2026
Merged

fix(entity_registry): atomic write to prevent partial corruption on crash#1215
igorls merged 2 commits intoMemPalace:developfrom
arnoldwender:fix/entity-registry-atomic-write

Conversation

@arnoldwender
Copy link
Copy Markdown
Contributor

What and Why

EntityRegistry.save() calls Path.write_text() directly, which truncates the target file and then writes. A crash between truncate and full-flush (power loss, OOM, filesystem full, kill -9) leaves an empty or half-written entity_registry.json. The whole people/projects map is lost; the system silently falls back to an empty registry on next load() (json.JSONDecodeError is swallowed in load()).

For a tool whose value is built up from months of mining transcripts and disambiguating entities, this is a real data-loss risk on any crash.

Root Cause

mempalace/entity_registry.py:323self._path.write_text(...) is not atomic. The kernel sees: open(..., 'w') (truncates), then write, then close. Any failure between truncate and successful close corrupts the file.

Fix

Standard atomic-write pattern:

  1. Serialize to a sibling .tmp file in the same directory (so os.replace stays on one filesystem)
  2. fsync the temp file
  3. chmod 0o600 the temp file before rename
  4. os.replace(tmp, target) — atomic on POSIX and Windows

Any crash before os.replace returns leaves the previous registry intact. The new registry only becomes visible to readers when the rename has fully landed on disk.

Test plan

  • test_save_is_atomic_does_not_leave_tmp.tmp sidecar removed on successful save
  • test_save_preserves_previous_on_serialization_failure — previous content unchanged when os.replace raises mid-save (simulates filesystem-full or permission flip after temp write)
  • All 31 test_entity_registry.py tests pass
  • Full suite: 1318 passed locally (2 pre-existing test_backends.py::test_pin_hnsw_threads* failures unrelated — chromadb configuration_json["hnsw"] schema drift, also fails on clean develop)

@jphein
Copy link
Copy Markdown
Collaborator

jphein commented Apr 27, 2026

+1 to the atomic-write pattern; the truncate-then-write window in Path.write_text is the kind of thing that bites on power loss exactly when you can least afford to lose the file.

One small hardening that's worth considering for the rename-durability claim: os.replace is atomic with respect to readers seeing either the old or the new file, but on some Linux filesystems (notably ext4 with default mount options) the rename's durability across power loss requires an additional fsync on the parent directory. Without it, the kernel can ack the rename, then a crash reverts to a state where the temp file is present and the target is at the old version.

os.replace(tmp_path, self._path)
dir_fd = os.open(str(self._path.parent), os.O_RDONLY)
try:
    os.fsync(dir_fd)
finally:
    os.close(dir_fd)

For entity_registry.json specifically this is academic — the file is regenerated by mempalace init on demand — but the PR motivation frames the registry as months-of-accumulated-state, and at that bar the directory fsync is the standard finishing move (SQLite's atomic-commit doc spells it out). Cheap belt-and-suspenders if you want to take it; happy to land as-is otherwise since the core truncate-window fix is the load-bearing part.

arnoldwender added a commit to arnoldwender/mempalace that referenced this pull request Apr 28, 2026
Without this, on ext4 (and similar) filesystems the rename ack does not
guarantee durability across power loss — a crash can revert to a state
where the temp file is present and the target is at the old version.

Suggested by @jphein on MemPalace#1215.
@arnoldwender
Copy link
Copy Markdown
Contributor Author

@jphein Good catch — the parent-dir fsync is the natural completion of "atomic write means durable write." Even though entity_registry.json is regenerable, leaving the gap would be inconsistent with the PR framing.

Folded it in as 490f585. OSError is swallowed for Windows and special filesystems that reject directory fds (they have different durability semantics on rename anyway).

Test suite: 1317 passed. (Two unrelated pre-existing failures in test_backends.py HNSW retrofit reproduce on develop too — not from this change.)

Ready for re-review.

@jphein
Copy link
Copy Markdown
Collaborator

jphein commented Apr 30, 2026

Thanks @arnoldwender490f585 looks right. Swallowing OSError for Windows and special filesystems is the right call (os.fsync(fd) on a directory has no portable Windows equivalent, and the rename durability semantics differ enough there that the per-platform skip is the cleanest path). The PR is now strict-stronger than my original review asked for — +1 to merge from my side once CI is green.

@arnoldwender
Copy link
Copy Markdown
Contributor Author

Friendly ping — CI is green across all platforms and @jphein gave +1 on 2026-04-30. Ready to merge whenever a maintainer has a moment.

…rash

EntityRegistry.save() called Path.write_text() directly, which truncates
the target file and then writes — so a crash mid-write (power loss, OOM,
filesystem-full mid-flush) leaves an empty or half-written
entity_registry.json. The whole people/projects map is lost; the system
falls back to an empty registry on next load.

Switch to the standard atomic-write pattern: serialize to a sibling
.tmp file in the same directory (so os.replace stays on one filesystem),
fsync, chmod 0o600, then os.replace over the target. The replace is
atomic on POSIX and Windows, so any crash leaves the previous registry
intact instead of a truncated file.

Tests cover: no leftover .tmp on success, and previous content preserved
when os.replace itself raises mid-save.
Without this, on ext4 (and similar) filesystems the rename ack does not
guarantee durability across power loss — a crash can revert to a state
where the temp file is present and the target is at the old version.

Suggested by @jphein on MemPalace#1215.
@arnoldwender arnoldwender force-pushed the fix/entity-registry-atomic-write branch from 490f585 to 2e441d1 Compare May 4, 2026 09:08
@igorls igorls merged commit e18981a into MemPalace:develop May 6, 2026
6 checks passed
igorls added a commit that referenced this pull request May 6, 2026
…1282 #1167 #1160

Bundled CHANGELOG entries for the seven Tier-1 PRs merged today, including
the behavior-change call-out for #1167 (KG date validators now reject
non-ISO inputs that previously produced silent empty results).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants