fix(entity_registry): atomic write to prevent partial corruption on crash#1215
Conversation
|
+1 to the atomic-write pattern; the truncate-then-write window in One small hardening that's worth considering for the rename-durability claim: os.replace(tmp_path, self._path)
dir_fd = os.open(str(self._path.parent), os.O_RDONLY)
try:
os.fsync(dir_fd)
finally:
os.close(dir_fd)For |
Without this, on ext4 (and similar) filesystems the rename ack does not guarantee durability across power loss — a crash can revert to a state where the temp file is present and the target is at the old version. Suggested by @jphein on MemPalace#1215.
|
@jphein Good catch — the parent-dir fsync is the natural completion of "atomic write means durable write." Even though entity_registry.json is regenerable, leaving the gap would be inconsistent with the PR framing. Folded it in as 490f585. Test suite: 1317 passed. (Two unrelated pre-existing failures in Ready for re-review. |
|
Thanks @arnoldwender — |
|
Friendly ping — CI is green across all platforms and @jphein gave +1 on 2026-04-30. Ready to merge whenever a maintainer has a moment. |
…rash EntityRegistry.save() called Path.write_text() directly, which truncates the target file and then writes — so a crash mid-write (power loss, OOM, filesystem-full mid-flush) leaves an empty or half-written entity_registry.json. The whole people/projects map is lost; the system falls back to an empty registry on next load. Switch to the standard atomic-write pattern: serialize to a sibling .tmp file in the same directory (so os.replace stays on one filesystem), fsync, chmod 0o600, then os.replace over the target. The replace is atomic on POSIX and Windows, so any crash leaves the previous registry intact instead of a truncated file. Tests cover: no leftover .tmp on success, and previous content preserved when os.replace itself raises mid-save.
Without this, on ext4 (and similar) filesystems the rename ack does not guarantee durability across power loss — a crash can revert to a state where the temp file is present and the target is at the old version. Suggested by @jphein on MemPalace#1215.
490f585 to
2e441d1
Compare
What and Why
EntityRegistry.save()callsPath.write_text()directly, which truncates the target file and then writes. A crash between truncate and full-flush (power loss, OOM, filesystem full, kill -9) leaves an empty or half-writtenentity_registry.json. The whole people/projects map is lost; the system silently falls back to an empty registry on nextload()(json.JSONDecodeErroris swallowed inload()).For a tool whose value is built up from months of mining transcripts and disambiguating entities, this is a real data-loss risk on any crash.
Root Cause
mempalace/entity_registry.py:323—self._path.write_text(...)is not atomic. The kernel sees:open(..., 'w')(truncates), thenwrite, thenclose. Any failure between truncate and successful close corrupts the file.Fix
Standard atomic-write pattern:
.tmpfile in the same directory (soos.replacestays on one filesystem)fsyncthe temp filechmod 0o600the temp file before renameos.replace(tmp, target)— atomic on POSIX and WindowsAny crash before
os.replacereturns leaves the previous registry intact. The new registry only becomes visible to readers when the rename has fully landed on disk.Test plan
test_save_is_atomic_does_not_leave_tmp—.tmpsidecar removed on successful savetest_save_preserves_previous_on_serialization_failure— previous content unchanged whenos.replaceraises mid-save (simulates filesystem-full or permission flip after temp write)test_entity_registry.pytests passtest_backends.py::test_pin_hnsw_threads*failures unrelated — chromadbconfiguration_json["hnsw"]schema drift, also fails on cleandevelop)