Skip to content

fix: guard against data loss in repair, migrate, and CLI rebuild#935

Merged
igorls merged 3 commits intoMemPalace:developfrom
shaun0927:fix/repair-crash-safety
Apr 25, 2026
Merged

fix: guard against data loss in repair, migrate, and CLI rebuild#935
igorls merged 3 commits intoMemPalace:developfrom
shaun0927:fix/repair-crash-safety

Conversation

@shaun0927
Copy link
Copy Markdown
Contributor

@shaun0927 shaun0927 commented Apr 16, 2026

Summary

Three data-integrity fixes honoring CLAUDE.md's design principle: "A crash mid-operation must leave the existing palace untouched."

Refs: #934 (parent) — specifically #1151, #1152, #1153.

What changed

mempalace/repair.py — restore from backup on rebuild failure (#1151)

rebuild_index calls delete_collection then upserts drawers in a loop. If any upsert fails partway (OOM, ChromaDB error), the rest of the drawers are gone. The backup copy exists but is never consulted.

try:
    for i in range(0, len(all_ids), batch_size):
        new_col.upsert(...)
except Exception:
    shutil.copy2(backup_path, sqlite_path)  # restore pre-repair state
    raise

mempalace/migrate.py — atomic palace swap (#1152)

shutil.rmtree(palace_path) followed by shutil.move(...) has a window where both copies are gone. Replaced with a rename-aside pattern. The second commit in this PR (659cb81) hardens the cross-device fallback per @qodo-ai-reviewer's feedback:

  • Primary swap uses os.replace (same-filesystem atomic)
  • errno.EXDEV gates the shutil.move fallback so unrelated OSErrors surface instead of getting masked
  • _restore_stale_palace clears any partial destination before the rollback os.replace, and logs both paths if restore itself fails

mempalace/cli.py — defensive batch extraction offset (#1153)

cmd_repair extraction used offset += batch_size (fixed 5000) regardless of actual batch length. Changed to offset += len(batch["ids"]) with an empty-batch guard, matching the pattern at exporter.py:70.

Test plan

  • ruff check mempalace/ — passed
  • ruff format --check mempalace/ — passed
  • pytest tests/test_migrate.py — 5 passed (including 3 new regression tests covering the rollback hardening)

Overlap with open PRs

Per @mvalentsev's review:

Split question

@bensig asked on #934 that each finding get its own PR. This bundles three under the "crash-safety" theme in three independent files. Happy to split into three per-issue PRs if preferred.

- repair.py: wrap upsert loop in try/except; restore from backup on
  failure instead of leaving a partially rebuilt collection
- migrate.py: replace non-atomic rmtree+move with rename-aside swap
  so a crash between the two calls does not destroy both copies
- cli.py: use offset += len(batch["ids"]) with empty-batch guard
  instead of fixed offset += batch_size to prevent skipping drawers
@igorls igorls added bug Something isn't working storage labels Apr 16, 2026
- repair.py: define backup_path before the conditional block so it is
  always in scope when the except handler references it
- migrate.py: restore old palace from .old if both os.rename and
  shutil.move fail during the swap step
Copy link
Copy Markdown
Contributor

@mvalentsev mvalentsev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finding 1 (repair crash safety) overlaps with open PR #239 (rusel95), which rewrites repair to a migrate-to-fresh-palace strategy: reads all data via get(), writes to a temp palace, verifies counts match, then atomically swaps directories. Original palace preserved as .backup with auto-restore on failure.

Findings 2 and 3 look clean to me.

@mvalentsev
Copy link
Copy Markdown
Contributor

Correction to my earlier review:

Finding 2 (migrate swap): PR #890 touches the same swap code path in migrate.py, adding _replace_palace_dir() with retry logic. Different focus (Windows handle cleanup) but related.

Finding 3 (cli.py offset): PRs #239 and #632 both rewrite the repair code path entirely (migrate-to-fresh strategy and nuke-rebuild respectively). If either lands, the extraction loop with the offset bug no longer exists.

@Qodo-Free-For-OSS
Copy link
Copy Markdown

Hi, In migrate(), if the cross-device fallback (shutil.move) partially creates the destination directory and then fails, the rollback os.rename(stale_path, palace_path) will also fail because palace_path already exists, leaving a partial migrated palace at the target and the original only at .old.

Severity: action required | Category: reliability

How to fix: Harden rollback and cleanup

Agent prompt to fix - you can give this to your LLM of choice:

Issue description

mempalace.migrate.migrate() can leave the palace in an inconsistent state if the cross-device swap fallback (shutil.move) fails after partially creating palace_path. The rollback then tries os.rename(stale_path, palace_path) which will fail if palace_path already exists.

Issue Context

The new logic correctly avoids the “both missing” window, but needs to handle the case where shutil.move(temp_palace, palace_path) partially copies and creates palace_path before raising.

Fix Focus Areas

  • mempalace/migrate.py[235-246]

What to change

  • In the except Exception: rollback path:
    • If palace_path exists (directory or file), remove it first (e.g., shutil.rmtree(palace_path, ignore_errors=False) / os.remove for file), then restore stale_path into place.
    • Consider wrapping rollback in its own try/except to emit a clear recovery message that names both paths (stale_path and partially-copied palace_path).
  • Optionally, use os.replace for same-filesystem renames, and distinguish cross-device errors (EXDEV) from other OSErrors to avoid masking real rename failures.

Found by Qodo code review

shutil.move() can partially create palace_path before raising, which would
trip a bare os.replace(stale_path, palace_path) rollback (dest exists).

- Switch the primary swap to os.replace so same-filesystem moves stay atomic
- Branch on errno.EXDEV before falling back to shutil.move, so real errors
  (permissions, EIO) surface instead of silently attempting a slow copy
- Extract rollback into _restore_stale_palace which clears any partial
  destination and, if the restore itself fails, logs both stale_path and
  palace_path so the operator can recover by hand

Adds three regression tests covering clean rollback, partial-copy cleanup,
and logged failure on rollback-failure.

Flagged by the Qodo reviewer on MemPalace#935.
@shaun0927 shaun0927 requested a review from igorls as a code owner April 24, 2026 04:12
@shaun0927
Copy link
Copy Markdown
Contributor Author

Pushed 659cb81 addressing both pieces of review feedback.

@qodo-ai-reviewer — rollback hardening:

  • Primary swap moved to os.replace so same-filesystem is atomic
  • errno.EXDEV branch gates the shutil.move fallback — any non-EXDEV OSError now surfaces instead of getting masked by a slow copy
  • Rollback extracted into _restore_stale_palace which clears the partial destination before os.replace(stale, palace), and logs both paths if the restore itself fails
  • Three regression tests in tests/test_migrate.py (clean rollback, partial-copy cleanup, failure-path logging)

@mvalentsev — overlap notes:

Also split the parent #934 into six child issues (#1151#1156; finding 6 withdrawn). Filed corresponding child references: #1151 / #1152 / #1153.

Question for maintainers (@bensig): bensig asked on #934 that each finding get its own PR. This PR bundles three fixes under the "crash-safety" theme in three independent files. Happy to split into three per-child-issue PRs if preferred — say the word and I'll carve this up.

@igorls igorls merged commit 5e57404 into MemPalace:develop Apr 25, 2026
6 checks passed
lealvona pushed a commit to lealvona/mempalace that referenced this pull request Apr 29, 2026
shutil.move() can partially create palace_path before raising, which would
trip a bare os.replace(stale_path, palace_path) rollback (dest exists).

- Switch the primary swap to os.replace so same-filesystem moves stay atomic
- Branch on errno.EXDEV before falling back to shutil.move, so real errors
  (permissions, EIO) surface instead of silently attempting a slow copy
- Extract rollback into _restore_stale_palace which clears any partial
  destination and, if the restore itself fails, logs both stale_path and
  palace_path so the operator can recover by hand

Adds three regression tests covering clean rollback, partial-copy cleanup,
and logged failure on rollback-failure.

Flagged by the Qodo reviewer on MemPalace#935.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working storage

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants