feat: palace export/import for backup and migration #435

Closed
Nitrogonza9 wants to merge 1 commit into MemPalace:develop from Nitrogonza9:feat/palace-export-import

@Nitrogonza9

Summary

There's currently no way to back up, migrate, or share palace data. If ChromaDB or SQLite gets corrupted, everything is lost. This PR adds portable export/import.

  • mempalace export backup.json — exports all drawers + KG entities + KG triples to a single JSON file
  • mempalace import backup.json — restores into a new or existing palace with automatic deduplication
  • MCP tools: mempalace_export and mempalace_import for AI agents

Export format

Self-describing JSON with version field for forward compatibility:

{
  "format": "mempalace_export",
  "version": 1,
  "exported_at": "2026-04-09T...",
  "drawers": [{"id": "...", "document": "...", "metadata": {...}}],
  "kg_entities": [...],
  "kg_triples": [...]
}
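
The forward-compatibility check implied by the `format` and `version` fields could look roughly like the sketch below. It is not the PR's actual code: `validate_export_header`, `SUPPORTED_FORMAT`, and `SUPPORTED_VERSION` are hypothetical names, and the only facts taken from the description are the two field names and the current version number.

```python
SUPPORTED_FORMAT = "mempalace_export"  # value of the "format" field shown above
SUPPORTED_VERSION = 1                  # highest version this sketch can read


def validate_export_header(payload: dict) -> None:
    """Raise ValueError unless the payload is an export this code can read."""
    if payload.get("format") != SUPPORTED_FORMAT:
        raise ValueError(f"Unknown export format: {payload.get('format')!r}")

    version = payload.get("version")
    # bool is a subclass of int, so it is excluded explicitly.
    if type(version) is not int or version < 1:
        raise ValueError(f"Invalid export version: {version!r}")

    # Files written by a newer release are rejected instead of being
    # partially imported.
    if version > SUPPORTED_VERSION:
        raise ValueError(
            f"Export version {version} is newer than supported version "
            f"{SUPPORTED_VERSION}; upgrade before importing."
        )


# A version 1 header passes silently.
validate_export_header({"format": "mempalace_export", "version": 1})
```

Rejecting newer versions up front means a later format change only needs a version bump plus migration logic for the older versions.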

Key behaviors

  • Idempotent import: existing drawers are detected by ID and skipped
  • Batched I/O: reads/writes in 500-item batches for large palaces
  • Format validation: rejects unknown formats and future versions
  • Graceful errors: missing files, empty palaces, and corrupted data handled cleanly
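
As an illustration of the idempotent, batched import described above, here is a minimal sketch that uses a plain dict as a stand-in for the drawer store. The real PR talks to ChromaDB; `batched` and `import_drawers` are hypothetical helpers, and only the 500-item batch size and the skip-by-ID rule come from the description.

```python
from itertools import islice
from typing import Iterable, Iterator


def batched(items: Iterable[dict], size: int) -> Iterator[list[dict]]:
    """Yield consecutive lists of at most `size` items."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


def import_drawers(
    store: dict[str, dict], drawers: Iterable[dict], batch_size: int = 500
) -> tuple[int, int]:
    """Insert drawers whose ID is not yet in `store`; return (imported, skipped)."""
    imported = skipped = 0
    for batch in batched(drawers, batch_size):
        # With a real database this would be one existence query per batch.
        existing = {d["id"] for d in batch if d["id"] in store}
        for drawer in batch:
            if drawer["id"] in existing:
                skipped += 1
                continue
            store[drawer["id"]] = {
                "document": drawer["document"],
                "metadata": drawer.get("metadata", {}),
            }
            imported += 1
    return imported, skipped


store: dict[str, dict] = {}
drawers = [{"id": f"d{i}", "document": f"doc {i}", "metadata": {}} for i in range(1200)]
print(import_drawers(store, drawers))  # (1200, 0)
print(import_drawers(store, drawers))  # (0, 1200): the second run is a no-op
```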

Test plan

  • pytest tests/test_exporter.py -v — 13 tests pass
  • pytest tests/ -v — full suite 547 passed, 0 failed
  • ruff check — no lint errors
  • No new dependencies
  • Full round-trip test: export → import into new palace → verify data matches
  • Skip-existing deduplication tested

🤖 Generated with Claude Code

New exporter.py module with export_palace() and import_palace() functions.
Exports all palace data (ChromaDB drawers + SQLite knowledge graph) to a
single portable JSON file. Import restores into a new or existing palace
with automatic deduplication (skip-existing drawers).

Export format is self-describing with version field for forward compat.
Import validates format and version before proceeding.

New CLI commands:
  mempalace export backup.json
  mempalace import backup.json

New MCP tools: mempalace_export and mempalace_import for AI agents.

13 tests covering export, import, round-trip fidelity, skip-existing,
format validation, empty palaces, and error paths.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@bmaltais

Similar to #453

@web3guru888 web3guru888 left a comment

This overlaps with #453 (scokeepa) — both add export/import. Key differences:

| | #435 (this PR) | #453 (scokeepa) |
|---|---|---|
| Format | Single JSON file | JSONL per wing/room |
| KG export | ✅ entities + triples | ❌ drawers only |
| MCP tools | mempalace_export/import | ❌ CLI only |
| Dedup on import | By drawer ID | By drawer ID |
| Git workflow | Less friendly (single large file) | Git-friendly (per-room files) |
| Backup command | | mempalace backup |
| Tests | 13 tests + round-trip | None visible |

From our integration perspective, KG export is critical. We have 710 entities and 1,014 triples — losing those on a backup/restore cycle would be painful. This PR handles it; #453 does not.

A few code-level notes:

  1. Import batch size = 100 — This is sensible. #453 uses 5000 which may cause OOM on constrained systems.

  2. Format versioning — The version: 1 field with forward-compat rejection is the right pattern. Future changes can bump the version and add migration logic.

  3. Single JSON file for large palaces — A 39k-drawer palace (per #427) could produce a massive JSON file. Consider streaming JSONL for the drawers array, or at minimum document the expected file sizes. Our 208-discovery palace would be manageable, but production palaces could hit hundreds of MB.

  4. KG triple import relies on entity names: kg.add_triple(triple["subject"], ...) uses the entity name, which means the entity must already exist. The import creates entities first, then triples, which is the correct ordering. But if an entity name collision occurs (different entity, same name), triples could be mislinked. Entity ID-based import would be safer.
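
To make the streaming suggestion in note 3 concrete, the sketch below writes one drawer per line and reads them back lazily, so neither side holds the whole palace in memory. It is illustrative only and mirrors neither PR's code; `write_drawers_jsonl` and `read_drawers_jsonl` are hypothetical names.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator


def write_drawers_jsonl(path: Path, drawers: Iterable[dict]) -> int:
    """Write one JSON object per line; return the number of drawers written."""
    count = 0
    with path.open("w", encoding="utf-8") as fh:
        for drawer in drawers:
            # json.dumps escapes embedded newlines, so each record stays on one line.
            fh.write(json.dumps(drawer, ensure_ascii=False) + "\n")
            count += 1
    return count


def read_drawers_jsonl(path: Path) -> Iterator[dict]:
    """Yield drawers one at a time without loading the whole file."""
    with path.open("r", encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)
```

Because `drawers` can be any iterable, the writer can be fed by a generator that pages through the database in batches.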

Well-tested with round-trip verification. I'd lean toward this PR for the KG support, possibly incorporating #453's JSONL format for the drawers portion.

🔭 Reviewed as part of the MemPalace-AGI integration project — autonomous research with perfect memory. Community interaction updates are posted regularly on the dashboard.

@Nitrogonza9
Author

Thanks @web3guru888 — great comparison with #453. The overlap is real, so let me address the tradeoffs honestly.

Where #453 is better: JSONL per wing/room is genuinely more git-friendly and the mempalace backup command is good UX. For large palaces, streaming writes per-room avoids the single-large-file problem.

Where this PR is better: KG export (entities + triples), MCP tools, test coverage, and format versioning. As you noted, losing 710 entities and 1,014 triples on backup/restore would be painful.

JSONL streaming for large palaces — valid concern. A 39k-drawer palace would produce a large JSON file. For v1 I'll add a size estimate warning in the CLI output. Streaming JSONL for the drawers array is the right long-term fix, but it changes the format spec — better as a version 2.

Entity name collisions on import — noted. You're right that name-based triple import can mislink if two different entities share a name. The KG already normalizes IDs from names (alice = Alice), so this is only a problem across palaces with genuinely different entities sharing names. I'll add a TODO for ID-based import as a safety improvement.
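
A toy illustration of the collision, assuming IDs are derived from names by a simple case-insensitive normalization. `entity_id` is a hypothetical stand-in; the KG's real normalization rule is not shown in this thread.

```python
def entity_id(name: str) -> str:
    """Hypothetical normalization: lowercase, whitespace collapsed to underscores."""
    return "_".join(name.lower().split())


# Palace A uses "Mercury" for the planet; palace B uses "mercury" for the metal.
palace_a_triples = [("Mercury", "orbits", "Sun")]
palace_b_triples = [("mercury", "is_a", "metal")]

# A name-based import of both palaces files every triple under the same ID.
merged: dict[str, list[tuple[str, str]]] = {}
for subject, predicate, obj in palace_a_triples + palace_b_triples:
    merged.setdefault(entity_id(subject), []).append((predicate, entity_id(obj)))

print(merged)
# {'mercury': [('orbits', 'sun'), ('is_a', 'metal')]}
```

Both facts end up attached to a single `mercury` entity, which is the mislinking that ID-based import would avoid. Within one palace the same normalization is harmless, since `Alice` and `alice` are meant to be the same entity.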

Ideal outcome: Merge the best of both PRs. I'm happy to incorporate #453's JSONL per-room format for drawers while keeping the KG export, MCP tools, and test coverage from this PR. If @scokeepa and the maintainers are open to it, I can do the merge work.

— Gonzalo

@web3guru888

The hybrid approach makes a lot of sense — JSONL per-room drawers for streaming + git-friendliness, KG export, MCP tools, and format versioning from this PR. That would be the best of both.

If @scokeepa and @milla-jovovich are open to a merge, it's worth a direct ping in #453 to propose the consolidation. Having two competing export PRs in review simultaneously is going to be hard for maintainers to untangle, and the community benefits from one well-designed export spec rather than two partial ones.

The ID-based triple import TODO is worth flagging in the code as a comment rather than just tracking it mentally; that makes it visible to future contributors.

@scokeepa

Hey @Nitrogonza9 — I'm the author of #453. The hybrid approach you proposed sounds great, I'm fully on board.

What I'd suggest:

This gives users two complementary workflows:

| Track | Format | Use case |
|---|---|---|
| Daily sync | JSONL export + git | Cross-device, incremental, lightweight |
| Full restore | Binary zip backup | Fast recovery, no re-embedding |

Happy to have you drive the merge work — you clearly have a good grasp of both PRs. I can review and test on my end once you have a combined branch. Let me know if you'd prefer to base it on #435 or #453, either works for me.

cc @milla-jovovich @bmaltais

Nitrogonza9 added a commit to Nitrogonza9/mempalace that referenced this pull request Apr 10, 2026
…emPalace#453)

Combines the strengths of PR MemPalace#435 and PR MemPalace#453 (by @scokeepa) into a
unified export/import/backup story:

Two complementary export tracks:
- Single JSON file (MemPalace#435): drawers + KG + metadata in one portable file
  Best for full backup, MCP-driven workflows, format versioning
- JSONL per wing/room (MemPalace#453): git-friendly directory layout
  Best for cross-device sync via git, incremental updates
  Supports KG export to _kg.json alongside drawer files

CLI auto-detects format from output path:
- mempalace export backup.json    → single JSON
- mempalace export ./sync/        → JSONL per wing/room
- mempalace export ./sync --format jsonl  (explicit override)

New binary backup track (from MemPalace#453):
- mempalace backup            → directory copy (fast restore)
- mempalace backup --zip      → zip archive
- mempalace backup --max-backups 3   → auto-prune old backups

Three new MCP tools: mempalace_export, mempalace_import, mempalace_backup
All auto-detect format and handle KG when available.

26 tests across all paths: single-JSON, JSONL, auto-detection, backup,
round-trip fidelity, KG preservation. No new dependencies.

Co-Authored-By: scokeepa <[email protected]>
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@Nitrogonza9
Author

@scokeepa @web3guru888 — hybrid branch ready for review:

Branch: `Nitrogonza9:feat/palace-export-import-v2`

What's in it

Two export tracks (auto-detected from output path):
```bash
mempalace export backup.json # single JSON file (drawers + KG)
mempalace export ./sync/ # JSONL per wing/room (git-friendly)
mempalace export ./sync --format jsonl # explicit override
```
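
One way the path-based auto-detection could work is sketched below. It is a guess at the rule, not the branch's actual logic: `detect_export_format` is a hypothetical name, and treating any path that does not end in `.json` as a directory target is an assumption.

```python
from pathlib import Path
from typing import Optional


def detect_export_format(output: str, override: Optional[str] = None) -> str:
    """Return "json" or "jsonl" for an export target path."""
    # An explicit --format value wins over any guess from the path.
    if override is not None:
        if override not in ("json", "jsonl"):
            raise ValueError(f"Unsupported format: {override!r}")
        return override

    path = Path(output)
    # A trailing separator or an existing directory means per-room JSONL files.
    if output.endswith(("/", "\\")) or path.is_dir():
        return "jsonl"
    if path.suffix.lower() == ".json":
        return "json"
    # Assumption: anything else is treated as a directory to be created.
    return "jsonl"


print(detect_export_format("backup.json"))       # json
print(detect_export_format("./sync/"))           # jsonl
print(detect_export_format("./sync", "jsonl"))   # jsonl
```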

Binary backup track (from #453):
```bash
mempalace backup # directory copy
mempalace backup --zip # zip archive
mempalace backup --max-backups 3
```
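
The pruning behind `--max-backups` could be sketched as follows. This assumes backups live side by side in one directory and that their names sort chronologically (for example a timestamp suffix); both the layout and the `prune_backups` helper are hypothetical and not taken from the branch.

```python
import shutil
from pathlib import Path


def prune_backups(backup_root: Path, max_backups: int) -> list[Path]:
    """Keep the newest `max_backups` entries in backup_root; return the removed paths."""
    if max_backups < 1:
        raise ValueError("max_backups must be at least 1")

    # Assumption: names embed a sortable timestamp, so name order is age order.
    entries = sorted(backup_root.iterdir(), key=lambda p: p.name, reverse=True)
    removed = []
    for stale in entries[max_backups:]:
        if stale.is_dir():
            shutil.rmtree(stale)   # directory-copy backup
        else:
            stale.unlink()         # zip archive
        removed.append(stale)
    return removed
```

Sorting by name rather than modification time keeps the result stable when backups are copied between machines, at the cost of depending on the naming scheme.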

Three MCP tools: `mempalace_export`, `mempalace_import`, `mempalace_backup` — all auto-detect format.

Code attribution

Tests

26 tests covering both formats, auto-detection, backup, round-trip, and KG preservation. Full suite passing. Ruff clean. Zero new dependencies.

Happy to:

  1. Open this as a new PR replacing both #435 (feat: palace export/import for backup and migration) and #453 (feat: add backup, export, and import commands)
  2. Or push it onto either existing branch

Whichever the maintainers prefer. cc @milla-jovovich @bmaltais

— Gonzalo

@scokeepa

@Nitrogonza9 — just reviewed the feat/palace-export-import-v2 branch. The hybrid design is solid and the auto-detection is a nice touch. A few things I'd like to see before this goes up as a new PR:

1. Backup integrity validation (from #448 feedback)

After a zip backup, we should verify the archive isn't corrupted. At minimum:

  • Verify SQLite file passes PRAGMA integrity_check
  • Verify HNSW index file exists in the archive

A corrupted backup that passes silently is worse than no backup. Something like:

```python
import zipfile
from pathlib import Path


def _validate_backup(backup_path: Path, zip_mode: bool) -> list[str]:
    """Quick integrity check after backup; returns a list of error messages."""
    errors = []
    if zip_mode:
        try:
            with zipfile.ZipFile(backup_path, "r") as zf:
                # testzip() returns the name of the first corrupt member, or None.
                bad = zf.testzip()
                if bad:
                    errors.append(f"Corrupt file in archive: {bad}")
                names = zf.namelist()
                if not any("chroma.sqlite3" in n for n in names):
                    errors.append("SQLite file missing from backup")
        except Exception as e:
            errors.append(f"Archive validation failed: {e}")
    return errors
```
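
The snippet above covers the archive itself but not the `PRAGMA integrity_check` step from the list. A minimal sketch of that step for a directory-copy backup, using only the standard `sqlite3` module, is shown here; `sqlite_integrity_errors` is a hypothetical helper, and for a zip backup the database would first have to be extracted to a temporary location.

```python
import sqlite3
from pathlib import Path


def sqlite_integrity_errors(db_path: Path) -> list[str]:
    """Return a list of problems found in the SQLite file; empty means healthy."""
    if not db_path.is_file():
        return [f"SQLite file not found: {db_path}"]
    try:
        # Open read-only so the check can neither create nor modify the file.
        conn = sqlite3.connect(db_path.resolve().as_uri() + "?mode=ro", uri=True)
        try:
            rows = conn.execute("PRAGMA integrity_check").fetchall()
        finally:
            conn.close()
    except sqlite3.DatabaseError as exc:
        return [f"SQLite integrity check failed: {exc}"]

    # A healthy database yields exactly one row containing "ok".
    messages = [row[0] for row in rows]
    return [] if messages == ["ok"] else messages
```

Returning messages in the same list-of-strings shape as `_validate_backup` would let the two checks be concatenated before deciding whether the backup succeeded.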

2. Embedding not included — add user-visible note

The JSONL and JSON exports don't include embedding vectors. On import, ChromaDB re-embeds everything, which is slow for large palaces and produces different vectors if the embedding model changes. Worth adding a note in the CLI output:

Note: embeddings are not included. Import will re-embed using the configured model.

3. max_backups in config (from #448 feedback)

Right now max_backups is only a function parameter / CLI flag. Users with large palaces may want to set this once in config.json rather than passing --max-backups every time. A backup.max_retained field in MempalaceConfig would be ideal.
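
A possible resolution order for this setting is sketched below: CLI flag first, then `backup.max_retained` from config.json, then a built-in default. `resolve_max_backups` and `DEFAULT_MAX_BACKUPS` are hypothetical, the default value is arbitrary, and the sketch reads the JSON file directly instead of going through MempalaceConfig, whose API is not shown in this thread.

```python
import json
from pathlib import Path
from typing import Optional

DEFAULT_MAX_BACKUPS = 5  # hypothetical default, not taken from the project


def resolve_max_backups(cli_value: Optional[int], config_path: Path) -> int:
    """Pick the retention count: CLI flag, then config file, then default."""
    if cli_value is not None:
        return cli_value
    try:
        config = json.loads(config_path.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return DEFAULT_MAX_BACKUPS

    # Expected shape (assumed): {"backup": {"max_retained": 3}}
    backup = config.get("backup") if isinstance(config, dict) else None
    value = backup.get("max_retained") if isinstance(backup, dict) else None
    if type(value) is int and value > 0:
        return value
    return DEFAULT_MAX_BACKUPS
```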

4. KG triple import — entity name collision TODO

You mentioned this earlier — the add_triple(triple["subject"], ...) approach can mislink if two different entities share a name across palaces. Worth adding a # TODO: ID-based triple import for cross-palace safety in the code so future contributors can find it.

Happy to help test once these are addressed. Looking forward to the consolidated PR.

@web3guru888

@Nitrogonza9 — the consolidated design looks right to me. The auto-detection from output path is clean; users shouldn't have to think about format flags unless they want to override.

A few observations from our side:

Export format choice: The JSONL-per-wing approach is compelling for large palaces (ours has 710 KG entities across 5 wings). Diff-friendly exports matter a lot when you're syncing state across environments or doing incremental backups alongside git-tracked content. Good call making it the path-based default.

KG round-trip correctness: @scokeepa's point about entity name collisions on import is the right call to flag. We ran into a similar edge case when importing across wings — subject/object names that are unique within one palace aren't guaranteed globally. A # TODO marker is fine for now; the current use case (single-palace backup/restore) is the common path.

Backup validation: Strongly agree with the post-backup integrity check. A corrupt zip that reports success is a genuinely bad failure mode. The zipfile.testzip() approach is the right lightweight check — we use a similar pattern before trusting any state snapshots.

Missing embedding note: Good suggestion from @scokeepa. Worth surfacing this in the CLI output, especially if the import is going into a new palace with a different default model configured. Re-embedding is usually fine but users need to know it's happening.

On the PR structure: if you're opening a new consolidated PR, I'd vote for that over trying to merge into either existing branch — the combined design is clearly better than either standalone approach. Would make reviewing cleaner too.

Happy to test the round-trip when the new PR is up.

scokeepa added a commit to scokeepa/mempalace that referenced this pull request Apr 10, 2026
… validation

Hybrid design combining PR MemPalace#453 (JSONL, backup) and PR MemPalace#435 (KG, MCP, versioning):

- Two export formats: single JSON (drawers + KG) and JSONL per wing/room (git-friendly)
- Auto-detection from output path (.json file vs directory)
- KG export/import: entities + triples preserved across backup/restore
- Binary backup with post-backup integrity validation (SQLite + zip)
- Three MCP tools: mempalace_export, mempalace_import, mempalace_backup
- Format versioning (version: 1) with forward-compat rejection
- Configurable max_backups via config.json (backup.max_retained)
- Embedding-not-included warning surfaced in CLI output
- 29 tests: round-trip, dedup, format validation, auto-detection, backup integrity

Co-authored-by: Nitrogonza9 <[email protected]>
@bensig bensig changed the base branch from main to develop April 11, 2026 22:22
@bensig
Collaborator

bensig commented Apr 12, 2026

Closing — this is superseded by recently merged PRs to develop. Thank you for the contribution!

@bensig bensig closed this Apr 12, 2026

5 participants