fix: use upsert and deterministic IDs to prevent data stagnation #140
bensig merged 5 commits into MemPalace:main from
Conversation
Pull request overview
Updates drawer ingestion to avoid stale data by enabling overwrites on existing IDs, and adds/expands automated tests while migrating packaging metadata toward uv/hatchling.
Changes:
- Switch ChromaDB writes from `add()` to `upsert()` in the miner and MCP write path, and make MCP drawer IDs deterministic.
- Add substantial pytest coverage for MCP server tools, search API, dialect, and knowledge graph behavior with shared fixtures.
- Remove `requirements.txt` and update `pyproject.toml` build/dev tooling configuration.
Reviewed changes
Copilot reviewed 9 out of 11 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| mempalace/miner.py | Uses collection.upsert() for drawer writes (intended to allow updates). |
| mempalace/mcp_server.py | Uses deterministic content-based drawer IDs and upsert() for MCP-added drawers. |
| mempalace/__init__.py | Bumps __version__ to 3.0.0. |
| tests/conftest.py | Adds shared fixtures for isolated palace/KG setup and seeded Chroma collections. |
| tests/test_mcp_server.py | Adds unit/integration tests for MCP request dispatch and tool handlers. |
| tests/test_searcher.py | Adds tests for the search_memories() programmatic API and filters. |
| tests/test_knowledge_graph.py | Adds tests covering entity/triple lifecycle, temporal queries, timeline, and stats. |
| tests/test_dialect.py | Adds tests for AAAK dialect compression/encoding/decoding utilities. |
| pyproject.toml | Moves to hatchling, adds dependency groups, updates dev deps. |
| requirements.txt | Removed in favor of pyproject.toml/uv.lock. |
| """Add one drawer to the palace.""" | ||
| drawer_id = f"drawer_{wing}_{room}_{hashlib.md5((source_file + str(chunk_index)).encode()).hexdigest()[:16]}" | ||
| try: | ||
| collection.add( | ||
| collection.upsert( | ||
| documents=[content], |
Switching to collection.upsert() enables updating an existing drawer ID, but process_file() still short-circuits on file_already_mined() (same module) and returns before any chunks are (re)added. As a result, modified files will still never reach this upsert call, so the data-stagnation problem described in the PR remains. Consider changing the skip logic to re-mine when the source file changes (e.g., store and compare a file hash/mtime in metadata, or add a --force/--refresh mode) and optionally delete drawers for chunks that no longer exist after re-chunking.
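One hedged way to implement the suggested re-mine check — the names `file_hash`, `needs_mining`, and the stored-hash mapping are illustrative, not existing repo code; the real skip logic lives in `file_already_mined()`:

```python
import hashlib
from pathlib import Path


def file_hash(path: str) -> str:
    """Content hash of a source file, suitable for change detection."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def needs_mining(path: str, stored_hashes: dict, force: bool = False) -> bool:
    """Re-mine when the file is new, its content changed, or --force was passed.

    stored_hashes maps file path -> last-mined content hash (e.g. kept in
    drawer metadata or a sidecar index).
    """
    if force:
        return True
    return stored_hashes.get(path) != file_hash(path)
```

With this in place, `process_file()` could skip only when `needs_mining()` is false, and optionally delete drawers for chunks that vanished after re-chunking.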
```python
def collection(palace_path):
    """A ChromaDB collection pre-seeded in the temp palace."""
    client = chromadb.PersistentClient(path=palace_path)
    col = client.get_or_create_collection("mempalace_drawers")
    return col
```
The collection/seeded_collection fixtures create a chromadb.PersistentClient but never explicitly release it. On Windows, ChromaDB can hold file handles after tests finish, and the current shutil.rmtree(..., ignore_errors=True) teardown can silently leave temp directories behind. Consider explicitly dropping references / forcing GC in teardown (or adopting the existing Windows-safe cleanup pattern used elsewhere in the repo) so failures don’t get masked and temp dirs don’t accumulate.
Hey Igor — CI is failing, looks like it needs a rebase on main (we've merged a bunch of changes recently). Can you rebase and push?
Force-pushed e6ae81b to e9986bf
Force-pushed e9986bf to 45c2c92
Same situation — your other PRs landed and created conflicts here. One more rebase on main should do it.
Force-pushed e7e589f to a9199c3
One more conflict from #135 merging — should be the last one. Sorry for the cascade!
Force-pushed a9199c3 to 10ea480
Almost there — 6 lint errors, all semicolons in tests (
MCP tool_add_drawer:
- Make drawer_id content-based: hash full content instead of content[:100] + timestamp. Same content → same ID, eliminating TOCTOU race conditions
- Switch from col.add() to col.upsert() so re-filing with updated content updates the existing drawer

miner.add_drawer:
- Switch from collection.add() to collection.upsert() so re-mining a modified file updates instead of silently failing
- Remove the try/except catching 'already exists' — upsert handles this naturally

Findings: MemPalace#11 (HIGH — add ignores updates), MemPalace#6 (MEDIUM — TOCTOU), MemPalace#13 (MEDIUM — non-deterministic IDs)

Includes test infrastructure from PR MemPalace#131. 92 tests pass.
…cleanup ChromaDB handles
Force-pushed 6a34cc9 to a0bcd0c
Problem
1. Data stagnation in miner (HIGH)

`miner.add_drawer()` uses `collection.add()`, which throws on duplicate IDs. Modified files never get their palace content updated.

2. Non-deterministic drawer IDs in MCP (MEDIUM)

`tool_add_drawer()` generates IDs using `content[:100] + datetime.now().isoformat()`. Same content at different times → different IDs → duplicate entries.

Fix
Deterministic IDs
Same content always produces the same ID. Combined with upsert, this makes add_drawer idempotent.
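A minimal sketch of a content-based ID scheme along these lines — the function name, hash choice, and prefix are illustrative assumptions, not the PR's exact code:

```python
import hashlib


def drawer_id(wing: str, room: str, content: str) -> str:
    """Deterministic drawer ID: hashing the full content (no timestamp)
    means identical content always maps to the same ID."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()[:16]
    return f"drawer_{wing}_{room}_{digest}"
```

Because the ID depends only on the inputs, re-filing the same content hits the same ID, and upsert turns the second write into an update rather than a duplicate.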
Upsert
Both `miner.add_drawer()` and `tool_add_drawer()` now use `collection.upsert()`.

92 tests pass (includes test infrastructure from PR #131).
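To illustrate why the switch matters, here is a toy in-memory analogue of the two write modes (not ChromaDB itself, just the semantics): add() rejects an existing ID, so stale content sticks around, while upsert() overwrites it in place:

```python
class ToyCollection:
    """In-memory stand-in mimicking add-vs-upsert semantics."""

    def __init__(self):
        self.docs = {}

    def add(self, id: str, document: str) -> None:
        if id in self.docs:
            # the failure mode the PR fixes: stale data stays
            raise ValueError(f"ID already exists: {id}")
        self.docs[id] = document

    def upsert(self, id: str, document: str) -> None:
        # insert if new, overwrite if the ID already exists
        self.docs[id] = document


col = ToyCollection()
col.upsert("drawer_1", "v1")
col.upsert("drawer_1", "v2")  # re-filing updates in place
```

Combined with deterministic IDs, upsert makes repeated writes of the same logical drawer idempotent.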