fix: use upsert and deterministic IDs to prevent data stagnation #140
bensig merged 5 commits into MemPalace:main from
Conversation
Pull request overview
Updates drawer ingestion to avoid stale data by enabling overwrites on existing IDs, and adds/expands automated tests while migrating packaging metadata toward uv/hatchling.
Changes:
- Switch ChromaDB writes from `add()` to `upsert()` in the miner and MCP write path, and make MCP drawer IDs deterministic.
- Add substantial pytest coverage for MCP server tools, search API, dialect, and knowledge graph behavior with shared fixtures.
- Remove `requirements.txt` and update `pyproject.toml` build/dev tooling configuration.
Reviewed changes
Copilot reviewed 9 out of 11 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| mempalace/miner.py | Uses collection.upsert() for drawer writes (intended to allow updates). |
| mempalace/mcp_server.py | Uses deterministic content-based drawer IDs and upsert() for MCP-added drawers. |
| mempalace/__init__.py | Bumps __version__ to 3.0.0. |
| tests/conftest.py | Adds shared fixtures for isolated palace/KG setup and seeded Chroma collections. |
| tests/test_mcp_server.py | Adds unit/integration tests for MCP request dispatch and tool handlers. |
| tests/test_searcher.py | Adds tests for the search_memories() programmatic API and filters. |
| tests/test_knowledge_graph.py | Adds tests covering entity/triple lifecycle, temporal queries, timeline, and stats. |
| tests/test_dialect.py | Adds tests for AAAK dialect compression/encoding/decoding utilities. |
| pyproject.toml | Moves to hatchling, adds dependency groups, updates dev deps. |
| requirements.txt | Removed in favor of pyproject.toml/uv.lock. |
| """Add one drawer to the palace.""" | ||
| drawer_id = f"drawer_{wing}_{room}_{hashlib.md5((source_file + str(chunk_index)).encode()).hexdigest()[:16]}" | ||
| try: | ||
| collection.add( | ||
| collection.upsert( | ||
| documents=[content], |
Switching to collection.upsert() enables updating an existing drawer ID, but process_file() still short-circuits on file_already_mined() (same module) and returns before any chunks are (re)added. As a result, modified files will still never reach this upsert call, so the data-stagnation problem described in the PR remains. Consider changing the skip logic to re-mine when the source file changes (e.g., store and compare a file hash/mtime in metadata, or add a --force/--refresh mode) and optionally delete drawers for chunks that no longer exist after re-chunking.
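One hedged way to implement the suggested re-mine check — the names `file_hash`, `needs_mining`, and the stored-hash mapping are illustrative, not existing repo code; the real skip logic lives in `file_already_mined()`:

```python
import hashlib
from pathlib import Path


def file_hash(path: str) -> str:
    """Content hash of a source file, suitable for change detection."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def needs_mining(path: str, stored_hashes: dict, force: bool = False) -> bool:
    """Re-mine when the file is new, its content changed, or --force was passed.

    stored_hashes maps file path -> last-mined content hash (e.g. kept in
    drawer metadata or a sidecar index).
    """
    if force:
        return True
    return stored_hashes.get(path) != file_hash(path)
```

With this in place, `process_file()` could skip only when `needs_mining()` is false, and optionally delete drawers for chunks that vanished after re-chunking.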
```python
def collection(palace_path):
    """A ChromaDB collection pre-seeded in the temp palace."""
    client = chromadb.PersistentClient(path=palace_path)
    col = client.get_or_create_collection("mempalace_drawers")
    return col
```
The collection/seeded_collection fixtures create a chromadb.PersistentClient but never explicitly release it. On Windows, ChromaDB can hold file handles after tests finish, and the current shutil.rmtree(..., ignore_errors=True) teardown can silently leave temp directories behind. Consider explicitly dropping references / forcing GC in teardown (or adopting the existing Windows-safe cleanup pattern used elsewhere in the repo) so failures don’t get masked and temp dirs don’t accumulate.
Hey Igor — CI is failing, looks like it needs a rebase on main (we've merged a bunch of changes recently). Can you rebase and push?
Force-pushed e6ae81b to e9986bf
Force-pushed e9986bf to 45c2c92
Same situation — your other PRs landed and created conflicts here. One more rebase on main should do it.
Force-pushed e7e589f to a9199c3
One more conflict from #135 merging — should be the last one. Sorry for the cascade!
Force-pushed a9199c3 to 10ea480
Almost there — 6 lint errors, all semicolons in tests (
MCP tool_add_drawer:
- Make drawer_id content-based: hash full content instead of content[:100] + timestamp. Same content → same ID, eliminating TOCTOU race conditions
- Switch from col.add() to col.upsert() so re-filing with updated content updates the existing drawer

miner.add_drawer:
- Switch from collection.add() to collection.upsert() so re-mining a modified file updates instead of silently failing
- Remove the try/except catching 'already exists' — upsert handles this naturally

Findings: MemPalace#11 (HIGH — add ignores updates), MemPalace#6 (MEDIUM — TOCTOU), MemPalace#13 (MEDIUM — non-deterministic IDs)

Includes test infrastructure from PR MemPalace#131. 92 tests pass.
…cleanup ChromaDB handles
Force-pushed 6a34cc9 to a0bcd0c
Problem
1. Data stagnation in miner (HIGH)

`miner.add_drawer()` uses `collection.add()`, which throws on duplicate IDs. Modified files never get their palace content updated.

2. Non-deterministic drawer IDs in MCP (MEDIUM)

`tool_add_drawer()` generates IDs using `content[:100] + datetime.now().isoformat()`. Same content at different times → different IDs → duplicate entries.

Fix
Deterministic IDs
Same content always produces the same ID. Combined with upsert, this makes add_drawer idempotent.
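A minimal sketch of a content-based ID scheme along these lines — the function name, hash choice, and prefix are illustrative assumptions, not the PR's exact code:

```python
import hashlib


def drawer_id(wing: str, room: str, content: str) -> str:
    """Deterministic drawer ID: hashing the full content (no timestamp)
    means identical content always maps to the same ID."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()[:16]
    return f"drawer_{wing}_{room}_{digest}"
```

Because the ID depends only on the inputs, re-filing the same content hits the same ID, and upsert turns the second write into an update rather than a duplicate.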
Upsert
Both `miner.add_drawer()` and `tool_add_drawer()` now use `collection.upsert()`.

92 tests pass (includes test infrastructure from PR #131).
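To illustrate why the switch matters, here is a toy in-memory analogue of the two write modes (not ChromaDB itself, just the semantics): add() rejects an existing ID, so stale content sticks around, while upsert() overwrites it in place:

```python
class ToyCollection:
    """In-memory stand-in mimicking add-vs-upsert semantics."""

    def __init__(self):
        self.docs = {}

    def add(self, id: str, document: str) -> None:
        if id in self.docs:
            # the failure mode the PR fixes: stale data stays
            raise ValueError(f"ID already exists: {id}")
        self.docs[id] = document

    def upsert(self, id: str, document: str) -> None:
        # insert if new, overwrite if the ID already exists
        self.docs[id] = document


col = ToyCollection()
col.upsert("drawer_1", "v1")
col.upsert("drawer_1", "v2")  # re-filing updates in place
```

Combined with deterministic IDs, upsert makes repeated writes of the same logical drawer idempotent.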