fix: use epsilon comparison for mtime dedup + add bulk pre-fetch#483
fix: use epsilon comparison for mtime dedup + add bulk pre-fetch#483jphein wants to merge 1 commit intoMemPalace:mainfrom
Conversation
Float equality on mtime fails due to JSON round-trip precision loss, causing every file to be re-mined on each run. Use epsilon < 0.01. Also adds bulk_check_mined() for fetching all source_file/mtime pairs in paginated batches — turns 25K individual DB queries into ~5 fetches. Fixes MemPalace#475 Co-Authored-By: Claude Opus 4.6 <[email protected]>
There was a problem hiding this comment.
Pull request overview
Fixes deduplication behavior in mempalace’s ChromaDB-backed “already mined” check by avoiding strict float equality on mtimes, and introduces a helper intended to reduce per-file ChromaDB lookups by prefetching mined file metadata in batches.
Changes:
- Replace strict
stored_mtime == current_mtimewith an epsilon-based comparison to tolerate JSON float round-trip precision loss. - Add
bulk_check_mined()to page through collection metadata and build a local{source_file: source_mtime}map for faster dedup checks.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| return False | ||
| current_mtime = os.path.getmtime(source_file) | ||
| return float(stored_mtime) == current_mtime | ||
| return abs(float(stored_mtime) - current_mtime) < 0.01 | ||
| return True |
There was a problem hiding this comment.
The 0.01s absolute tolerance can treat legitimately modified files as unchanged on filesystems with sub-10ms mtime resolution (changes within that window would be skipped). Consider using math.isclose(..., rel_tol=0, abs_tol=...) with a much smaller abs_tol (e.g., microseconds) and/or defining a named constant so the tolerance is explicit and easy to tune.
| if check_mtime: | ||
| stored_meta = results.get("metadatas", [{}])[0] | ||
| stored_mtime = stored_meta.get("source_mtime") | ||
| if stored_mtime is None: | ||
| return False | ||
| current_mtime = os.path.getmtime(source_file) | ||
| return float(stored_mtime) == current_mtime | ||
| return abs(float(stored_mtime) - current_mtime) < 0.01 | ||
| return True |
There was a problem hiding this comment.
There’s existing test coverage for file_already_mined(check_mtime=True), but none that reproduces the JSON/float round-trip precision mismatch this change is meant to fix. Adding a regression case (e.g., store source_mtime as a float/string with an extra tiny epsilon compared to os.path.getmtime) would ensure this doesn’t regress back to strict equality later.
| def bulk_check_mined(collection, filepaths: list[str]) -> dict[str, float]: | ||
| """Pre-fetch source_file/source_mtime pairs for all documents in the collection. | ||
|
|
||
| Returns a dict mapping source_file -> source_mtime (as float) for every | ||
| document that has both fields. Callers can check membership and compare | ||
| mtimes locally instead of issuing one ChromaDB query per file. | ||
|
|
||
| The *filepaths* argument is accepted for API symmetry but the function | ||
| fetches the full collection in paginated batches (like palace_graph.py) | ||
| since a WHERE-IN filter on thousands of paths is not supported by ChromaDB. | ||
| """ |
There was a problem hiding this comment.
The PR description claims the miner now uses bulk prefetch to avoid per-file DB queries, but this change only adds bulk_check_mined() without any call-site updates. As-is, the 25K-query performance issue remains unless the miner (and/or other callers) is updated to use this helper.
| def bulk_check_mined(collection, filepaths: list[str]) -> dict[str, float]: | ||
| """Pre-fetch source_file/source_mtime pairs for all documents in the collection. | ||
|
|
||
| Returns a dict mapping source_file -> source_mtime (as float) for every | ||
| document that has both fields. Callers can check membership and compare | ||
| mtimes locally instead of issuing one ChromaDB query per file. | ||
|
|
||
| The *filepaths* argument is accepted for API symmetry but the function | ||
| fetches the full collection in paginated batches (like palace_graph.py) | ||
| since a WHERE-IN filter on thousands of paths is not supported by ChromaDB. |
There was a problem hiding this comment.
bulk_check_mined() takes filepaths but ignores it, which makes the API easy to misuse (callers may expect it to only fetch those paths, but it always scans the entire collection). If it must remain for symmetry, consider renaming to _filepaths and/or making it optional (filepaths: Sequence[str] | None = None) to signal it’s intentionally unused.
| mined: dict[str, float] = {} | ||
| try: | ||
| total = collection.count() | ||
| offset = 0 | ||
| while offset < total: | ||
| batch = collection.get(limit=1000, offset=offset, include=["metadatas"]) | ||
| for meta in batch["metadatas"]: | ||
| src = meta.get("source_file") | ||
| mtime = meta.get("source_mtime") | ||
| if src and mtime is not None: | ||
| mined[src] = float(mtime) | ||
| if not batch["ids"]: | ||
| break | ||
| offset += len(batch["ids"]) | ||
| except Exception: | ||
| pass | ||
| return mined |
There was a problem hiding this comment.
bulk_check_mined() is new behavior but currently has no tests. Since the repo already tests ChromaDB interactions in tests/test_miner.py, adding a small test that seeds a collection with a few metadatas and asserts the returned mapping (and pagination behavior) would help prevent regressions.
web3guru888
left a comment
There was a problem hiding this comment.
Two solid improvements here. Comments on each:
Epsilon mtime comparison — Changing float(stored_mtime) == current_mtime to abs(...) < 0.01 is the right fix. Floating point equality on timestamps is fragile, especially across filesystems (FAT32 has 2-second resolution, ext4 stores nanoseconds). The 0.01s epsilon is reasonable — it's well above float rounding error but below any meaningful file change interval. Worth noting in a code comment that this is intentional tolerance, not a magic number.
bulk_check_mined() — Good optimization. Fetching the full collection in paginated batches and building a local dict avoids N+1 queries. A couple of thoughts:
-
The
filepathsparameter is accepted but unused (doc says "for API symmetry"). This might confuse callers who assume it filters. Consider either removing the param or adding a# TODO: filter when ChromaDB supports WHERE-IN on stringscomment. -
The bare
except Exception: passin the bulk function is pretty broad — if ChromaDB fails partway through, you get a partial dict with no indication it's incomplete. Logging a warning (like PR #482's_iter_metadatasdoes) would help with debugging. -
The pagination pattern here (
limit=1000,offset-based) is the same as #482's approach — might be worth coordinating to share a utility.
Overall this is a real improvement for repos with lots of files. We do similar bulk pre-fetching in our integration to avoid per-file round-trips.
🔭 Reviewed as part of the MemPalace-AGI integration project — autonomous research with perfect memory. Community interaction updates are posted regularly on the dashboard.
|
Superseded by #493 which includes this fix along with other improvements. |
|
Addressing Copilot's review points:
Note: #562 was split into 6 focused PRs per maintainer request. |
- Batch upserts (100 docs/call) in both general miner and convo_miner instead of one-at-a-time writes — 3-5x faster on large projects - _prepare_file() separates read/chunk (thread-safe) from write, enabling ThreadPoolExecutor concurrency with --workers flag - bulk_check_mined() pre-fetches all source_file/mtime pairs in one paginated scan instead of per-file ChromaDB queries - Epsilon mtime comparison (abs < 0.01) instead of float == (MemPalace#483) - Cosine distance default for new collections (hnsw:space=cosine) - SKIP_PATTERNS/JUNK_FILE_SIZE filter minified JS, lockfiles, large dumps - detect_room() uses word-boundary regex and exact-match priority - chunk_text() accepts configurable size/overlap/min params - Entity detector STOPWORDS expanded (73 technical terms) - Configurable chunk_size/chunk_overlap/min_chunk_size in palace config Split from MemPalace#562 per maintainer request. Closes MemPalace#483, MemPalace#568. Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Batch upserts (100 docs/call) in both general miner and convo_miner instead of one-at-a-time writes — 3-5x faster on large projects - _prepare_file() separates read/chunk (thread-safe) from write, enabling ThreadPoolExecutor concurrency with --workers flag - bulk_check_mined() pre-fetches all source_file/mtime pairs in one paginated scan instead of per-file ChromaDB queries - Epsilon mtime comparison (abs < 0.01) instead of float == (MemPalace#483) - Cosine distance default for new collections (hnsw:space=cosine) - SKIP_PATTERNS/JUNK_FILE_SIZE filter minified JS, lockfiles, large dumps - detect_room() uses word-boundary regex and exact-match priority - chunk_text() accepts configurable size/overlap/min params - Entity detector STOPWORDS expanded (73 technical terms) - Configurable chunk_size/chunk_overlap/min_chunk_size in palace config Split from MemPalace#562 per maintainer request. Closes MemPalace#483, MemPalace#568. Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Batch upserts (100 docs/call) in both general miner and convo_miner instead of one-at-a-time writes — 3-5x faster on large projects - _prepare_file() separates read/chunk (thread-safe) from write, enabling ThreadPoolExecutor concurrency with --workers flag - bulk_check_mined() pre-fetches all source_file/mtime pairs in one paginated scan instead of per-file ChromaDB queries - Epsilon mtime comparison (abs < 0.01) instead of float == (MemPalace#483) - Cosine distance default for new collections (hnsw:space=cosine) - SKIP_PATTERNS/JUNK_FILE_SIZE filter minified JS, lockfiles, large dumps - detect_room() uses word-boundary regex and exact-match priority - chunk_text() accepts configurable size/overlap/min params - Entity detector STOPWORDS expanded (73 technical terms) - Configurable chunk_size/chunk_overlap/min_chunk_size in palace config Split from MemPalace#562 per maintainer request. Closes MemPalace#483, MemPalace#568. Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Batch upserts (100 docs/call) in both general miner and convo_miner instead of one-at-a-time writes — 3-5x faster on large projects - _prepare_file() separates read/chunk (thread-safe) from write, enabling ThreadPoolExecutor concurrency with --workers flag - bulk_check_mined() pre-fetches all source_file/mtime pairs in one paginated scan instead of per-file ChromaDB queries - Epsilon mtime comparison (abs < 0.01) instead of float == (MemPalace#483) - Cosine distance default for new collections (hnsw:space=cosine) - SKIP_PATTERNS/JUNK_FILE_SIZE filter minified JS, lockfiles, large dumps - detect_room() uses word-boundary regex and exact-match priority - chunk_text() accepts configurable size/overlap/min params - Entity detector STOPWORDS expanded (73 technical terms) - Configurable chunk_size/chunk_overlap/min_chunk_size in palace config Split from MemPalace#562 per maintainer request. Closes MemPalace#483, MemPalace#568. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Summary
float(stored_mtime) == current_mtimeinpalace.py:68fails due to JSON round-trip precision loss, causing every file to be re-mined on each run. Fixed with epsilon< 0.01.bulk_check_mined()— fetches all source_file/mtime pairs in paginated batches (matchingpalace_graph.pypattern), turning 25K individual DB queries into ~5 paginated fetches.Fixes #475
Test plan
🤖 Generated with Claude Code