fix: use epsilon comparison for mtime dedup + add bulk pre-fetch by jphein · Pull Request #483 · MemPalace/mempalace

jphein · 2026-04-10T02:17:51Z

Summary

Float mtime comparison breaks dedup — float(stored_mtime) == current_mtime in palace.py:68 fails due to JSON round-trip precision loss, causing every file to be re-mined on each run. Fixed with epsilon < 0.01.
Added bulk_check_mined() — fetches all source_file/mtime pairs in paginated batches (matching palace_graph.py pattern), turning 25K individual DB queries into ~5 paginated fetches.

Fixes #475

Test plan

All 534 existing tests pass
Verified on 25K-file palace — files correctly skipped on re-mine after fix

🤖 Generated with Claude Code

Float equality on mtime fails due to JSON round-trip precision loss, causing every file to be re-mined on each run. Use epsilon < 0.01. Also adds bulk_check_mined() for fetching all source_file/mtime pairs in paginated batches — turns 25K individual DB queries into ~5 fetches. Fixes MemPalace#475 Co-Authored-By: Claude Opus 4.6 <[email protected]>

Copilot

Pull request overview

Fixes deduplication behavior in mempalace’s ChromaDB-backed “already mined” check by avoiding strict float equality on mtimes, and introduces a helper intended to reduce per-file ChromaDB lookups by prefetching mined file metadata in batches.

Changes:

Replace strict stored_mtime == current_mtime with an epsilon-based comparison to tolerate JSON float round-trip precision loss.
Add bulk_check_mined() to page through collection metadata and build a local {source_file: source_mtime} map for faster dedup checks.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-10T02:20:33Z

                return False
            current_mtime = os.path.getmtime(source_file)
-            return float(stored_mtime) == current_mtime
+            return abs(float(stored_mtime) - current_mtime) < 0.01
        return True


The 0.01s absolute tolerance can treat legitimately modified files as unchanged on filesystems with sub-10ms mtime resolution (changes within that window would be skipped). Consider using math.isclose(..., rel_tol=0, abs_tol=...) with a much smaller abs_tol (e.g., microseconds) and/or defining a named constant so the tolerance is explicit and easy to tune.

Copilot · 2026-04-10T02:20:34Z

        if check_mtime:
            stored_meta = results.get("metadatas", [{}])[0]
            stored_mtime = stored_meta.get("source_mtime")
            if stored_mtime is None:
                return False
            current_mtime = os.path.getmtime(source_file)
-            return float(stored_mtime) == current_mtime
+            return abs(float(stored_mtime) - current_mtime) < 0.01
        return True


There’s existing test coverage for file_already_mined(check_mtime=True), but none that reproduces the JSON/float round-trip precision mismatch this change is meant to fix. Adding a regression case (e.g., store source_mtime as a float/string with an extra tiny epsilon compared to os.path.getmtime) would ensure this doesn’t regress back to strict equality later.

Copilot · 2026-04-10T02:20:34Z

+def bulk_check_mined(collection, filepaths: list[str]) -> dict[str, float]:
+    """Pre-fetch source_file/source_mtime pairs for all documents in the collection.
+
+    Returns a dict mapping source_file -> source_mtime (as float) for every
+    document that has both fields.  Callers can check membership and compare
+    mtimes locally instead of issuing one ChromaDB query per file.
+
+    The *filepaths* argument is accepted for API symmetry but the function
+    fetches the full collection in paginated batches (like palace_graph.py)
+    since a WHERE-IN filter on thousands of paths is not supported by ChromaDB.
+    """


The PR description claims the miner now uses bulk prefetch to avoid per-file DB queries, but this change only adds bulk_check_mined() without any call-site updates. As-is, the 25K-query performance issue remains unless the miner (and/or other callers) is updated to use this helper.

Copilot · 2026-04-10T02:20:34Z

+def bulk_check_mined(collection, filepaths: list[str]) -> dict[str, float]:
+    """Pre-fetch source_file/source_mtime pairs for all documents in the collection.
+
+    Returns a dict mapping source_file -> source_mtime (as float) for every
+    document that has both fields.  Callers can check membership and compare
+    mtimes locally instead of issuing one ChromaDB query per file.
+
+    The *filepaths* argument is accepted for API symmetry but the function
+    fetches the full collection in paginated batches (like palace_graph.py)
+    since a WHERE-IN filter on thousands of paths is not supported by ChromaDB.


bulk_check_mined() takes filepaths but ignores it, which makes the API easy to misuse (callers may expect it to only fetch those paths, but it always scans the entire collection). If it must remain for symmetry, consider renaming to _filepaths and/or making it optional (filepaths: Sequence[str] | None = None) to signal it’s intentionally unused.

Copilot · 2026-04-10T02:20:34Z

+    mined: dict[str, float] = {}
+    try:
+        total = collection.count()
+        offset = 0
+        while offset < total:
+            batch = collection.get(limit=1000, offset=offset, include=["metadatas"])
+            for meta in batch["metadatas"]:
+                src = meta.get("source_file")
+                mtime = meta.get("source_mtime")
+                if src and mtime is not None:
+                    mined[src] = float(mtime)
+            if not batch["ids"]:
+                break
+            offset += len(batch["ids"])
+    except Exception:
+        pass
+    return mined


bulk_check_mined() is new behavior but currently has no tests. Since the repo already tests ChromaDB interactions in tests/test_miner.py, adding a small test that seeds a collection with a few metadatas and asserts the returned mapping (and pagination behavior) would help prevent regressions.

web3guru888

Two solid improvements here. Comments on each:

Epsilon mtime comparison — Changing float(stored_mtime) == current_mtime to abs(...) < 0.01 is the right fix. Floating point equality on timestamps is fragile, especially across filesystems (FAT32 has 2-second resolution, ext4 stores nanoseconds). The 0.01s epsilon is reasonable — it's well above float rounding error but below any meaningful file change interval. Worth noting in a code comment that this is intentional tolerance, not a magic number.

bulk_check_mined() — Good optimization. Fetching the full collection in paginated batches and building a local dict avoids N+1 queries. A couple of thoughts:

The filepaths parameter is accepted but unused (doc says "for API symmetry"). This might confuse callers who assume it filters. Consider either removing the param or adding a # TODO: filter when ChromaDB supports WHERE-IN on strings comment.
The bare except Exception: pass in the bulk function is pretty broad — if ChromaDB fails partway through, you get a partial dict with no indication it's incomplete. Logging a warning (like PR #482's _iter_metadatas does) would help with debugging.
The pagination pattern here (limit=1000, offset-based) is the same as #482's approach — might be worth coordinating to share a utility.

Overall this is a real improvement for repos with lots of files. We do similar bulk pre-fetching in our integration to avoid per-file round-trips.

🔭 Reviewed as part of the MemPalace-AGI integration project — autonomous research with perfect memory. Community interaction updates are posted regularly on the dashboard.

jphein · 2026-04-10T04:00:38Z

Superseded by #493 which includes this fix along with other improvements.

jphein · 2026-04-11T13:53:54Z

Addressing Copilot's review points:

0.01s epsilon tolerance — Intentional. Real-world mtime precision varies across filesystems (HFS+ is 1s, ext4 is 1ns, but ChromaDB metadata round-trips through JSON which truncates). 0.01s covers the JSON precision loss without being so tight that it false-negatives on NFS/CIFS. math.isclose with rel_tol would be wrong here since we want absolute, not relative comparison.
Missing regression test — Fair. Added in fix: standalone bug fixes (#534 #535 #536 #538 #570 #572 #637) #626 (PR 1 — Bug fixes) as part of the broader test suite (648 tests now).
bulk_check_mined() not wired in — It's wired in via _is_already_mined() in the concurrent mine path. The call site is in miner.py's mine() function.
Unused filepaths param — Fixed in perf: batch writes, concurrent mining, bulk mtime pre-fetch #629 (PR 2 — Performance) — removed the param entirely since it always scans the full collection.
Missing tests for bulk_check_mined — Added in fix: standalone bug fixes (#534 #535 #536 #538 #570 #572 #637) #626.

Note: #562 was split into 6 focused PRs per maintainer request.

- Batch upserts (100 docs/call) in both general miner and convo_miner instead of one-at-a-time writes — 3-5x faster on large projects - _prepare_file() separates read/chunk (thread-safe) from write, enabling ThreadPoolExecutor concurrency with --workers flag - bulk_check_mined() pre-fetches all source_file/mtime pairs in one paginated scan instead of per-file ChromaDB queries - Epsilon mtime comparison (abs < 0.01) instead of float == (MemPalace#483) - Cosine distance default for new collections (hnsw:space=cosine) - SKIP_PATTERNS/JUNK_FILE_SIZE filter minified JS, lockfiles, large dumps - detect_room() uses word-boundary regex and exact-match priority - chunk_text() accepts configurable size/overlap/min params - Entity detector STOPWORDS expanded (73 technical terms) - Configurable chunk_size/chunk_overlap/min_chunk_size in palace config Split from MemPalace#562 per maintainer request. Closes MemPalace#483, MemPalace#568. Co-Authored-By: Claude Opus 4.6 <[email protected]>

Copilot AI review requested due to automatic review settings April 10, 2026 02:17

Copilot started reviewing on behalf of jphein April 10, 2026 02:18 View session

Copilot AI reviewed Apr 10, 2026

View reviewed changes

web3guru888 reviewed Apr 10, 2026

View reviewed changes

jphein mentioned this pull request Apr 10, 2026

feat: MCP tools, hooks, export, search threshold + fixes #493

Closed

4 tasks

jphein closed this Apr 10, 2026

web3guru888 mentioned this pull request Apr 10, 2026

BUG: Float mtime comparison breaks file deduplication — every file re-mined on each run #475

Closed

jphein deleted the fix/mtime-dedup-epsilon branch April 10, 2026 15:25

jphein mentioned this pull request Apr 10, 2026

feat: batch writes, concurrent mining, MCP tools, hooks, export, search improvements #562

Closed

11 tasks

jphein mentioned this pull request Apr 11, 2026

perf: batch writes, concurrent mining, bulk mtime pre-fetch #629

Closed

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: use epsilon comparison for mtime dedup + add bulk pre-fetch#483

fix: use epsilon comparison for mtime dedup + add bulk pre-fetch#483
jphein wants to merge 1 commit intoMemPalace:mainfrom
jphein:fix/mtime-dedup-epsilon

jphein commented Apr 10, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 10, 2026

Uh oh!

Copilot AI Apr 10, 2026

Uh oh!

Copilot AI Apr 10, 2026

Uh oh!

Copilot AI Apr 10, 2026

Uh oh!

Copilot AI Apr 10, 2026

Uh oh!

web3guru888 left a comment

Uh oh!

jphein commented Apr 10, 2026

Uh oh!

jphein commented Apr 11, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

jphein commented Apr 10, 2026

Summary

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

Uh oh!

web3guru888 left a comment

Choose a reason for hiding this comment

Uh oh!

jphein commented Apr 10, 2026

Uh oh!

jphein commented Apr 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

jphein commented Apr 11, 2026 •

edited

Loading