Skip to content

fix: use epsilon comparison for mtime dedup + add bulk pre-fetch#483

Closed
jphein wants to merge 1 commit intoMemPalace:mainfrom
jphein:fix/mtime-dedup-epsilon
Closed

fix: use epsilon comparison for mtime dedup + add bulk pre-fetch#483
jphein wants to merge 1 commit intoMemPalace:mainfrom
jphein:fix/mtime-dedup-epsilon

Conversation

@jphein
Copy link
Copy Markdown
Collaborator

@jphein jphein commented Apr 10, 2026

Summary

  • Float mtime comparison breaks dedupfloat(stored_mtime) == current_mtime in palace.py:68 fails due to JSON round-trip precision loss, causing every file to be re-mined on each run. Fixed with epsilon < 0.01.
  • Added bulk_check_mined() — fetches all source_file/mtime pairs in paginated batches (matching palace_graph.py pattern), turning 25K individual DB queries into ~5 paginated fetches.

Fixes #475

Test plan

  • All 534 existing tests pass
  • Verified on 25K-file palace — files correctly skipped on re-mine after fix

🤖 Generated with Claude Code

Float equality on mtime fails due to JSON round-trip precision loss,
causing every file to be re-mined on each run. Use epsilon < 0.01.

Also adds bulk_check_mined() for fetching all source_file/mtime pairs
in paginated batches — turns 25K individual DB queries into ~5 fetches.

Fixes MemPalace#475

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Copilot AI review requested due to automatic review settings April 10, 2026 02:17
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes deduplication behavior in mempalace’s ChromaDB-backed “already mined” check by avoiding strict float equality on mtimes, and introduces a helper intended to reduce per-file ChromaDB lookups by prefetching mined file metadata in batches.

Changes:

  • Replace strict stored_mtime == current_mtime with an epsilon-based comparison to tolerate JSON float round-trip precision loss.
  • Add bulk_check_mined() to page through collection metadata and build a local {source_file: source_mtime} map for faster dedup checks.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread mempalace/palace.py
Comment on lines 66 to 69
return False
current_mtime = os.path.getmtime(source_file)
return float(stored_mtime) == current_mtime
return abs(float(stored_mtime) - current_mtime) < 0.01
return True
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 0.01s absolute tolerance can treat legitimately modified files as unchanged on filesystems with sub-10ms mtime resolution (changes within that window would be skipped). Consider using math.isclose(..., rel_tol=0, abs_tol=...) with a much smaller abs_tol (e.g., microseconds) and/or defining a named constant so the tolerance is explicit and easy to tune.

Copilot uses AI. Check for mistakes.
Comment thread mempalace/palace.py
Comment on lines 62 to 69
if check_mtime:
stored_meta = results.get("metadatas", [{}])[0]
stored_mtime = stored_meta.get("source_mtime")
if stored_mtime is None:
return False
current_mtime = os.path.getmtime(source_file)
return float(stored_mtime) == current_mtime
return abs(float(stored_mtime) - current_mtime) < 0.01
return True
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There’s existing test coverage for file_already_mined(check_mtime=True), but none that reproduces the JSON/float round-trip precision mismatch this change is meant to fix. Adding a regression case (e.g., store source_mtime as a float/string with an extra tiny epsilon compared to os.path.getmtime) would ensure this doesn’t regress back to strict equality later.

Copilot uses AI. Check for mistakes.
Comment thread mempalace/palace.py
Comment on lines +74 to +84
def bulk_check_mined(collection, filepaths: list[str]) -> dict[str, float]:
"""Pre-fetch source_file/source_mtime pairs for all documents in the collection.

Returns a dict mapping source_file -> source_mtime (as float) for every
document that has both fields. Callers can check membership and compare
mtimes locally instead of issuing one ChromaDB query per file.

The *filepaths* argument is accepted for API symmetry but the function
fetches the full collection in paginated batches (like palace_graph.py)
since a WHERE-IN filter on thousands of paths is not supported by ChromaDB.
"""
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description claims the miner now uses bulk prefetch to avoid per-file DB queries, but this change only adds bulk_check_mined() without any call-site updates. As-is, the 25K-query performance issue remains unless the miner (and/or other callers) is updated to use this helper.

Copilot uses AI. Check for mistakes.
Comment thread mempalace/palace.py
Comment on lines +74 to +83
def bulk_check_mined(collection, filepaths: list[str]) -> dict[str, float]:
"""Pre-fetch source_file/source_mtime pairs for all documents in the collection.

Returns a dict mapping source_file -> source_mtime (as float) for every
document that has both fields. Callers can check membership and compare
mtimes locally instead of issuing one ChromaDB query per file.

The *filepaths* argument is accepted for API symmetry but the function
fetches the full collection in paginated batches (like palace_graph.py)
since a WHERE-IN filter on thousands of paths is not supported by ChromaDB.
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bulk_check_mined() takes filepaths but ignores it, which makes the API easy to misuse (callers may expect it to only fetch those paths, but it always scans the entire collection). If it must remain for symmetry, consider renaming to _filepaths and/or making it optional (filepaths: Sequence[str] | None = None) to signal it’s intentionally unused.

Copilot uses AI. Check for mistakes.
Comment thread mempalace/palace.py
Comment on lines +85 to +101
mined: dict[str, float] = {}
try:
total = collection.count()
offset = 0
while offset < total:
batch = collection.get(limit=1000, offset=offset, include=["metadatas"])
for meta in batch["metadatas"]:
src = meta.get("source_file")
mtime = meta.get("source_mtime")
if src and mtime is not None:
mined[src] = float(mtime)
if not batch["ids"]:
break
offset += len(batch["ids"])
except Exception:
pass
return mined
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bulk_check_mined() is new behavior but currently has no tests. Since the repo already tests ChromaDB interactions in tests/test_miner.py, adding a small test that seeds a collection with a few metadatas and asserts the returned mapping (and pagination behavior) would help prevent regressions.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown

@web3guru888 web3guru888 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two solid improvements here. Comments on each:

Epsilon mtime comparison — Changing float(stored_mtime) == current_mtime to abs(...) < 0.01 is the right fix. Floating point equality on timestamps is fragile, especially across filesystems (FAT32 has 2-second resolution, ext4 stores nanoseconds). The 0.01s epsilon is reasonable — it's well above float rounding error but below any meaningful file change interval. Worth noting in a code comment that this is intentional tolerance, not a magic number.

bulk_check_mined() — Good optimization. Fetching the full collection in paginated batches and building a local dict avoids N+1 queries. A couple of thoughts:

  1. The filepaths parameter is accepted but unused (doc says "for API symmetry"). This might confuse callers who assume it filters. Consider either removing the param or adding a # TODO: filter when ChromaDB supports WHERE-IN on strings comment.

  2. The bare except Exception: pass in the bulk function is pretty broad — if ChromaDB fails partway through, you get a partial dict with no indication it's incomplete. Logging a warning (like PR #482's _iter_metadatas does) would help with debugging.

  3. The pagination pattern here (limit=1000, offset-based) is the same as #482's approach — might be worth coordinating to share a utility.

Overall this is a real improvement for repos with lots of files. We do similar bulk pre-fetching in our integration to avoid per-file round-trips.

🔭 Reviewed as part of the MemPalace-AGI integration project — autonomous research with perfect memory. Community interaction updates are posted regularly on the dashboard.

@jphein
Copy link
Copy Markdown
Collaborator Author

jphein commented Apr 10, 2026

Superseded by #493 which includes this fix along with other improvements.

@jphein
Copy link
Copy Markdown
Collaborator Author

jphein commented Apr 11, 2026

Addressing Copilot's review points:

  1. 0.01s epsilon tolerance — Intentional. Real-world mtime precision varies across filesystems (HFS+ is 1s, ext4 is 1ns, but ChromaDB metadata round-trips through JSON which truncates). 0.01s covers the JSON precision loss without being so tight that it false-negatives on NFS/CIFS. math.isclose with rel_tol would be wrong here since we want absolute, not relative comparison.

  2. Missing regression test — Fair. Added in fix: standalone bug fixes (#534 #535 #536 #538 #570 #572 #637) #626 (PR 1 — Bug fixes) as part of the broader test suite (648 tests now).

  3. bulk_check_mined() not wired in — It's wired in via _is_already_mined() in the concurrent mine path. The call site is in miner.py's mine() function.

  4. Unused filepaths param — Fixed in perf: batch writes, concurrent mining, bulk mtime pre-fetch #629 (PR 2 — Performance) — removed the param entirely since it always scans the full collection.

  5. Missing tests for bulk_check_mined — Added in fix: standalone bug fixes (#534 #535 #536 #538 #570 #572 #637) #626.

Note: #562 was split into 6 focused PRs per maintainer request.

jphein added a commit to jphein/mempalace that referenced this pull request Apr 11, 2026
- Batch upserts (100 docs/call) in both general miner and convo_miner
  instead of one-at-a-time writes — 3-5x faster on large projects
- _prepare_file() separates read/chunk (thread-safe) from write,
  enabling ThreadPoolExecutor concurrency with --workers flag
- bulk_check_mined() pre-fetches all source_file/mtime pairs in one
  paginated scan instead of per-file ChromaDB queries
- Epsilon mtime comparison (abs < 0.01) instead of float == (MemPalace#483)
- Cosine distance default for new collections (hnsw:space=cosine)
- SKIP_PATTERNS/JUNK_FILE_SIZE filter minified JS, lockfiles, large dumps
- detect_room() uses word-boundary regex and exact-match priority
- chunk_text() accepts configurable size/overlap/min params
- Entity detector STOPWORDS expanded (73 technical terms)
- Configurable chunk_size/chunk_overlap/min_chunk_size in palace config

Split from MemPalace#562 per maintainer request. Closes MemPalace#483, MemPalace#568.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
jphein added a commit to jphein/mempalace that referenced this pull request Apr 12, 2026
- Batch upserts (100 docs/call) in both general miner and convo_miner
  instead of one-at-a-time writes — 3-5x faster on large projects
- _prepare_file() separates read/chunk (thread-safe) from write,
  enabling ThreadPoolExecutor concurrency with --workers flag
- bulk_check_mined() pre-fetches all source_file/mtime pairs in one
  paginated scan instead of per-file ChromaDB queries
- Epsilon mtime comparison (abs < 0.01) instead of float == (MemPalace#483)
- Cosine distance default for new collections (hnsw:space=cosine)
- SKIP_PATTERNS/JUNK_FILE_SIZE filter minified JS, lockfiles, large dumps
- detect_room() uses word-boundary regex and exact-match priority
- chunk_text() accepts configurable size/overlap/min params
- Entity detector STOPWORDS expanded (73 technical terms)
- Configurable chunk_size/chunk_overlap/min_chunk_size in palace config

Split from MemPalace#562 per maintainer request. Closes MemPalace#483, MemPalace#568.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
jphein added a commit to jphein/mempalace that referenced this pull request Apr 12, 2026
- Batch upserts (100 docs/call) in both general miner and convo_miner
  instead of one-at-a-time writes — 3-5x faster on large projects
- _prepare_file() separates read/chunk (thread-safe) from write,
  enabling ThreadPoolExecutor concurrency with --workers flag
- bulk_check_mined() pre-fetches all source_file/mtime pairs in one
  paginated scan instead of per-file ChromaDB queries
- Epsilon mtime comparison (abs < 0.01) instead of float == (MemPalace#483)
- Cosine distance default for new collections (hnsw:space=cosine)
- SKIP_PATTERNS/JUNK_FILE_SIZE filter minified JS, lockfiles, large dumps
- detect_room() uses word-boundary regex and exact-match priority
- chunk_text() accepts configurable size/overlap/min params
- Entity detector STOPWORDS expanded (73 technical terms)
- Configurable chunk_size/chunk_overlap/min_chunk_size in palace config

Split from MemPalace#562 per maintainer request. Closes MemPalace#483, MemPalace#568.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
jphein added a commit to jphein/mempalace that referenced this pull request Apr 13, 2026
- Batch upserts (100 docs/call) in both general miner and convo_miner
  instead of one-at-a-time writes — 3-5x faster on large projects
- _prepare_file() separates read/chunk (thread-safe) from write,
  enabling ThreadPoolExecutor concurrency with --workers flag
- bulk_check_mined() pre-fetches all source_file/mtime pairs in one
  paginated scan instead of per-file ChromaDB queries
- Epsilon mtime comparison (abs < 0.01) instead of float == (MemPalace#483)
- Cosine distance default for new collections (hnsw:space=cosine)
- SKIP_PATTERNS/JUNK_FILE_SIZE filter minified JS, lockfiles, large dumps
- detect_room() uses word-boundary regex and exact-match priority
- chunk_text() accepts configurable size/overlap/min params
- Entity detector STOPWORDS expanded (73 technical terms)
- Configurable chunk_size/chunk_overlap/min_chunk_size in palace config

Split from MemPalace#562 per maintainer request. Closes MemPalace#483, MemPalace#568.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

BUG: Float mtime comparison breaks file deduplication — every file re-mined on each run

3 participants