Content hash DB with bloom filter #424
Conversation
- Fix check_and_add() bug: verify with _hash_set for O(1) lookup before bloom
- Add flush() method for batch saving (saves to disk only when needed)
- Decouple: move BloomFilter and ContentHashDB to dedicated content_hash.py
- Update tests to use pytest tmp_path fixture
- Fix exception handling to (OSError, IOError)
- Add file_already_mined() fallback to ChromaDB as safety net
web3guru888 left a comment
Review: Content Hash DB with Bloom Filter
This directly parallels what we built in our integration, so I can offer a detailed comparison.
Architecture Comparison
Your approach: SHA256 content hash → bloom filter fast-path → JSON-backed ContentHashDB → ChromaDB fallback
Our approach: Content hash (SHA256) → bloom filter pre-check → tiered cosine similarity dedup (hard=0.86, soft=0.55 thresholds) scoped to wing+room
The key difference: your approach catches exact content duplicates at the file level (same bytes = same hash), while ours catches semantic near-duplicates at the chunk level (different words, same meaning). Both are needed for production — yours is the I/O optimization layer, ours is the embedding quality layer.
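For readers outside the project, the file-level layer boils down to hashing raw bytes before any parsing. A minimal sketch, not the PR's actual code (hash_file and the chunk size are illustrative names):

```python
import hashlib
from pathlib import Path

def hash_file(path: Path) -> str:
    """SHA-256 hex digest of a file's raw bytes (illustrative helper)."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        # Stream in chunks so large files never need to fit in memory.
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()
```

Two files with identical bytes produce identical digests, which is exactly what this layer catches; paraphrased content is left to the semantic dedup layer.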
Code Review
BloomFilter implementation — Clean. Using md5(item + str(i)) for hash functions is the standard "double hashing with string salt" approach. Reasonable defaults (100k capacity, 1% FPR).
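If that reading is right, the per-item bit indexes come out of something like the sketch below (not the PR's code; size and hash_count are the usual bloom-filter parameters):

```python
import hashlib

def bloom_indexes(item: str, hash_count: int, size: int) -> list[int]:
    # Simulate k independent hash functions by salting the item with the
    # function index, then folding md5 into the bit-array range.
    return [
        int(hashlib.md5((item + str(i)).encode()).hexdigest(), 16) % size
        for i in range(hash_count)
    ]
```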
check_and_add() correctness — Good that you verify against _hash_set (O(1) set lookup) rather than relying solely on the bloom filter. The bloom filter is correctly used as a fast negative check only. The fix note about this is appreciated.
json.dump({"array_size": self.size, "hash_count": self.hash_count, "array": self.array}, f)With 100k capacity and 1% FPR, self.size = ~958,506 booleans. That's a 958k-element JSON array of true/false values → ~6-7 MB on disk. This works fine, but consider using a bitarray or struct.pack for the save/load path if performance becomes an issue at scale. For now it's fine — correctness over optimization.
mine() integration:
The check_and_add() call happens before processing, then flush() at the end. If the process crashes mid-mine, the in-memory bloom filter has entries that never persisted. On restart, those files will be re-processed (which is safe — worst case is redundant work). But if _save() happened to persist the JSON hashes dict without the bloom, the bloom becomes stale. Consider calling flush() periodically (e.g., every N files) rather than only at the end.
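A sketch of that periodic-flush idea, reusing the hash_file helper sketched earlier and assuming check_and_add() returns True for an already-seen hash; FLUSH_EVERY and the loop shape are illustrative, not the PR's actual mine() code:

```python
FLUSH_EVERY = 200  # illustrative batch size

def mine_files(files, hash_db, process):
    seen_since_flush = 0
    for path in files:
        digest = hash_file(path)           # SHA-256 of the file's raw bytes
        if hash_db.check_and_add(digest):  # assumed: True means duplicate
            continue
        process(path)
        seen_since_flush += 1
        if seen_since_flush >= FLUSH_EVERY:
            hash_db.flush()                # bound the work lost on a crash
            seen_since_flush = 0
    hash_db.flush()                        # persist the final partial batch
```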
Missing: import json in miner.py — The diff adds import json to miner.py but I don't see it used there. The content_hash module has its own json import. Is the miner import leftover from an earlier iteration?
Test coverage — Good. test_different_files_same_content is the critical test for content-hash dedup and it's passing. The test_false_positive_handled test name is slightly misleading (it tests normal add flow, not actual false positive recovery), but the logic is sound.
Integration suggestion
This pairs well with #280's semantic dedup. Content hash catches byte-identical duplicates at I/O time (fast, O(1)), and semantic dedup catches near-identical content at embedding time (slower, but catches paraphrases). Consider noting this complementarity in the README.
🔭 Reviewed as part of the MemPalace-AGI integration project — autonomous research with perfect memory. Community interaction updates are posted regularly on the dashboard.
Use bloom filter as first check, then verify with _hash_set:
- If in bloom and in _hash_set: duplicate
- If in bloom but not in _hash_set: add to hash set (false positive case)
- If not in bloom: add to both bloom and hash set
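In code, that decision tree reads roughly as follows; bloom.contains()/bloom.add() and the True-for-duplicate return value are assumptions layered on the commit message, not the PR's exact implementation:

```python
def check_and_add(self, content_hash: str) -> bool:
    """Return True if content_hash was seen before, otherwise record it."""
    if self.bloom.contains(content_hash):
        if content_hash in self._hash_set:
            return True                    # in bloom and in set: real duplicate
        self._hash_set.add(content_hash)   # bloom false positive: record it
        return False
    self.bloom.add(content_hash)           # definitely new: no false negatives
    self._hash_set.add(content_hash)
    return False
```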
Load hashes from content DB, then rebuild bloom filter from them instead of loading stale bloom file
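A sketch of that rebuild step, assuming the BloomFilter constructor takes the capacity/error-rate defaults mentioned in the review:

```python
def rebuild_bloom(known_hashes, capacity=100_000, error_rate=0.01):
    # Rebuilding from the authoritative content DB means the bloom filter
    # can never be staler than the hashes it is meant to summarize.
    bloom = BloomFilter(capacity=capacity, error_rate=error_rate)
    for content_hash in known_hashes:
        bloom.add(content_hash)
    return bloom
```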
- Use sqlite3 instead of JSON for persistent storage
- Add close() method to properly close DB connection
- Change file extension from .json to .db
- Add hashes property for backward compatibility
… stale rebuilds
- Add ContentHashDB for O(1) duplicate detection via BloomFilter
- Fix miners to check version gate BEFORE hash DB to allow stale rebuilds
- Add hash_db.record() after registry writes for short files
- Handle IntegrityError on duplicate insert in content_hash.py
- Fix dry_run path missing hash_db initialization in miner.py
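Pulling the sqlite3 changes together, a minimal skeleton might look like the following; the table name, columns, and method signatures are guesses from the commit messages, not the actual content_hash.py:

```python
import sqlite3

class ContentHashDB:
    def __init__(self, db_path: str):
        self._conn = sqlite3.connect(db_path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS content_hashes ("
            "hash TEXT PRIMARY KEY, file_path TEXT)"
        )
        self._conn.commit()

    def record(self, content_hash: str, file_path: str) -> None:
        try:
            self._conn.execute(
                "INSERT INTO content_hashes (hash, file_path) VALUES (?, ?)",
                (content_hash, file_path),
            )
            self._conn.commit()
        except sqlite3.IntegrityError:
            pass  # duplicate primary key: hash already recorded, nothing to do

    @property
    def hashes(self) -> set:
        # Backward-compatible view for callers that expect the old shape.
        return {row[0] for row in
                self._conn.execute("SELECT hash FROM content_hashes")}

    def close(self) -> None:
        self._conn.close()
```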
Summary
Fixes the content hash database with bloom filter feature from PR #282 with the following improvements.

Changes
- check_and_add() now correctly verifies content hashes using _hash_set for O(1) lookup before relying on the bloom filter (prevents false-positive duplicates)
- Added flush() method to batch-save the bloom filter to disk only after batch operations
- Added file_already_mined() fallback to ChromaDB when the JSON file is deleted or corrupted
- Moved the BloomFilter and ContentHashDB classes to a dedicated content_hash.py module
- Updated tests to use the pytest tmp_path fixture
- Changed exception handling from except Exception to (OSError, IOError)

Why this matters
Previously, mempalace mine would re-read files even if they had not changed. Now it:
- computes a SHA-256 content hash for each file
- checks that hash against the bloom filter and the content hash DB (with a ChromaDB fallback)
- skips files whose content has already been mined

This makes re-mining projects much faster and avoids redundant I/O.