
Content hash DB with bloom filter #424

Open
daedalus wants to merge 9 commits into MemPalace:develop from daedalus:bloom-filter-v3

Conversation

@daedalus daedalus commented Apr 9, 2026

Summary

  • Add a persistent content hash database that tracks file content via SHA256 hashes
  • Implement a bloom filter for fast O(1) duplicate checking before reading files from disk
  • Files with identical content (even in different locations) are now skipped automatically

Changes

  • mempalace/miner.py: Added BloomFilter and ContentHashDB classes. Updated mine() to use content hash checking instead of just path-based checking.
  • mempalace/convo_miner.py: Updated to use the same hash database for conversation file mining.
  • tests/test_hashdb.py: Comprehensive tests for bloom filter and content hash DB functionality.

Why this matters

Previously, mempalace mine would re-read files even if they had not changed. Now it:

  • Computes SHA256 hash of file content
  • Uses bloom filter for instant "definitely new" / "might be duplicate" checks
  • Only re-reads files when content hash is genuinely new

This makes re-mining projects much faster and avoids redundant I/O.
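
A minimal sketch of the hashing step described above (the function name and the mining-loop integration are illustrative, not the PR's exact code):

```python
import hashlib

def sha256_file(path, chunk_size=65536):
    """Hash a file's content in fixed-size chunks so large files never
    have to be loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The resulting hex digest is what the hash DB and bloom filter key on, so two files with identical bytes collapse to the same entry regardless of their paths.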

Picks up the content hash database + bloom filter feature from PR #282 and fixes it, with the following improvements:

  • Critical bug fix: check_and_add() now verifies content hashes against _hash_set (an O(1) set lookup) before trusting the bloom filter, preventing false-positive duplicates (a minimal sketch follows this list)
  • Performance: Add a flush() method so the bloom filter is saved to disk once per batch instead of on every add
  • Safety net: Keep file_already_mined() fallback to ChromaDB when JSON is deleted/corrupted
  • Decoupling: Move BloomFilter and ContentHashDB classes to dedicated content_hash.py module
  • Tests: Update to use pytest tmp_path fixture
  • Error handling: Narrow the broad except Exception to (OSError, IOError)
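
A minimal sketch of the corrected check_and_add() flow (method and attribute names follow the description above; the actual implementation may differ):

```python
def check_and_add(self, content_hash: str) -> bool:
    """Return True if content_hash is a known duplicate, otherwise record it.

    The bloom filter is only trusted as a fast negative check; a positive
    answer may be a false positive, so _hash_set is always consulted before
    reporting a duplicate.
    """
    if not self.bloom.might_contain(content_hash):
        # Definitely new: record in both structures.
        self.bloom.add(content_hash)
        self._hash_set.add(content_hash)
        return False
    if content_hash in self._hash_set:
        return True  # confirmed duplicate
    # Bloom false positive: the hash was never actually recorded.
    self._hash_set.add(content_hash)
    return False
```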

daedalus added 3 commits April 8, 2026 16:12
- Fix check_and_add() bug: verify with _hash_set for O(1) lookup before bloom
- Add flush() method for batch saving (saves to disk only when needed)
- Decouple: move BloomFilter and ContentHashDB to dedicated content_hash.py
- Update tests to use pytest tmp_path fixture
- Fix exception handling to (OSError, IOError)
- Add file_already_mined() fallback to ChromaDB as safety net

@web3guru888 web3guru888 left a comment

Review: Content Hash DB with Bloom Filter

This directly parallels what we built in our integration, so I can offer a detailed comparison.

Architecture Comparison

Your approach: SHA256 content hash → bloom filter fast-path → JSON-backed ContentHashDB → ChromaDB fallback
Our approach: Content hash (SHA256) → bloom filter pre-check → tiered cosine similarity dedup (hard=0.86, soft=0.55 thresholds) scoped to wing+room

The key difference: your approach catches exact content duplicates at the file level (same bytes = same hash), while ours catches semantic near-duplicates at the chunk level (different words, same meaning). Both are needed for production — yours is the I/O optimization layer, ours is the embedding quality layer.

Code Review

BloomFilter implementation — Clean. Using md5(item + str(i)) for hash functions is the standard "double hashing with string salt" approach. Reasonable defaults (100k capacity, 1% FPR).
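
For readers without the diff open, a minimal sketch of the scheme being described here, sized with the standard bloom filter formulas (class layout and attribute names are assumptions):

```python
import hashlib
import math

class BloomFilter:
    def __init__(self, capacity=100_000, error_rate=0.01):
        # Standard sizing: m = -n*ln(p)/ln(2)^2 bits, k = (m/n)*ln(2) hash functions.
        self.size = int(-capacity * math.log(error_rate) / (math.log(2) ** 2))  # ~958,506
        self.hash_count = max(1, int(self.size / capacity * math.log(2)))       # ~6
        self.array = [False] * self.size

    def _positions(self, item: str):
        # Salt the item with the hash index i and take md5, as noted above.
        for i in range(self.hash_count):
            digest = hashlib.md5((item + str(i)).encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.array[pos] = True

    def might_contain(self, item: str) -> bool:
        return all(self.array[pos] for pos in self._positions(item))
```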

check_and_add() correctness — Good that you verify against _hash_set (O(1) set lookup) rather than relying solely on the bloom filter. The bloom filter is correctly used as a fast negative check only. The fix note about this is appreciated.

⚠️ Concern — JSON serialization for the bloom filter:

json.dump({"array_size": self.size, "hash_count": self.hash_count, "array": self.array}, f)

With 100k capacity and 1% FPR, self.size = ~958,506 booleans. That's a 958k-element JSON array of true/false values → ~6-7 MB on disk. This works fine, but consider using a bitarray or struct.pack for the save/load path if performance becomes an issue at scale. For now it's fine — correctness over optimization.
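
A sketch of the bit-packed save path suggested here (the helper name and header layout are hypothetical):

```python
import struct

def save_bitpacked(bloom, path):
    # Pack 8 booleans per byte: ~958,506 bits become roughly 120 KB on disk
    # instead of a multi-megabyte JSON array of true/false values.
    packed = bytearray((bloom.size + 7) // 8)
    for i, bit in enumerate(bloom.array):
        if bit:
            packed[i // 8] |= 1 << (i % 8)
    with open(path, "wb") as f:
        f.write(struct.pack("<II", bloom.size, bloom.hash_count))  # small fixed header
        f.write(bytes(packed))
```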

⚠️ Concern — Race condition in mine() integration:
The check_and_add() call happens before processing, then flush() at the end. If the process crashes mid-mine, the in-memory bloom filter has entries that never persisted. On restart, those files will be re-processed (which is safe — worst case is redundant work). But if _save() happened to persist the JSON hashes dict without the bloom, the bloom becomes stale. Consider calling flush() periodically (e.g., every N files) rather than only at the end.
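
The periodic flush could look roughly like this (FLUSH_EVERY, files_to_mine, process, and sha256_file are illustrative names, not the PR's API):

```python
FLUSH_EVERY = 500  # hypothetical batch size

for n, path in enumerate(files_to_mine, start=1):
    content_hash = sha256_file(path)
    if hash_db.check_and_add(content_hash):
        continue  # byte-identical content already mined
    process(path)
    if n % FLUSH_EVERY == 0:
        hash_db.flush()  # bound the work lost to a mid-mine crash

hash_db.flush()  # final flush at the end of the run
```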

Missing: import json in miner.py — The diff adds import json to miner.py but I don't see it used there. The content_hash module has its own json import. Is the miner import leftover from an earlier iteration?

Test coverage — Good. test_different_files_same_content is the critical test for content-hash dedup and it's passing. The test_false_positive_handled test name is slightly misleading (it tests normal add flow, not actual false positive recovery), but the logic is sound.

Integration suggestion

This pairs well with #280's semantic dedup. Content hash catches byte-identical duplicates at I/O time (fast, O(1)), and semantic dedup catches near-identical content at embedding time (slower, but catches paraphrases). Consider noting this complementarity in the README.

🔭 Reviewed as part of the MemPalace-AGI integration project — autonomous research with perfect memory. Community interaction updates are posted regularly on the dashboard.

Use bloom filter as first check, then verify with _hash_set:
- If in bloom and in _hash_set: duplicate
- If in bloom but not in _hash_set: add to hash set (false positive case)
- If not in bloom: add to both bloom and hash set
Load hashes from content DB, then rebuild bloom filter from them
instead of loading stale bloom file
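
A minimal sketch of the load-then-rebuild approach this commit describes (the _load_hashes() helper stands in for however the hashes are persisted):

```python
def _load(self):
    # The persisted hashes are the source of truth; the bloom filter is
    # rebuilt from them on load, so it can never be stale relative to the DB.
    self._hash_set = set(self._load_hashes())
    self.bloom = BloomFilter()
    for content_hash in self._hash_set:
        self.bloom.add(content_hash)
```
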
@daedalus daedalus changed the title content hash DB with bloom filter Content hash DB with bloom filter Apr 10, 2026
- Use sqlite3 instead of JSON for persistent storage
- Add close() method to properly close DB connection
- Change file extension from .json to .db
- Add hashes property for backward compatibility
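
A sketch of what the sqlite3-backed store could look like under these commit notes (schema, table name, and method names are assumptions, not the PR's exact code):

```python
import sqlite3

class ContentHashDB:
    def __init__(self, db_path):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS content_hashes (hash TEXT PRIMARY KEY)"
        )

    def record(self, content_hash: str) -> None:
        try:
            self.conn.execute(
                "INSERT INTO content_hashes (hash) VALUES (?)", (content_hash,)
            )
            self.conn.commit()
        except sqlite3.IntegrityError:
            pass  # duplicate insert: the hash is already recorded

    @property
    def hashes(self) -> set:
        # Backward-compatible view of every recorded hash.
        return {row[0] for row in self.conn.execute("SELECT hash FROM content_hashes")}

    def close(self) -> None:
        self.conn.close()
```
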
@daedalus daedalus requested a review from web3guru888 April 10, 2026 15:34
@bensig bensig changed the base branch from main to develop April 11, 2026 22:22
@bensig bensig requested a review from igorls as a code owner April 11, 2026 22:22
@igorls igorls added the area/mining (File and conversation mining storage) label Apr 14, 2026
… stale rebuilds

- Add ContentHashDB for O(1) duplicate detection via BloomFilter
- Fix miners to check version gate BEFORE hash DB to allow stale rebuilds
- Add hash_db.record() after registry writes for short files
- Handle IntegrityError on duplicate insert in content_hash.py
- Fix dry_run path missing hash_db initialization in miner.py
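
The ordering fix in the first bullet might look roughly like this in the mining loop (registry.needs_rebuild and mine_file are illustrative names):

```python
if registry.needs_rebuild(path):
    # Version gate first: stale entries are always re-mined, even when their
    # content hash is already recorded in the hash DB.
    mine_file(path)
    hash_db.record(sha256_file(path))
elif not hash_db.check_and_add(sha256_file(path)):
    mine_file(path)  # version is current and the content is genuinely new
# else: version is current and the bytes are already mined -> skip
```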

Labels

area/mining (File and conversation mining storage)
