Content hash DB with bloom filter #424
Conversation
- Fix check_and_add() bug: verify with _hash_set for O(1) lookup before bloom
- Add flush() method for batch saving (saves to disk only when needed)
- Decouple: move BloomFilter and ContentHashDB to dedicated content_hash.py
- Update tests to use pytest tmp_path fixture
- Fix exception handling to (OSError, IOError)
- Add file_already_mined() fallback to ChromaDB as safety net
web3guru888 left a comment
Review: Content Hash DB with Bloom Filter
This directly parallels what we built in our integration, so I can offer a detailed comparison.
Architecture Comparison
Your approach: SHA256 content hash → bloom filter fast-path → JSON-backed ContentHashDB → ChromaDB fallback
Our approach: Content hash (SHA256) → bloom filter pre-check → tiered cosine similarity dedup (hard=0.86, soft=0.55 thresholds) scoped to wing+room
The key difference: your approach catches exact content duplicates at the file level (same bytes = same hash), while ours catches semantic near-duplicates at the chunk level (different words, same meaning). Both are needed for production — yours is the I/O optimization layer, ours is the embedding quality layer.
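For readers outside the project, the file-level layer boils down to hashing raw bytes before any parsing. A minimal sketch, not the PR's actual code (hash_file and the chunk size are illustrative names):

```python
import hashlib
from pathlib import Path

def hash_file(path: Path) -> str:
    """SHA-256 hex digest of a file's raw bytes (illustrative helper)."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        # Stream in chunks so large files never need to fit in memory.
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()
```

Two files with identical bytes produce identical digests, which is exactly what this layer catches; paraphrased content is left to the semantic dedup layer.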
Code Review
BloomFilter implementation — Clean. Using md5(item + str(i)) for hash functions is the standard "double hashing with string salt" approach. Reasonable defaults (100k capacity, 1% FPR).
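If that reading is right, the per-item bit indexes come out of something like the sketch below (not the PR's code; size and hash_count are the usual bloom-filter parameters):

```python
import hashlib

def bloom_indexes(item: str, hash_count: int, size: int) -> list[int]:
    # Simulate k independent hash functions by salting the item with the
    # function index, then folding md5 into the bit-array range.
    return [
        int(hashlib.md5((item + str(i)).encode()).hexdigest(), 16) % size
        for i in range(hash_count)
    ]
```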
check_and_add() correctness — Good that you verify against _hash_set (O(1) set lookup) rather than relying solely on the bloom filter. The bloom filter is correctly used as a fast negative check only. The fix note about this is appreciated.
json.dump({"array_size": self.size, "hash_count": self.hash_count, "array": self.array}, f)With 100k capacity and 1% FPR, self.size = ~958,506 booleans. That's a 958k-element JSON array of true/false values → ~6-7 MB on disk. This works fine, but consider using a bitarray or struct.pack for the save/load path if performance becomes an issue at scale. For now it's fine — correctness over optimization.
mine() integration:
The check_and_add() call happens before processing, then flush() at the end. If the process crashes mid-mine, the in-memory bloom filter has entries that never persisted. On restart, those files will be re-processed (which is safe — worst case is redundant work). But if _save() happened to persist the JSON hashes dict without the bloom, the bloom becomes stale. Consider calling flush() periodically (e.g., every N files) rather than only at the end.
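A sketch of that periodic-flush idea, reusing the hash_file helper sketched earlier and assuming check_and_add() returns True for an already-seen hash; FLUSH_EVERY and the loop shape are illustrative, not the PR's actual mine() code:

```python
FLUSH_EVERY = 200  # illustrative batch size

def mine_files(files, hash_db, process):
    seen_since_flush = 0
    for path in files:
        digest = hash_file(path)           # SHA-256 of the file's raw bytes
        if hash_db.check_and_add(digest):  # assumed: True means duplicate
            continue
        process(path)
        seen_since_flush += 1
        if seen_since_flush >= FLUSH_EVERY:
            hash_db.flush()                # bound the work lost on a crash
            seen_since_flush = 0
    hash_db.flush()                        # persist the final partial batch
```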
Missing: import json in miner.py — The diff adds import json to miner.py but I don't see it used there. The content_hash module has its own json import. Is the miner import leftover from an earlier iteration?
Test coverage — Good. test_different_files_same_content is the critical test for content-hash dedup and it's passing. The test_false_positive_handled test name is slightly misleading (it tests normal add flow, not actual false positive recovery), but the logic is sound.
Integration suggestion
This pairs well with #280's semantic dedup. Content hash catches byte-identical duplicates at I/O time (fast, O(1)), and semantic dedup catches near-identical content at embedding time (slower, but catches paraphrases). Consider noting this complementarity in the README.
🔭 Reviewed as part of the MemPalace-AGI integration project — autonomous research with perfect memory. Community interaction updates are posted regularly on the dashboard.
Use bloom filter as first check, then verify with _hash_set:
- If in bloom and in _hash_set: duplicate
- If in bloom but not in _hash_set: add to hash set (false positive case)
- If not in bloom: add to both bloom and hash set
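In code, that decision tree reads roughly as follows; bloom.contains()/bloom.add() and the True-for-duplicate return value are assumptions layered on the commit message, not the PR's exact implementation:

```python
def check_and_add(self, content_hash: str) -> bool:
    """Return True if content_hash was seen before, otherwise record it."""
    if self.bloom.contains(content_hash):
        if content_hash in self._hash_set:
            return True                    # in bloom and in set: real duplicate
        self._hash_set.add(content_hash)   # bloom false positive: record it
        return False
    self.bloom.add(content_hash)           # definitely new: no false negatives
    self._hash_set.add(content_hash)
    return False
```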
Load hashes from content DB, then rebuild bloom filter from them instead of loading stale bloom file
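A sketch of that rebuild step, assuming the BloomFilter constructor takes the capacity/error-rate defaults mentioned in the review:

```python
def rebuild_bloom(known_hashes, capacity=100_000, error_rate=0.01):
    # Rebuilding from the authoritative content DB means the bloom filter
    # can never be staler than the hashes it is meant to summarize.
    bloom = BloomFilter(capacity=capacity, error_rate=error_rate)
    for content_hash in known_hashes:
        bloom.add(content_hash)
    return bloom
```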
- Use sqlite3 instead of JSON for persistent storage
- Add close() method to properly close DB connection
- Change file extension from .json to .db
- Add hashes property for backward compatibility
… stale rebuilds
- Add ContentHashDB for O(1) duplicate detection via BloomFilter
- Fix miners to check version gate BEFORE hash DB to allow stale rebuilds
- Add hash_db.record() after registry writes for short files
- Handle IntegrityError on duplicate insert in content_hash.py
- Fix dry_run path missing hash_db initialization in miner.py
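Pulling the sqlite3 changes together, a minimal skeleton might look like the following; the table name, columns, and method signatures are guesses from the commit messages, not the actual content_hash.py:

```python
import sqlite3

class ContentHashDB:
    def __init__(self, db_path: str):
        self._conn = sqlite3.connect(db_path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS content_hashes ("
            "hash TEXT PRIMARY KEY, file_path TEXT)"
        )
        self._conn.commit()

    def record(self, content_hash: str, file_path: str) -> None:
        try:
            self._conn.execute(
                "INSERT INTO content_hashes (hash, file_path) VALUES (?, ?)",
                (content_hash, file_path),
            )
            self._conn.commit()
        except sqlite3.IntegrityError:
            pass  # duplicate primary key: hash already recorded, nothing to do

    @property
    def hashes(self) -> set:
        # Backward-compatible view for callers that expect the old shape.
        return {row[0] for row in
                self._conn.execute("SELECT hash FROM content_hashes")}

    def close(self) -> None:
        self._conn.close()
```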
Summary
Fixes the content hash database with bloom filter feature from PR #282 with the following improvements.

Changes
- check_and_add() now correctly verifies content hashes using _hash_set for O(1) lookup before relying on the bloom filter (prevents false-positive duplicates)
- Added flush() method to batch-save the bloom filter to disk only after batch operations
- Added file_already_mined() fallback to ChromaDB when the JSON file is deleted or corrupted
- Moved the BloomFilter and ContentHashDB classes to a dedicated content_hash.py module
- Updated tests to use the pytest tmp_path fixture
- Changed exception handling from except Exception to (OSError, IOError)

Why this matters
Previously, mempalace mine would re-read files even if they had not changed. Now it:
- computes a SHA-256 content hash for each file
- checks that hash against the bloom filter and the content hash DB (with a ChromaDB fallback)
- skips files whose content has already been mined

This makes re-mining projects much faster and avoids redundant I/O.