feat: auto-populate knowledge graph from palace drawers #434
Nitrogonza9 wants to merge 3 commits into MemPalace:develop from
Conversation
New `kg_extractor.py` module that reads verbatim drawer content and automatically extracts entity relationships into the knowledge graph. Supports 8 relationship types: employment, roles, family, marriage, tool usage, tech decisions, creation/authorship, and interests.

Pipeline: read drawers in batches → regex pattern matching → deduplicate → write to KG with source provenance. Fully idempotent — existing triples are detected and skipped.

New CLI command: `mempalace extract-kg [--wing X] [--room Y] [--dry-run]`
New MCP tool: `mempalace_kg_extract` for AI agents to trigger extraction.

This bridges the gap between stored memories (drawers) and structured knowledge (KG) — users no longer need to manually call `kg_add` for every fact. Zero API calls, zero new dependencies.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
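The read → regex → dedupe → write pipeline can be illustrated for a single relationship type. This is a hypothetical sketch, not the PR's actual code: the pattern, the `extract_employment` name, and the triple shape are all illustrative assumptions.

```python
import re

# Hypothetical sketch of one extraction pattern (employment).
# The real patterns in kg_extractor.py may differ in shape and naming.
_EMPLOYMENT_PATTERN = re.compile(
    r"(?P<subject>[A-Z]\w+(?: [A-Z]\w+)*) works (?:at|for) "
    r"(?P<object>[A-Z]\w+(?: [A-Z]\w+)*)"
)

def extract_employment(text, source_file):
    """Yield (subject, predicate, object, source) tuples from drawer text.

    The source_file element is the provenance field the review praises:
    it records which drawer made the claim.
    """
    for m in _EMPLOYMENT_PATTERN.finditer(text):
        yield (m.group("subject"), "works_at", m.group("object"), source_file)
```

A rule-based pass like this costs zero API calls, which is consistent with the "zero new dependencies" constraint stated above.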
Why I contributed this

I'm an intensive AI user running 7 large apps daily. The biggest friction I experience is that AI memory disappears between sessions — every conversation starts from zero. MemPalace is the best open-source answer to that problem I've found. But the Knowledge Graph was empty for everyone. You had thousands of memories in drawers but zero structured facts. That's what this PR fixes — and it's part of a larger batch of contributions I submitted today:
6 PRs, ~3,000 lines, 101 new tests, zero new dependencies. I believe AI memory should be free, local, and belong to everyone — not locked behind subscriptions or cloud APIs. That's why MemPalace matters, and that's why I'm contributing. Happy to iterate on any of these based on your feedback. — Gonzalo
web3guru888
left a comment
Review: Auto-populate KG from palace drawers
This is exactly the bridge that makes the KG useful out of the box. We built something equivalent in our integration (knowledge_graph_bridge.py) that populates 710 entities and 1,014 triples from 208 discovery records across five domains, so I can share what we learned at that scale.
What works well:
- The 8 pattern types cover the right surface area for personal/team knowledge. Employment, roles, family, tools, creation — these are what real palaces contain.
- Idempotent extraction with dedup is crucial. We re-run extraction after every discovery cycle, and early on we had triple explosion before adding dedup. Your `_dedupe_triples()` on `(subject.lower(), predicate, object.lower())` handles this cleanly.
- Batched reads at 500/batch — sensible for large palaces. We hit OOM at ~2,000 drawers without batching.
- Source provenance (`source_file`) is excellent. When a contradiction is flagged, you need to trace which drawer made the claim.
- The `--dry-run` flag is a great UX choice.
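The casing-insensitive dedup keying the review describes can be sketched as follows. The key shape `(subject.lower(), predicate, object.lower())` is taken from the review itself; the function body is a plausible reconstruction, not the PR's `_dedupe_triples()` verbatim.

```python
def dedupe_triples(triples):
    """Drop repeats that differ only in subject/object casing.

    Keys on (subject.lower(), predicate, object.lower()) and keeps the
    first occurrence, preserving input order — a reconstruction of the
    behavior attributed to _dedupe_triples() in the review.
    """
    seen = set()
    out = []
    for subject, predicate, obj in triples:
        key = (subject.lower(), predicate, obj.lower())
        if key not in seen:
            seen.add(key)
            out.append((subject, predicate, obj))
    return out
```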
Observations from running at scale:
- **Dedup threshold:** Your dedup is exact-match after lowercasing. At 710 entities, we found we also needed fuzzy dedup — "PostgreSQL", "Postgres", and "postgres db" all refer to the same entity. We use a tiered approach: hard dedup at cosine 0.86 (definitely the same entity), soft dedup at 0.55 (flag for review). Something to consider as palaces grow, though exact match is a reasonable v1.
- **`kg.stats()` per-triple for dedup detection:** In the write loop, you call `kg.stats()` before and after each `add_triple()` to detect whether a triple was actually new. With hundreds of triples, that's 2N SQLite queries just for dedup bookkeeping. Consider checking existence before writing instead — e.g., a `kg.triple_exists(s, p, o)` method would be much cheaper. Or if `add_triple` could return a boolean indicating whether it was new, that would eliminate the stats calls entirely.
- **Entity type inference:** Your `add_triple` writes triples, but I don't see where extracted entities get typed (person vs. company vs. tool). The KG's `add_entity(name, entity_type=...)` supports types. You could infer: employment pattern → subject=person, object=company; tool pattern → object=tool. This makes KG queries by entity type much more useful downstream.
- **Cross-drawer entity resolution:** "Alice" in drawer A and "Alice Chen" in drawer B — are they the same entity? At 710 entities we hit this regularly. Not a blocker for v1, but worth a TODO comment.
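The tiered fuzzy-dedup decision the reviewer describes (hard merge at cosine 0.86, review queue at 0.55) can be sketched with a plain cosine similarity. The embeddings would come from whatever model a future version adopts; here the vectors are supplied directly, and `classify_pair` is an illustrative name.

```python
import math

HARD_THRESHOLD = 0.86  # definitely the same entity: merge automatically
SOFT_THRESHOLD = 0.55  # possibly the same entity: flag for human review

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def classify_pair(vec_a, vec_b):
    """Tiered dedup decision for two entity embeddings (sketch).

    Returns "merge", "review", or "distinct" per the thresholds
    described in the review; a real pipeline would supply the vectors
    from an embedding model, which is why this stays a v2 item under
    the "no new deps" constraint.
    """
    sim = cosine(vec_a, vec_b)
    if sim >= HARD_THRESHOLD:
        return "merge"
    if sim >= SOFT_THRESHOLD:
        return "review"
    return "distinct"
```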
Minor:
- The `_CREATION_PATTERNS` regex uses a lazy `.+?` for the object, which is good, terminated by `[.,;]|$`. If the sentence ends without punctuation (common in drawer notes), the match extends to end-of-string and `_clean_object` truncates at 60 chars. Works, but worth a test case for long unpunctuated lines.
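The suggested edge-case test might look like the sketch below. The pattern and `clean_object` are stand-ins that reproduce the reviewed behavior (lazy `.+?` terminated by `[.,;]|$`, truncation at 60 chars); they are not the PR's actual `_CREATION_PATTERNS` or `_clean_object`.

```python
import re

# Hypothetical stand-ins illustrating the reviewed behavior.
_CREATION = re.compile(r"(?:built|created|wrote) (?P<obj>.+?)(?:[.,;]|$)")
MAX_OBJECT_LEN = 60

def clean_object(obj):
    """Trim whitespace and cap the extracted object at 60 chars."""
    return obj.strip()[:MAX_OBJECT_LEN]

def test_long_unpunctuated_line_truncates_at_60():
    # No closing punctuation, so the lazy .+? runs to end-of-string
    # and truncation is what keeps the object bounded.
    line = "Bob created " + "a very long project name " * 5
    m = _CREATION.search(line)
    assert m is not None
    assert len(clean_object(m.group("obj"))) == MAX_OBJECT_LEN

def test_punctuated_line_stops_at_terminator():
    m = _CREATION.search("Bob wrote the parser.")
    assert clean_object(m.group("obj")) == "the parser"
```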
29 tests, clean lint, no new deps — well structured. This and #433 compose naturally: extract KG with this PR, then fact-check new claims against it.
Great review @web3guru888 — real production experience shows. Let me address each point:

- **Entity type inference** — implementing. Good catch. The patterns already know the type: employment → (person, company), tool → (team, tool), family → (person, person), role → (person, role). I'll pass the inferred types through when writing entities.
- **Fuzzy entity dedup** ("PostgreSQL" vs "Postgres" vs "postgres db") — agreed this is needed at scale. For v1 I'll add a TODO comment with your tiered threshold approach (0.86 hard / 0.55 soft) as the future direction. Implementing it properly needs embedding similarity, which touches the "no new deps" constraint, so it's a separate PR.
- **Cross-drawer entity resolution** ("Alice" vs "Alice Chen") — same story, noted as TODO. The entity_registry.py already has some disambiguation infrastructure that could be wired in.
- **Long unpunctuated lines** — adding a test case for this edge.

Re: composition with #433 — exactly the intent. Extract with this PR, fact-check with #433, and the KG becomes both self-populating and self-correcting.

Pushing fixes now. — Gonzalo
Address review feedback from @web3guru888:
- Replace per-triple kg.stats() calls with single before/after count (eliminates 2N SQLite queries during extraction)
- Add entity type inference: employment → person/company, tool → tool, role → person/role, creation → person, interest → person
- Add TODO for fuzzy entity dedup with tiered thresholds
- Add test for long unpunctuated lines (creation pattern truncation)
- Add test for entity type inference

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
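The pattern-to-type mapping described in the commit can be sketched as a lookup table. The dictionary contents follow the mapping stated in the thread (employment → person/company, tool → team/tool, etc.); entries not spelled out there (e.g. "work", "topic") are illustrative guesses, and `infer_entity_types` is a hypothetical name.

```python
# Hypothetical mapping from pattern type to (subject_type, object_type).
# Pairs stated in the thread are used where given; others are guesses.
PATTERN_ENTITY_TYPES = {
    "employment": ("person", "company"),
    "role":       ("person", "role"),
    "family":     ("person", "person"),
    "marriage":   ("person", "person"),
    "tool":       ("team", "tool"),
    "creation":   ("person", "work"),
    "interest":   ("person", "topic"),
}

def infer_entity_types(pattern_type, subject, obj):
    """Return [(name, entity_type), ...] suitable for typed entity writes.

    Unknown pattern types yield None types, leaving the entities untyped
    rather than guessing.
    """
    s_type, o_type = PATTERN_ENTITY_TYPES.get(pattern_type, (None, None))
    return [(subject, s_type), (obj, o_type)]
```

Because the pattern that matched already encodes the relationship semantics, no extra inference pass is needed — the type comes for free at extraction time.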
The before/after triple count check is a clean approach — it avoids the per-triple query overhead while still giving you a dedup signal. One small thing: make sure the count check runs inside a transaction with the batch add; otherwise a concurrent write between count() and the upsert can make the comparison unreliable. The entity type inference from pattern context is a solid v1. And the tiered threshold TODO is exactly the right call — embedding similarity is a natural v2 once the pattern-based extraction is stable. The #433 composition framing (extract → fact-check → self-correcting KG) is compelling. If both PRs land, the KG becomes both self-populating and self-correcting out of the box.
Per @web3guru888 review on PR MemPalace#434: replace before/after stats() count with a pre-fetched set of existing triple keys. The previous approach could be unreliable under concurrent writes between count() and add. Pre-fetching the keys once at the start of the extraction batch creates a consistent snapshot. We update the in-memory set as new triples are added, so duplicate detection within the batch also works correctly. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Summary
The Knowledge Graph is powerful but empty for most users — it requires manual `kg_add` calls for every fact. This PR bridges that gap by automatically extracting relationships from existing palace drawers into the KG.

New module: `mempalace/kg_extractor.py`

Reads verbatim drawer content and extracts 8 relationship types (employment, roles, family, marriage, tool usage, tech decisions, creation/authorship, and interests) using rule-based pattern matching.
Integration points

- CLI: `mempalace extract-kg [--wing X] [--room Y] [--dry-run]`
- MCP tool: `mempalace_kg_extract` — AI agents can trigger extraction

Why this matters
This connects the two storage systems: drawers (verbatim text in ChromaDB) and the knowledge graph (structured facts in SQLite). Before this PR, users had thousands of memories but an empty KG. After: one command populates it.
Test plan

- `pytest tests/test_kg_extractor.py -v` — 29 tests pass
- `pytest tests/ -v` — full suite: 563 passed, 0 failed
- `ruff check` — no lint errors

🤖 Generated with Claude Code