feat: auto-populate knowledge graph from palace drawers#434

Open
Nitrogonza9 wants to merge 3 commits into MemPalace:develop from Nitrogonza9:feat/kg-auto-extract

Conversation

@Nitrogonza9

Summary

The Knowledge Graph is powerful but empty for most users — it requires manual kg_add calls for every fact. This PR bridges that gap by automatically extracting relationships from existing palace drawers into the KG.

New module: mempalace/kg_extractor.py

Reads verbatim drawer content and extracts 8 relationship types using rule-based pattern matching:

| Pattern    | Example                          | KG Triple                      |
| ---------- | -------------------------------- | ------------------------------ |
| Employment | "Alice works at Acme Corp"       | Alice → works_at → Acme Corp   |
| Role       | "Bob is the lead engineer"       | Bob → has_role → lead engineer |
| Family     | "Alice's daughter Riley"         | Alice → parent_of → Riley      |
| Marriage   | "Dan is Carol's husband"         | Dan → married_to → Carol       |
| Tool usage | "We use PostgreSQL"              | team → uses → PostgreSQL       |
| Decisions  | "We switched to GraphQL"         | team → uses → GraphQL          |
| Creation   | "Alice created the auth module"  | Alice → created → auth module  |
| Interests  | "Max loves chess"                | Max → loves → chess            |
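For readers unfamiliar with rule-based extraction, a pass of this kind can be sketched with plain regexes. The patterns below are simplified stand-ins for illustration, not the actual kg_extractor.py patterns:

```python
import re

# Simplified illustrative patterns; the shipped module uses eight
# hand-tuned rules that are more careful about names and possessives.
PATTERNS = [
    # "Alice works at Acme Corp" -> (Alice, works_at, Acme Corp)
    (re.compile(r"(?P<s>[A-Z][\w ]*?) works at (?P<o>[A-Z][\w ]+)"), "works_at"),
    # "Max loves chess" -> (Max, loves, chess)
    (re.compile(r"(?P<s>[A-Z][\w]*) loves (?P<o>[\w ]+)"), "loves"),
]

def extract_triples(text: str) -> list[tuple[str, str, str]]:
    """Return (subject, predicate, object) triples found in text."""
    triples = []
    for pattern, predicate in PATTERNS:
        for m in pattern.finditer(text):
            triples.append((m.group("s").strip(), predicate, m.group("o").strip()))
    return triples
```

A production version needs more guards (multi-word names, possessives, sentence boundaries), which is presumably why the module ships a fixed set of tuned rules rather than a generic matcher.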

Integration points

  • CLI: mempalace extract-kg [--wing X] [--room Y] [--dry-run]
  • MCP tool: mempalace_kg_extract — AI agents can trigger extraction
  • Idempotent: existing triples are detected and skipped (safe to re-run)
  • Source provenance: each triple records which drawer file it came from
  • Batched reads: handles large palaces without OOM (500 drawers/batch)
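The batched-reads point can be sketched as a chunking generator. The helper name and shape are assumptions; the module may organize its 500-drawer batches differently:

```python
from collections.abc import Iterable, Iterator

BATCH_SIZE = 500  # matches the batch size stated in the PR description

def batched(items: Iterable, size: int = BATCH_SIZE) -> Iterator[list]:
    """Yield successive fixed-size chunks so a large palace is never
    held in memory all at once."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # emit the final partial chunk
        yield batch
```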

Why this matters

This connects the two storage systems: drawers (verbatim text in ChromaDB) and the knowledge graph (structured facts in SQLite). Before this PR, users had thousands of memories but an empty KG. After: one command populates it.

mempalace mine ~/projects/myapp        # stores verbatim memories
mempalace extract-kg                    # auto-populates the KG
mempalace extract-kg --dry-run          # preview what would be extracted

Test plan

  • pytest tests/test_kg_extractor.py -v — 29 tests pass
  • pytest tests/ -v — full suite 563 passed, 0 failed
  • ruff check — no lint errors
  • No new dependencies
  • No API keys or network access needed
  • Dry-run mode tested (no KG writes)
  • Idempotency tested (running twice doesn't duplicate)
  • Wing/room filtering tested

🤖 Generated with Claude Code

New kg_extractor.py module that reads verbatim drawer content and
automatically extracts entity relationships into the knowledge graph.
Supports 8 relationship types: employment, roles, family, marriage,
tool usage, tech decisions, creation/authorship, and interests.

Pipeline: read drawers in batches → regex pattern matching → deduplicate
→ write to KG with source provenance. Fully idempotent — existing
triples are detected and skipped.

New CLI command: mempalace extract-kg [--wing X] [--room Y] [--dry-run]
New MCP tool: mempalace_kg_extract for AI agents to trigger extraction.

This bridges the gap between stored memories (drawers) and structured
knowledge (KG) — users no longer need to manually call kg_add for
every fact. Zero API calls, zero new dependencies.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@Nitrogonza9
Author

Why I contributed this

I'm an intensive AI user running 7 large apps daily. The biggest friction I experience is that AI memory disappears between sessions — every conversation starts from zero. MemPalace is the best open-source answer to that problem I've found.

But the Knowledge Graph was empty for everyone. You had thousands of memories in drawers but zero structured facts. That's what this PR fixes — and it's part of a larger batch of contributions I submitted today:

6 PRs, ~3,000 lines, 101 new tests, zero new dependencies.

I believe AI memory should be free, local, and belong to everyone — not locked behind subscriptions or cloud APIs. That's why MemPalace matters, and that's why I'm contributing. Happy to iterate on any of these based on your feedback.

— Gonzalo


@web3guru888 web3guru888 left a comment


Review: Auto-populate KG from palace drawers

This is exactly the bridge that makes the KG useful out of the box. We built something equivalent in our integration (knowledge_graph_bridge.py) that populates 710 entities and 1,014 triples from 208 discovery records across five domains, so I can share what we learned at that scale.

What works well:

  • The 8 pattern types cover the right surface area for personal/team knowledge. Employment, roles, family, tools, creation — these are what real palaces contain.
  • Idempotent extraction with dedup is crucial. We re-run extraction after every discovery cycle and early on we had triple explosion before adding dedup. Your _dedupe_triples() on (subject.lower(), predicate, object.lower()) handles this cleanly.
  • Batched reads at 500/batch — sensible for large palaces. We hit OOM at ~2,000 drawers without batching.
  • Source provenance (source_file) is excellent. When a contradiction is flagged, you need to trace which drawer made the claim.
  • The --dry-run flag is a great UX choice.
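The exact-match dedup key mentioned above can be sketched as follows; this is an illustration of the described behavior, not the actual _dedupe_triples() code:

```python
# Sketch of exact-match dedup on a case-folded key, as described in the
# review; the real _dedupe_triples() may differ in detail.
def dedupe_triples(
    triples: list[tuple[str, str, str]],
) -> list[tuple[str, str, str]]:
    """Keep the first occurrence of each (subject, predicate, object),
    comparing subject and object case-insensitively."""
    seen: set[tuple[str, str, str]] = set()
    unique = []
    for subject, predicate, obj in triples:
        key = (subject.lower(), predicate, obj.lower())
        if key not in seen:
            seen.add(key)
            unique.append((subject, predicate, obj))
    return unique
```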

Observations from running at scale:

  1. Dedup threshold: Your dedup is exact-match after lowercasing. At 710 entities, we found we also needed fuzzy dedup — "PostgreSQL", "Postgres", "postgres db" all refer to the same entity. We use a tiered approach: hard dedup at cosine 0.86 (definitely same entity), soft dedup at 0.55 (flag for review). Something to consider as palaces grow, though exact-match is a reasonable v1.

  2. kg.stats() per-triple for dedup detection: In the write loop, you call kg.stats() before and after each add_triple() to detect whether a triple was actually new. With hundreds of triples, that's 2N SQLite queries just for dedup bookkeeping. Consider checking existence before writing instead — e.g., a kg.triple_exists(s, p, o) method would be much cheaper. Or if add_triple could return a boolean indicating whether it was new, that would eliminate the stats calls entirely.

  3. Entity type inference: Your add_triple writes triples but I don't see where extracted entities get typed (person vs company vs tool). The KG's add_entity(name, entity_type=...) supports types. You could infer: employment pattern → subject=person, object=company; tool pattern → object=tool. This makes KG queries by entity type much more useful downstream.

  4. Cross-drawer entity resolution: "Alice" in drawer A and "Alice Chen" in drawer B — are they the same entity? At 710 entities we hit this regularly. Not a blocker for v1, but worth a TODO comment.
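On point 2, the "add_triple returns a boolean" variant can be sketched in a few lines. Schema and names here are illustrative, not MemPalace's actual KG schema; COLLATE NOCASE makes the uniqueness check case-insensitive, matching the dedup key:

```python
import sqlite3

def make_kg(path: str = ":memory:") -> sqlite3.Connection:
    """Create an illustrative triples table with case-insensitive
    uniqueness (assumed schema, not the real one)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS triples ("
        "  subject   TEXT COLLATE NOCASE,"
        "  predicate TEXT,"
        "  object    TEXT COLLATE NOCASE,"
        "  UNIQUE (subject, predicate, object))"
    )
    return conn

def add_triple(conn: sqlite3.Connection, s: str, p: str, o: str) -> bool:
    """Insert a triple; return True only if it was actually new."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO triples (subject, predicate, object) "
        "VALUES (?, ?, ?)",
        (s, p, o),
    )
    return cur.rowcount == 1  # 0 when the UNIQUE constraint ignored the row
```

This removes the per-triple stats() calls entirely: the database enforces uniqueness and the return value carries the "was it new" signal.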

Minor:

  • The _CREATION_PATTERNS regex uses a non-greedy .+? for the object, which is good, but it is terminated by [.,;]|$. If the sentence ends without punctuation (common in drawer notes), the match extends to end-of-string and _clean_object truncates at 60 chars. That works, but it's worth a test case for long unpunctuated lines.

29 tests, clean lint, no new deps — well structured. This and #433 compose naturally: extract KG with this PR, then fact-check new claims against it.

@Nitrogonza9
Author

Great review @web3guru888 — real production experience shows. Let me address each point:

kg.stats() per-triple — fixing now. You're right, 2N SQLite queries for dedup bookkeeping is wasteful. I'll switch to checking triple count before/after the full batch rather than per-triple, or better yet, compare the triple_id against a pre-fetched set. Pushing the fix shortly.

Entity type inference — implementing. Good catch. The patterns already know the type: employment → (person, company), tool → (team, tool), family → (person, person), role → (person, role). I'll pass entity_type to add_entity() during extraction. Low effort, high value.
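That pattern-to-type mapping might look roughly like this (predicate and type names are illustrative, not the PR's actual code):

```python
# Illustrative mapping from matched predicate to (subject_type,
# object_type); the real extractor may use different names.
PATTERN_ENTITY_TYPES = {
    "works_at":   ("person", "company"),
    "has_role":   ("person", "role"),
    "uses":       ("team", "tool"),
    "created":    ("person", "artifact"),
    "loves":      ("person", "interest"),
    "parent_of":  ("person", "person"),
    "married_to": ("person", "person"),
}

def infer_entity_types(predicate: str) -> tuple[str, str]:
    """Return (subject_type, object_type) for a predicate, defaulting
    to 'unknown' when the pattern carries no type information."""
    return PATTERN_ENTITY_TYPES.get(predicate, ("unknown", "unknown"))
```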

Fuzzy entity dedup ("PostgreSQL" vs "Postgres" vs "postgres db") — agreed this is needed at scale. For v1 I'll add a TODO comment with your tiered threshold approach (0.86 hard / 0.55 soft) as the future direction. Implementing it properly needs embedding similarity which touches the "no new deps" constraint, so it's a separate PR.

Cross-drawer entity resolution ("Alice" vs "Alice Chen") — same story, noted as TODO. The entity_registry.py already has some disambiguation infrastructure that could be wired in.

Long unpunctuated lines — adding a test case for this edge.

Re: composition with #433 — exactly the intent. Extract with this PR, fact-check with #433, the KG becomes both self-populating and self-correcting.

Pushing fixes now.

— Gonzalo

Address review feedback from @web3guru888:

- Replace per-triple kg.stats() calls with single before/after count
  (eliminates 2N SQLite queries during extraction)
- Add entity type inference: employment → person/company, tool → tool,
  role → person/role, creation → person, interest → person
- Add TODO for fuzzy entity dedup with tiered thresholds
- Add test for long unpunctuated lines (creation pattern truncation)
- Add test for entity type inference

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@web3guru888

The before/after triple count check is a clean approach — avoids the per-triple query overhead while still giving you dedup signal. One small thing: make sure the count check is inside a transaction with the batch add, otherwise a concurrent write between count() and the upsert can make the comparison unreliable.

The entity type inference from pattern context is a solid v1. And the tiered threshold TODO is exactly the right call — embedding similarity is a natural v2 once the pattern-based extraction is stable.

The #433 composition framing (extract → fact-check → self-correcting KG) is compelling. If both PRs land, auto_kg generating the entities and check_facts validating them creates a quality loop without any new infrastructure. Worth calling that out in the PR description for reviewers to understand the intent.

Per @web3guru888 review on PR MemPalace#434: replace before/after stats() count
with a pre-fetched set of existing triple keys. The previous approach
could be unreliable under concurrent writes between count() and add.

Pre-fetching the keys once at the start of the extraction batch creates
a consistent snapshot. We update the in-memory set as new triples are
added, so duplicate detection within the batch also works correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
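The approach this commit describes (snapshot the existing keys once, update the in-memory set as triples are added) can be sketched as follows, with the snapshot and the writes held inside a single BEGIN IMMEDIATE transaction so a concurrent writer cannot invalidate the snapshot. Table and column names are assumptions for illustration, not the actual KG schema:

```python
import sqlite3

def extract_batch_atomically(conn: sqlite3.Connection, triples) -> int:
    """Insert only new triples, using a pre-fetched key set taken under
    the same write transaction as the inserts. Sketch only."""
    conn.isolation_level = None      # manual transaction control
    cur = conn.cursor()
    cur.execute("BEGIN IMMEDIATE")   # take the write lock before reading
    try:
        existing = {
            (s.lower(), p, o.lower())
            for s, p, o in cur.execute(
                "SELECT subject, predicate, object FROM triples"
            )
        }
        added = 0
        for s, p, o in triples:
            key = (s.lower(), p, o.lower())
            if key not in existing:
                cur.execute(
                    "INSERT INTO triples (subject, predicate, object) "
                    "VALUES (?, ?, ?)",
                    (s, p, o),
                )
                existing.add(key)    # in-batch duplicates are caught too
                added += 1
        cur.execute("COMMIT")
    except Exception:
        cur.execute("ROLLBACK")
        raise
    return added
```

BEGIN IMMEDIATE acquires SQLite's write lock up front, so both the SELECT snapshot and the inserts see one consistent state.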
@bensig bensig changed the base branch from main to develop April 11, 2026 22:22
@igorls igorls added area/cli CLI commands area/kg Knowledge graph area/mcp MCP server and tools enhancement New feature or request labels Apr 14, 2026