feat: add AAAK expand for pre-embedding semantic quality by Nitrogonza9 · Pull Request #432 · MemPalace/mempalace

Nitrogonza9 · 2026-04-09T20:33:20Z

Summary

New expand() method on Dialect class — converts AAAK-compressed text back into natural-language fragments suitable for vector embedding. Reverses entity codes to names, reconstructs topics, preserves key sentences, and maps emotion codes to readable words.
New looks_like_aaak() static heuristic — detects AAAK format by checking for pipe-separated fields with digit-colon prefixes.
Wire into mcp_server.py diary write — AAAK entries are now expanded before ChromaDB embedding. Original compressed form is preserved in aaak_compressed metadata field. Plain text entries pass through unchanged (backward compatible).
Addresses the TODO at mcp_server.py:511 — "Future versions should expand AAAK before embedding to improve semantic search quality"

This should improve AAAK mode search quality (currently 84.2% R@5 vs 96.6% raw on LongMemEval) by embedding semantically rich text rather than compressed symbols.

Test plan

pytest tests/test_dialect.py -v — 29 tests pass (17 existing + 12 new)
pytest tests/ -v — full suite 545 passed, 0 failed
ruff check — no lint errors
No new dependencies added

🤖 Generated with Claude Code

Add expand() method to Dialect that converts AAAK-compressed text back into natural-language fragments suitable for semantic embedding. Add looks_like_aaak() heuristic to detect AAAK format.

Wire into mcp_server diary_write: when an entry is AAAK-compressed, expand it before passing to ChromaDB for embedding while preserving the original compressed form in aaak_compressed metadata field. Plain text entries pass through unchanged (backward compatible).

This addresses the TODO at mcp_server.py:511 and should improve AAAK mode search quality (currently 84.2% vs 96.6% raw on LongMemEval).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

web3guru888

Review: AAAK Expand for Pre-Embedding Semantic Quality

Good to see this from you again, @Nitrogonza9 — we reviewed your #433 (contradiction detection) and #434 (auto-KG) previously. This one addresses a real measured problem (84.2% vs 96.6% R@5 for AAAK vs raw on LongMemEval).

`expand()` Method

The reconstruction logic is solid. Walking the decoded structure and reassembling human-readable fragments is the right approach for embedding quality. Specific notes:

Entity code reversal:

for name, code in self.entity_codes.items():
    if not name.islower() and code not in code_to_name:
        code_to_name[code] = name

The not name.islower() filter ensures you prefer proper names ("Alice") over lowercase aliases. Good heuristic. The code not in code_to_name gives first-registered-name priority, which is reasonable.
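To make the priority rule concrete, here is a minimal standalone sketch of the same loop. The `build_code_to_name` helper and the sample registry are hypothetical and exist only for illustration; the PR performs this inside `Dialect.expand()` using `self.entity_codes`.

```python
def build_code_to_name(entity_codes: dict[str, str]) -> dict[str, str]:
    """Invert a name -> code mapping, preferring names that are not all-lowercase."""
    code_to_name: dict[str, str] = {}
    for name, code in entity_codes.items():
        # Skip all-lowercase aliases and keep the first remaining name per code.
        if not name.islower() and code not in code_to_name:
            code_to_name[code] = name
    return code_to_name


# Hypothetical registry: three spellings share one code.
entity_codes = {"alice": "ALC", "Alice": "ALC", "Alicia": "ALC", "Bob": "BOB"}
reverse = build_code_to_name(entity_codes)
print(reverse)
```

Because Python dicts preserve insertion order, "first registered" here means the first non-lowercase name inserted for a given code. A code whose names are all lowercase gets no reverse entry at all under this rule.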

Emotion expansion — Using _REVERSE_EMOTIONS built at module level is efficient. The handling of combined emotions ("determ+hope") by splitting on + is correct.

⚠️ Edge case — field parsing ambiguity:
The expand() method iterates zettel fields and tries to classify each one (entity, quote, emotion, topic). But what if a topic contains +? The check:

elif "+" in field and all(f.strip() in _REVERSE_EMOTIONS or f.strip().isupper() for f in field.split("+"))

would match a topic like "TCP+UDP" (both are uppercase) and treat it as combined emotions, yielding nonsense expansions. Consider checking against _REVERSE_EMOTIONS first and only treating it as combined emotions if at least one part is actually an emotion code.
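One way the suggested guard could look, as a sketch rather than the PR's actual code: the emotion table below is a hypothetical subset (the real `_REVERSE_EMOTIONS` in `mempalace.dialect` may use different codes and words), and both helper names are invented for this example.

```python
# Hypothetical subset of the reverse emotion table, for illustration only.
_REVERSE_EMOTIONS = {"determ": "determination", "hope": "hope", "joy": "joy"}


def is_combined_emotion(field: str) -> bool:
    """Treat a '+'-joined field as emotions only if some part is a known code."""
    if "+" not in field:
        return False
    parts = [part.strip() for part in field.split("+")]
    return any(part in _REVERSE_EMOTIONS for part in parts)


def expand_emotions(field: str) -> str:
    """Map each known code to its word; unknown parts are passed through as-is."""
    parts = [part.strip() for part in field.split("+")]
    return " and ".join(_REVERSE_EMOTIONS.get(part, part) for part in parts)


print(is_combined_emotion("determ+hope"))  # known codes
print(is_combined_emotion("TCP+UDP"))      # uppercase topic, no emotion codes
```

With this guard, `"TCP+UDP"` falls through to whatever branch handles topics, since neither part appears in the emotion table.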

`looks_like_aaak()` Heuristic

Clean implementation. The "pipe-separated with digit-colon prefix on first field" heuristic should have very low false-positive rates on normal text. The early "|" not in text bailout is good.
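For readers who have not opened the diff, the heuristic described above might be sketched as follows. This is an approximation built from the description in this review, not the PR's implementation; the regular expression and the sample AAAK string are assumptions.

```python
import re

# Assumed shape of the prefix: one or more digits followed by a colon.
_DIGIT_COLON_PREFIX = re.compile(r"^\d+:")


def looks_like_aaak(text: str) -> bool:
    """Cheap check: pipe-separated text whose first field starts with '<digits>:'."""
    if "|" not in text:
        # Early bailout: no pipes means it cannot be pipe-separated fields.
        return False
    first_field = text.strip().split("|", 1)[0].strip()
    return bool(_DIGIT_COLON_PREFIX.match(first_field))


print(looks_like_aaak("0:ALC|determ+hope|deploy"))  # hypothetical AAAK-style entry
print(looks_like_aaak("Met Alice today | felt hopeful"))
```

Under these assumptions, prose that happens to contain a pipe is still rejected unless it also begins with a digit-colon prefix, which is what keeps the false-positive rate low.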

MCP Integration

⚠️ Import inside function body:

def tool_diary_write(...):
    from mempalace.dialect import Dialect
    _dialect = Dialect()

This creates a new Dialect instance on every diary write. If diary writes are frequent (our agents write ~50 per cycle), this is wasteful. Consider moving the import and instantiation to module level, or at minimum cache the dialect instance.
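A minimal sketch of the caching option, assuming `Dialect()` takes no arguments and is safe to share between calls. The `Dialect` class below is a stand-in that only counts constructions so the example runs without `mempalace` installed, and `get_dialect` is a hypothetical helper name.

```python
from functools import lru_cache


class Dialect:
    """Stand-in for mempalace.dialect.Dialect; counts how often it is built."""

    instances_created = 0

    def __init__(self) -> None:
        Dialect.instances_created += 1


@lru_cache(maxsize=1)
def get_dialect() -> Dialect:
    """Build the Dialect on first use, then return the same instance."""
    return Dialect()


# Simulate a cycle of 50 diary writes, each asking for the dialect.
for _ in range(50):
    get_dialect()

print(Dialect.instances_created)
```

Compared with a plain module-level `_dialect = Dialect()`, the lazy accessor avoids paying the construction cost at import time for processes that never write an AAAK entry.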

Metadata preservation — Storing the original AAAK in aaak_compressed metadata while embedding the expanded text is the right design. It preserves lossless access to the compressed form while giving the embedding model better input.
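The design can be summarized in a few lines. The sketch below is illustrative only: `prepare_for_embedding` is a hypothetical helper, the detector and expander are passed in as callables, and the toy stand-ins do not reflect real AAAK syntax. Only the `aaak_compressed` key comes from the PR description.

```python
from typing import Callable


def prepare_for_embedding(
    entry: str,
    looks_like_aaak: Callable[[str], bool],
    expand: Callable[[str], str],
) -> tuple[str, dict[str, str]]:
    """Return (document_to_embed, metadata) for one diary entry."""
    if not looks_like_aaak(entry):
        # Plain text passes through unchanged, with no extra metadata.
        return entry, {}
    # Embed the expanded text, but keep the compressed original retrievable.
    return expand(entry), {"aaak_compressed": entry}


# Toy stand-ins so the sketch runs on its own.
def fake_detect(text: str) -> bool:
    return text.startswith("0:")


def fake_expand(text: str) -> str:
    return "expanded: " + text


doc, meta = prepare_for_embedding("0:ALC|hope", fake_detect, fake_expand)
plain_doc, plain_meta = prepare_for_embedding("Had a good day.", fake_detect, fake_expand)
print(doc, meta)
print(plain_doc, plain_meta)
```

The returned document and metadata would then be handed to the vector store together, so a search hit on the expanded text can still surface the original compressed entry.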

Impact Assessment

If this improves R@5 from 84.2% closer to 96.6% for AAAK entries, it's a meaningful quality improvement for anyone using the compress workflow. The change is backward-compatible (plain text entries pass through unchanged), and the looks_like_aaak() gate prevents false expansions.

Tests are comprehensive — 12 new tests covering expand, roundtrip, and heuristic detection. The test_expand_roundtrip_from_compress test is particularly valuable.

Good contribution. The TCP+UDP edge case and the per-call Dialect instantiation are worth addressing.

🔭 Reviewed as part of the MemPalace-AGI integration project — autonomous research with perfect memory. Community interaction updates are posted regularly on the dashboard.

Nitrogonza9 · 2026-04-10T03:13:42Z

Thanks @web3guru888 — you've reviewed all 4 of my main PRs now, really appreciate the thoroughness.

TCP+UDP edge case — fixing now. You're right, the + split for combined emotions would misclassify uppercase topics. I'll add a check that requires at least one part to be a known emotion code before treating the field as combined emotions.

Dialect instantiation per-call — fixing. Moving to a module-level cached instance. No reason to create a new one on every diary write.

Both are clean, targeted fixes. Pushing shortly.

— Gonzalo

@web3guru888

Address review feedback from @web3guru888:

- Fix combined emotion detection: require at least one known emotion code before treating a '+'-separated field as combined emotions. Prevents uppercase topics like "TCP+UDP" from being misclassified as emotions.
- Only instantiate Dialect() when entry is actually AAAK (checked via static looks_like_aaak() first). Plain text entries skip instantiation.
- Add test for uppercase topic edge case.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

web3guru888 · 2026-04-10T04:03:21Z

Both fixes are exactly right — the combined-emotion guard requiring at least one known emotion code before splitting is cleaner than my suggested approach anyway. Moving Dialect to module-level is a trivial win.

Once you push, I'd suggest a quick test with something like HTTPS+REST and TCP+UDP as wing names to confirm the guard catches both. The + in protocol names is common enough in tech docs that it's worth having an explicit regression case.

Looking forward to seeing the updates.

@web3guru888

Per @web3guru888's suggestion, add explicit regression test for common protocol names (HTTPS+REST, TCP+UDP, HTTP+JSON, GRPC+PROTOBUF) to confirm the combined-emotion guard correctly classifies them as topics rather than emotion codes.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

Nitrogonza9 mentioned this pull request Apr 9, 2026

feat: auto-populate knowledge graph from palace drawers #434

Open

8 tasks

web3guru888 reviewed Apr 10, 2026

web3guru888 mentioned this pull request Apr 10, 2026

BUG: Entity detector flags common code terms as projects (Handler, Node, One, Service) #476

Open

bensig changed the base branch from main to develop April 11, 2026 22:22

bensig requested review from bensig, igorls and milla-jovovich as code owners April 11, 2026 22:22

igorls added the area/mcp (MCP server and tools) and enhancement (New feature or request) labels Apr 14, 2026

Nitrogonza9 wants to merge 3 commits into MemPalace:develop from Nitrogonza9:feat/aaak-expand

3 participants
