feat: add AAAK expand for pre-embedding semantic quality #432

Open
Nitrogonza9 wants to merge 3 commits into MemPalace:develop from Nitrogonza9:feat/aaak-expand

@Nitrogonza9

Summary

  • New expand() method on Dialect class — converts AAAK-compressed text back into natural-language fragments suitable for vector embedding. Reverses entity codes to names, reconstructs topics, preserves key sentences, and maps emotion codes to readable words.
  • New looks_like_aaak() static heuristic — detects AAAK format by checking for pipe-separated fields with digit-colon prefixes.
  • Wire into mcp_server.py diary write — AAAK entries are now expanded before ChromaDB embedding. Original compressed form is preserved in aaak_compressed metadata field. Plain text entries pass through unchanged (backward compatible).
  • Addresses the TODO at mcp_server.py:511 — "Future versions should expand AAAK before embedding to improve semantic search quality"

This should improve AAAK mode search quality (currently 84.2% R@5 vs 96.6% raw on LongMemEval) by embedding semantically rich text rather than compressed symbols.

Test plan

  • pytest tests/test_dialect.py -v — 29 tests pass (17 existing + 12 new)
  • pytest tests/ -v — full suite 545 passed, 0 failed
  • ruff check — no lint errors
  • No new dependencies added

🤖 Generated with Claude Code

Add expand() method to Dialect that converts AAAK-compressed text back
into natural-language fragments suitable for semantic embedding. Add
looks_like_aaak() heuristic to detect AAAK format.

Wire into mcp_server diary_write: when an entry is AAAK-compressed,
expand it before passing to ChromaDB for embedding while preserving
the original compressed form in aaak_compressed metadata field. Plain
text entries pass through unchanged (backward compatible).

This addresses the TODO at mcp_server.py:511 and should improve AAAK
mode search quality (currently 84.2% vs 96.6% raw on LongMemEval).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

@web3guru888 left a comment

Review: AAAK Expand for Pre-Embedding Semantic Quality

Good to see this from you again, @Nitrogonza9 — we reviewed your #433 (contradiction detection) and #434 (auto-KG) previously. This one addresses a real measured problem (84.2% vs 96.6% R@5 for AAAK vs raw on LongMemEval).

expand() Method

The reconstruction logic is solid. Walking the decoded structure and reassembling human-readable fragments is the right approach for embedding quality. Specific notes:

Entity code reversal:

for name, code in self.entity_codes.items():
    if not name.islower() and code not in code_to_name:
        code_to_name[code] = name

The not name.islower() filter ensures you prefer proper names ("Alice") over lowercase aliases. Good heuristic. The code not in code_to_name gives first-registered-name priority, which is reasonable.
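
To make the priority rule concrete, here is a minimal standalone sketch of the same inversion. The function name build_code_to_name and the sample mapping are hypothetical, chosen only for illustration; the loop body mirrors the snippet quoted above.

```python
def build_code_to_name(entity_codes: dict[str, str]) -> dict[str, str]:
    """Invert a name -> code mapping, preferring proper names over aliases."""
    code_to_name: dict[str, str] = {}
    for name, code in entity_codes.items():
        # Skip all-lowercase aliases; keep the first non-lowercase name per code.
        if not name.islower() and code not in code_to_name:
            code_to_name[code] = name
    return code_to_name


# Hypothetical sample data: three spellings share the code "ALC".
codes = {"alice": "ALC", "Alice": "ALC", "Alicia": "ALC", "Bob": "BOB"}
mapping = build_code_to_name(codes)
```

With this input, "alice" is skipped as a lowercase alias, "Alice" wins for "ALC" because it is registered before "Alicia", and "Bob" maps to "BOB". One caveat worth keeping in mind: str.islower() returns False for strings with no cased characters (for example "123"), so such names would pass the filter too.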

Emotion expansion — Using _REVERSE_EMOTIONS built at module level is efficient. The handling of combined emotions ("determ+hope") by splitting on + is correct.
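
As a rough illustration of that pattern, the sketch below builds a reverse table once at import time and expands a "+"-joined field. The forward table _EMOTIONS, its entries, the helper name expand_emotions, and the " and " joiner are all assumptions for this example, not the PR's actual data or output format.

```python
# Hypothetical forward table: readable word -> AAAK emotion code.
_EMOTIONS = {"determination": "determ", "hope": "hope", "frustration": "frust"}

# Built once at module import, so each expansion is a plain dict lookup.
_REVERSE_EMOTIONS = {code: word for word, code in _EMOTIONS.items()}


def expand_emotions(field: str) -> str:
    """Expand a single or '+'-combined emotion field into readable words."""
    parts = [part.strip() for part in field.split("+")]
    # Unknown parts are passed through unchanged rather than dropped.
    return " and ".join(_REVERSE_EMOTIONS.get(part, part) for part in parts)
```

Under these assumptions, expand_emotions("determ+hope") gives "determination and hope".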

⚠️ Edge case — field parsing ambiguity:
The expand() method iterates zettel fields and tries to classify each one (entity, quote, emotion, topic). But what if a topic contains +? The check:

elif "+" in field and all(f.strip() in _REVERSE_EMOTIONS or f.strip().isupper() for f in field.split("+"))

would match a topic like "TCP+UDP" (both are uppercase) and treat it as combined emotions, yielding nonsense expansions. Consider checking against _REVERSE_EMOTIONS first and only treating it as combined emotions if at least one part is actually an emotion code.
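
One possible shape for that guard, as a sketch only: the helper name is_combined_emotion and the two-entry emotion table are hypothetical stand-ins, and the real check in expand() may be structured differently.

```python
# Hypothetical stand-in for the module-level reverse emotion table.
_REVERSE_EMOTIONS = {"determ": "determination", "hope": "hope"}


def is_combined_emotion(field: str) -> bool:
    """Treat a '+'-joined field as emotions only if a known code is present."""
    if "+" not in field:
        return False
    parts = [part.strip() for part in field.split("+")]
    # Uppercase alone is no longer enough; at least one part must be a known code.
    return any(part in _REVERSE_EMOTIONS for part in parts)
```

With this guard, "determ+hope" is still recognized as combined emotions, while "TCP+UDP" falls through to whatever topic handling comes next.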

looks_like_aaak() Heuristic

Clean implementation. The "pipe-separated with digit-colon prefix on first field" heuristic should have very low false-positive rates on normal text. The early "|" not in text bailout is good.
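
For readers who have not opened the diff, the heuristic as described could look roughly like the sketch below. This is a reconstruction from the description, not the PR's code; the regex and the sample strings are assumptions.

```python
import re

# Assumed shape of an AAAK first field: one or more digits, then a colon.
_DIGIT_COLON = re.compile(r"^\d+:")


def looks_like_aaak(text: str) -> bool:
    """Cheap format check: pipe-separated, first field starts with 'N:'."""
    if "|" not in text:
        return False  # early bailout for ordinary prose
    first_field = text.strip().split("|", 1)[0].strip()
    return bool(_DIGIT_COLON.match(first_field))
```

A sketch like this would still accept something such as "3:30 pm | standup", so the real implementation may need stricter checks if that kind of text shows up in diaries.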

MCP Integration

⚠️ Import inside function body:

def tool_diary_write(...):
    from mempalace.dialect import Dialect
    _dialect = Dialect()

This creates a new Dialect instance on every diary write. If diary writes are frequent (our agents write ~50 per cycle), this is wasteful. Consider moving the import and instantiation to module level, or at minimum cache the dialect instance.
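
A minimal sketch of the caching option, assuming a zero-argument constructor: the Dialect class below is a stand-in that only counts constructions (it is not mempalace.dialect.Dialect), and get_dialect is a hypothetical helper name.

```python
from functools import lru_cache


class Dialect:
    """Stand-in for the real Dialect; counts how often it is constructed."""

    instances = 0

    def __init__(self) -> None:
        Dialect.instances += 1


@lru_cache(maxsize=1)
def get_dialect() -> Dialect:
    """Create the Dialect lazily on first use, then reuse the same object."""
    return Dialect()


# Simulate 50 diary writes in one cycle.
for _ in range(50):
    get_dialect()
```

After the loop, Dialect.instances is 1. A lazy accessor like this also avoids paying the construction cost at import time for processes that never write an AAAK entry.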

Metadata preservation — Storing the original AAAK in aaak_compressed metadata while embedding the expanded text is the right design. It preserves lossless access to the compressed form while giving the embedding model better input.
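
The design can be sketched as a pure function that decides what gets embedded and what goes into metadata. Everything here except the aaak_compressed key is hypothetical: prepare_diary_record, toy_is_aaak, and toy_expand are illustrative stand-ins, and the actual ChromaDB call is deliberately left out.

```python
from typing import Callable


def prepare_diary_record(
    entry: str,
    is_aaak: Callable[[str], bool],
    expand: Callable[[str], str],
) -> tuple[str, dict[str, str]]:
    """Return (text to embed, extra metadata) for one diary entry."""
    if not is_aaak(entry):
        return entry, {}  # plain text passes through unchanged
    # Embed the expanded text; keep the original compressed form alongside it.
    return expand(entry), {"aaak_compressed": entry}


# Toy stand-ins so the sketch runs without the real Dialect.
def toy_is_aaak(text: str) -> bool:
    return "|" in text


def toy_expand(text: str) -> str:
    return text.replace("|", " ")


plain_doc, plain_meta = prepare_diary_record("Plain entry.", toy_is_aaak, toy_expand)
aaak_doc, aaak_meta = prepare_diary_record("1:ALC|hope", toy_is_aaak, toy_expand)
```

The returned metadata would be merged into whatever metadata the diary write already attaches before the document is handed to the collection.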

Impact Assessment

If this improves R@5 from 84.2% closer to 96.6% for AAAK entries, it's a meaningful quality improvement for anyone using the compress workflow. The change is backward-compatible (plain text entries pass through unchanged), and the looks_like_aaak() gate prevents false expansions.

Tests are comprehensive — 12 new tests covering expand, roundtrip, and heuristic detection. The test_expand_roundtrip_from_compress test is particularly valuable.

Good contribution. The TCP+UDP edge case and the per-call Dialect instantiation are worth addressing.

🔭 Reviewed as part of the MemPalace-AGI integration project — autonomous research with perfect memory. Community interaction updates are posted regularly on the dashboard.

@Nitrogonza9 (Author)

Thanks @web3guru888 — you've reviewed all 4 of my main PRs now, really appreciate the thoroughness.

TCP+UDP edge case — fixing now. You're right, the + split for combined emotions would misclassify uppercase topics. I'll add a check that requires at least one part to be a known emotion code before treating the field as combined emotions.

Dialect instantiation per-call — fixing. Moving to a module-level cached instance. No reason to create a new one on every diary write.

Both are clean, targeted fixes. Pushing shortly.

— Gonzalo

Address review feedback from @web3guru888:

- Fix combined emotion detection: require at least one known emotion code
  before treating a '+'-separated field as combined emotions. Prevents
  uppercase topics like "TCP+UDP" from being misclassified as emotions.
- Only instantiate Dialect() when entry is actually AAAK (checked via
  static looks_like_aaak() first). Plain text entries skip instantiation.
- Add test for uppercase topic edge case.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

@web3guru888

Both fixes are exactly right — the combined-emotion guard requiring at least one known emotion code before splitting is cleaner than my suggested approach anyway. Moving Dialect to module-level is a trivial win.

Once you push, I'd suggest a quick test with something like HTTPS+REST and TCP+UDP as wing names to confirm the guard catches both. The + in protocol names is common enough in tech docs that it's worth having an explicit regression case.

Looking forward to seeing the updates.

Per @web3guru888's suggestion, add explicit regression test for
common protocol names (HTTPS+REST, TCP+UDP, HTTP+JSON, GRPC+PROTOBUF)
to confirm the combined-emotion guard correctly classifies them as
topics rather than emotion codes.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
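
A regression test along these lines might look like the sketch below. It uses unittest so it runs without extra dependencies; classify_plus_field and KNOWN_EMOTION_CODES are hypothetical stand-ins for the real classification inside expand(), so this shows the shape of the test rather than the PR's actual test code.

```python
import unittest

# Hypothetical stand-in for the set of known emotion codes.
KNOWN_EMOTION_CODES = {"determ", "hope"}


def classify_plus_field(field: str) -> str:
    """Classify a '+'-joined field as 'emotion' or 'topic' (toy version)."""
    parts = [part.strip() for part in field.split("+")]
    is_emotion = any(part in KNOWN_EMOTION_CODES for part in parts)
    return "emotion" if is_emotion else "topic"


class ProtocolTopicRegression(unittest.TestCase):
    def test_protocol_names_are_topics(self) -> None:
        for field in ("HTTPS+REST", "TCP+UDP", "HTTP+JSON", "GRPC+PROTOBUF"):
            with self.subTest(field=field):
                self.assertEqual(classify_plus_field(field), "topic")

    def test_known_emotions_still_match(self) -> None:
        self.assertEqual(classify_plus_field("determ+hope"), "emotion")
```

Using subTest keeps each protocol pair reported separately, so a failure on one name does not hide the others.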

@bensig changed the base branch from main to develop on April 11, 2026, 22:22
@igorls added the area/mcp (MCP server and tools) and enhancement (New feature or request) labels on Apr 14, 2026
3 participants