feat: add AAAK expand for pre-embedding semantic quality#432
Nitrogonza9 wants to merge 3 commits into MemPalace:develop
Add expand() method to Dialect that converts AAAK-compressed text back into natural-language fragments suitable for semantic embedding. Add looks_like_aaak() heuristic to detect AAAK format.

Wire into mcp_server diary_write: when an entry is AAAK-compressed, expand it before passing to ChromaDB for embedding while preserving the original compressed form in the aaak_compressed metadata field. Plain text entries pass through unchanged (backward compatible).

This addresses the TODO at mcp_server.py:511 and should improve AAAK mode search quality (currently 84.2% vs 96.6% raw on LongMemEval).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
web3guru888
left a comment
Review: AAAK Expand for Pre-Embedding Semantic Quality
Good to see this from you again, @Nitrogonza9 — we reviewed your #433 (contradiction detection) and #434 (auto-KG) previously. This one addresses a real measured problem (84.2% vs 96.6% R@5 for AAAK vs raw on LongMemEval).
expand() Method
The reconstruction logic is solid. Walking the decoded structure and reassembling human-readable fragments is the right approach for embedding quality. Specific notes:
Entity code reversal:
    for name, code in self.entity_codes.items():
        if not name.islower() and code not in code_to_name:
            code_to_name[code] = name

The not name.islower() filter ensures you prefer proper names ("Alice") over lowercase aliases. Good heuristic. The code not in code_to_name check gives first-registered-name priority, which is reasonable.
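To illustrate how the two conditions interact, here is a minimal standalone sketch of the reversal. The entity_codes contents and the build_code_to_name helper are hypothetical stand-ins for this example, not the PR's actual data or API.

```python
# Hypothetical stand-in for Dialect.entity_codes: names and aliases -> codes.
entity_codes = {
    "Alice": "ALC",
    "alice": "ALC",   # lowercase alias, skipped by the islower() filter
    "Ally": "ALC",    # registered later, loses to "Alice"
    "Bob": "BOB",
    "bob": "BOB",
}


def build_code_to_name(entity_codes):
    """Invert name -> code, preferring the first non-lowercase name per code."""
    code_to_name = {}
    for name, code in entity_codes.items():
        if not name.islower() and code not in code_to_name:
            code_to_name[code] = name
    return code_to_name


code_to_name = build_code_to_name(entity_codes)
print(code_to_name)
```

Because Python dicts preserve insertion order, "first registered" here means first inserted into entity_codes.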
Emotion expansion — Using _REVERSE_EMOTIONS built at module level is efficient. The handling of combined emotions ("determ+hope") by splitting on + is correct.
The expand() method iterates zettel fields and tries to classify each one (entity, quote, emotion, topic). But what if a topic contains +? The check:

    elif "+" in field and all(f.strip() in _REVERSE_EMOTIONS or f.strip().isupper() for f in field.split("+"))

would match a topic like "TCP+UDP" (both parts are uppercase) and treat it as combined emotions, yielding nonsense expansions. Consider checking against _REVERSE_EMOTIONS first and only treating the field as combined emotions if at least one part is actually an emotion code.
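A minimal sketch of the suggested guard, assuming a small hypothetical emotion table (the expansions shown are illustrative; only the "determ" and "hope" codes appear in this review). The is_combined_emotion helper is a name invented for this example.

```python
# Hypothetical subset of the module-level table; the real mapping may differ.
_REVERSE_EMOTIONS = {"determ": "determination", "hope": "hope", "joy": "joy"}


def is_combined_emotion(field):
    """Treat a '+'-separated field as emotions only if one part is a known code."""
    if "+" not in field:
        return False
    parts = [p.strip() for p in field.split("+")]
    # Original shape check: every part is a known emotion code or uppercase.
    shape_ok = all(p in _REVERSE_EMOTIONS or p.isupper() for p in parts)
    # Added guard: at least one part must be a real emotion code.
    has_emotion = any(p in _REVERSE_EMOTIONS for p in parts)
    return shape_ok and has_emotion


for sample in ("determ+hope", "TCP+UDP", "HTTPS+REST"):
    print(sample, is_combined_emotion(sample))
```

With the guard in place, all-uppercase topics fall through to whatever branch handles topics instead of being expanded as emotions.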
looks_like_aaak() Heuristic
Clean implementation. The "pipe-separated with digit-colon prefix on first field" heuristic should have very low false-positive rates on normal text. The early "|" not in text bailout is good.
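As a rough illustration of the heuristic described here, the sketch below checks for a pipe and a digit-colon prefix on the first field. It is an assumption about the shape of the check, not the PR's code, and the sample AAAK string is made up for the example.

```python
import re

# Digit(s) followed by a colon at the start of the first field.
_FIRST_FIELD = re.compile(r"^\d+:")


def looks_like_aaak(text):
    """Cheap format sniff: pipe-separated, first field starts with 'N:'."""
    if "|" not in text:          # early bailout for ordinary prose
        return False
    first_field = text.strip().split("|", 1)[0]
    return bool(_FIRST_FIELD.match(first_field.strip()))


print(looks_like_aaak("0:ALC+BOB|auth_migration|determ+hope"))  # hypothetical AAAK line
print(looks_like_aaak("Had a good day | met Bob for lunch"))
```

A sketch this simple would also accept a line such as "3:30 pm | standup", so the real implementation may need a stricter pattern than the one assumed here.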
MCP Integration
    def tool_diary_write(...):
        from mempalace.dialect import Dialect
        _dialect = Dialect()

This creates a new Dialect instance on every diary write. If diary writes are frequent (our agents write ~50 per cycle), this is wasteful. Consider moving the import and instantiation to module level, or at minimum caching the dialect instance.
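One way to cache the instance is a lazily initialised accessor. The sketch below uses a stand-in Dialect class with an instance counter so the effect is visible; get_dialect is a hypothetical helper name, not part of the PR.

```python
from functools import lru_cache


class Dialect:
    """Stand-in for mempalace.dialect.Dialect; counts constructions."""
    instances = 0

    def __init__(self):
        Dialect.instances += 1


@lru_cache(maxsize=1)
def get_dialect():
    """Build the Dialect on first use, then reuse the same object."""
    return Dialect()


first = get_dialect()
second = get_dialect()
print(first is second, Dialect.instances)
```

Lazy construction keeps import time unchanged and still avoids rebuilding the object on each diary write.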
Metadata preservation — Storing the original AAAK in aaak_compressed metadata while embedding the expanded text is the right design. It preserves lossless access to the compressed form while giving the embedding model better input.
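The design can be sketched as a small function that decides what gets embedded and what goes into metadata. The prepare_diary_entry name and its callable parameters are assumptions made for this example; only the aaak_compressed field name comes from the PR.

```python
def prepare_diary_entry(entry, looks_like_aaak, expand):
    """Return (document_to_embed, metadata) for a diary entry.

    looks_like_aaak and expand are passed in as callables so this sketch
    stays independent of the real Dialect class.
    """
    metadata = {}
    document = entry
    if looks_like_aaak(entry):
        document = expand(entry)             # embed the readable expansion
        metadata["aaak_compressed"] = entry  # keep the original form
    return document, metadata


# Stub detector and expander, for illustration only.
doc, meta = prepare_diary_entry(
    "0:ALC|auth_migration",
    lambda text: "|" in text,
    lambda text: "Alice; auth migration",
)
print(doc, meta)
```

Plain text takes the untouched path: the document is the entry itself and no aaak_compressed key is added.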
Impact Assessment
If this improves R@5 from 84.2% closer to 96.6% for AAAK entries, it's a meaningful quality improvement for anyone using the compress workflow. The change is backward-compatible (plain text entries pass through unchanged), and the looks_like_aaak() gate prevents false expansions.
Tests are comprehensive — 12 new tests covering expand, roundtrip, and heuristic detection. The test_expand_roundtrip_from_compress test is particularly valuable.
Good contribution. The TCP+UDP edge case and the per-call Dialect instantiation are worth addressing.
🔭 Reviewed as part of the MemPalace-AGI integration project — autonomous research with perfect memory. Community interaction updates are posted regularly on the dashboard.
Thanks @web3guru888. You've reviewed all 4 of my main PRs now, and I really appreciate the thoroughness.

TCP+UDP edge case: you're right, fixing now.

Per-call Dialect instantiation: fixing as well, by moving to a module-level cached instance. There is no reason to create a new one on every diary write.

Both are clean, targeted fixes. Pushing shortly.

Gonzalo
Address review feedback from @web3guru888:

- Fix combined emotion detection: require at least one known emotion code before treating a '+'-separated field as combined emotions. Prevents uppercase topics like "TCP+UDP" from being misclassified as emotions.
- Only instantiate Dialect() when entry is actually AAAK (checked via static looks_like_aaak() first). Plain text entries skip instantiation.
- Add test for uppercase topic edge case.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Both fixes are exactly right. The combined-emotion guard requiring at least one known emotion code before splitting is cleaner than my suggested approach anyway. Moving Dialect to module level is a trivial win. Once you push, I'd suggest a quick test with something like […]. Looking forward to seeing the updates.
Per @web3guru888's suggestion, add explicit regression test for common protocol names (HTTPS+REST, TCP+UDP, HTTP+JSON, GRPC+PROTOBUF) to confirm the combined-emotion guard correctly classifies them as topics rather than emotion codes.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Summary
- expand() method on Dialect class: converts AAAK-compressed text back into natural-language fragments suitable for vector embedding. Reverses entity codes to names, reconstructs topics, preserves key sentences, and maps emotion codes to readable words.
- looks_like_aaak() static heuristic: detects AAAK format by checking for pipe-separated fields with digit-colon prefixes.
- mcp_server.py diary write: AAAK entries are now expanded before ChromaDB embedding. The original compressed form is preserved in the aaak_compressed metadata field. Plain text entries pass through unchanged (backward compatible).
- Addresses the TODO at mcp_server.py:511: "Future versions should expand AAAK before embedding to improve semantic search quality"

This should improve AAAK mode search quality (currently 84.2% R@5 vs 96.6% raw on LongMemEval) by embedding semantically rich text rather than compressed symbols.
Test plan
- pytest tests/test_dialect.py -v: 29 tests pass (17 existing + 12 new)
- pytest tests/ -v: full suite 545 passed, 0 failed
- ruff check: no lint errors

🤖 Generated with Claude Code