I've been doing reviews of agentic memory systems and figured I'd flag this: no other system in my survey has shown this degree of mismatch between the README's claims and what's actually in the code.
| README claim | What the code actually does | Severity |
|---|---|---|
| "Contradiction detection" — automatically flags inconsistencies against the knowledge graph | `knowledge_graph.py` has no contradiction detection. The only dedup is blocking identical open triples (same subject/predicate/object where `valid_to IS NULL`). Conflicting facts (e.g., two different `married_to` values) accumulate silently. | Feature does not exist |
| "30x compression, zero information loss" — AAAK described as "lossless shorthand" | AAAK is lossy abbreviation: regex entity codes + keyword frequency + 55-char sentence truncation. `decode()` is string splitting — no original text reconstruction. Token counting uses a `len(text)//3` heuristic. LongMemEval drops from 96.6% to 84.2% in AAAK mode — a 12.4pp quality loss. | Claim is false |
| 96.6% LongMemEval R@5 (headline, positioned as MemPalace's score) | Real score, but measured in "raw mode" — uncompressed verbatim text stored in ChromaDB, standard nearest-neighbor retrieval. The palace structure (wings/rooms/halls) is not involved. This measures ChromaDB's default embedding model performance, not MemPalace. | Misleading attribution |
| "+34% retrieval boost from palace structure" | Narrowing search scope from all drawers → wing → wing+room. This is metadata filtering — a standard technique in any vector DB, not a novel retrieval mechanism. | Misleading framing |
| "100% with Haiku rerank" | Not in the benchmark scripts. Method undocumented and unverifiable from the repo. | Unverifiable |
| "Closets" as compressed summaries | AAAK produces abbreviations, not summaries. No evidence of a separate closet storage tier distinct from drawers. | Nomenclature mismatch |
| Hall types structurally enforced | Halls exist as metadata strings but are not used in retrieval ranking or enforced as constraints. | Conceptual, not functional |
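To make the first row concrete, here's a minimal sketch (all names — `add_triple`, the `triples` table — are hypothetical, not taken from `knowledge_graph.py`) of the dedup-only behavior described above: an *identical* open triple is rejected, but a *conflicting* one sails through.

```python
# Hypothetical sketch of dedup-only insertion: the store blocks an exact
# duplicate among open rows (valid_to IS NULL) but has no notion of a
# contradiction, so conflicting facts accumulate silently.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE triples (
    subject TEXT, predicate TEXT, object TEXT, valid_to TEXT)""")

def add_triple(subject, predicate, obj):
    # Only guard: reject an exact duplicate among open triples.
    dup = conn.execute(
        "SELECT 1 FROM triples WHERE subject=? AND predicate=? AND object=? "
        "AND valid_to IS NULL", (subject, predicate, obj)).fetchone()
    if dup:
        return False
    conn.execute("INSERT INTO triples VALUES (?,?,?,NULL)",
                 (subject, predicate, obj))
    return True

add_triple("alice", "married_to", "bob")
add_triple("alice", "married_to", "bob")    # rejected: identical open triple
add_triple("alice", "married_to", "carol")  # accepted: contradiction accumulates

open_facts = conn.execute(
    "SELECT object FROM triples WHERE subject='alice' "
    "AND predicate='married_to' AND valid_to IS NULL").fetchall()
print([o for (o,) in open_facts])  # → ['bob', 'carol']
```

Actual contradiction detection would need a check on (subject, predicate) pairs — e.g. closing the old triple's `valid_to` or flagging the conflict — which is exactly what's absent here.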
There's a lot to like conceptually, but the combination of these claims and the benchmark setup is concerning: the LongMemEval run uses raw ChromaDB (so it measures ChromaDB's embeddings, not the palace structure at all), both AAAK and room-boosting *decrease* the score, and ConvoMem is extremely truncated.
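To illustrate why keyword-frequency abbreviation of the kind described above cannot be "lossless", here's a minimal sketch (the `encode`/`decode`/`estimate_tokens` names and the sentence are hypothetical, not MemPalace code) showing the round trip losing word order and content, plus the `len(text)//3` token heuristic:

```python
# Hypothetical illustration of lossy "shorthand" compression: keep the most
# frequent words, truncate to 55 chars, and estimate tokens as len//3.
from collections import Counter

MAX_LEN = 55  # sentence truncation length cited in the analysis

def encode(text):
    # Keyword-frequency "compression": keeps words, drops order and context.
    words = [w.lower().strip(".,") for w in text.split()]
    top = [w for w, _ in Counter(words).most_common(6)]
    return " ".join(top)[:MAX_LEN]

def decode(shorthand):
    # String splitting only: the original sentence cannot be reconstructed.
    return shorthand.split()

def estimate_tokens(text):
    return len(text) // 3  # crude heuristic, not a real tokenizer

original = "Alice told Bob she had moved from Paris to Lisbon in March."
short = encode(original)
assert decode(short) != original.split()  # round trip loses information
```

Anything built this way is a summary heuristic at best; calling it "zero information loss" is a category error.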
Full analysis for review: https://github.com/lhl/agentic-memory/blob/main/ANALYSIS-mempalace.md
UPDATE: @milla-jovovich has acknowledged our findings and has been actively pushing fixes. 🥳
For those interested in the remediations, links to the relevant comments in this issue: