
Multiple issues between README claims and codebase #27

@lhl


I've been reviewing agentic memory systems and wanted to flag this: no other system in my survey has shown a gap this large between what the README claims and what the code actually does.

| README claim | What the code actually does | Severity |
|---|---|---|
| "Contradiction detection": automatically flags inconsistencies against the knowledge graph | `knowledge_graph.py` has no contradiction detection. The only dedup is blocking identical open triples (same subject/predicate/object where `valid_to IS NULL`). Conflicting facts (e.g., two different `married_to` values) accumulate silently. | Feature does not exist |
| "30x compression, zero information loss": AAAK described as "lossless shorthand" | AAAK is lossy abbreviation: regex entity codes + keyword frequency + 55-char sentence truncation. `decode()` is string splitting, with no original text reconstruction. Token counting uses a `len(text)//3` heuristic. LongMemEval drops from 96.6% to 84.2% in AAAK mode, a 12.4pp quality loss. | Claim is false |
| 96.6% LongMemEval R@5 (headline, positioned as MemPalace's score) | Real score, but measured in "raw mode": uncompressed verbatim text stored in ChromaDB, standard nearest-neighbor retrieval. The palace structure (wings/rooms/halls) is not involved. This measures ChromaDB's default embedding model performance, not MemPalace. | Misleading attribution |
| "+34% retrieval boost from palace structure" | Narrowing search scope from all drawers → wing → wing+room. This is metadata filtering, a standard technique in any vector DB, not a novel retrieval mechanism. | Misleading framing |
| "100% with Haiku rerank" | Not in the benchmark scripts. Method undocumented and unverifiable from the repo. | Unverifiable |
| "Closets" as compressed summaries | AAAK produces abbreviations, not summaries. No evidence of a separate closet storage tier distinct from drawers. | Nomenclature mismatch |
| Hall types structurally enforced | Halls exist as metadata strings but are not used in retrieval ranking or enforced as constraints. | Conceptual, not functional |
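To make the contradiction-detection row concrete, here is a minimal sketch of the difference between identical-triple dedup and an actual conflict check. This is not MemPalace's code: the `Triple` class, the `FUNCTIONAL_PREDICATES` set, and both function names are hypothetical, and it assumes a contradiction means two open triples with the same subject and a single-valued predicate but different objects.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Triple:
    subject: str
    predicate: str
    obj: str
    valid_to: Optional[str] = None  # None means the fact is still open


# Assumption for illustration: predicates that allow one open value per subject.
FUNCTIONAL_PREDICATES = {"married_to"}


def is_duplicate(store: List[Triple], new: Triple) -> bool:
    """Identical-triple dedup: same subject, predicate, and object, still open."""
    return any(
        t.valid_to is None
        and (t.subject, t.predicate, t.obj) == (new.subject, new.predicate, new.obj)
        for t in store
    )


def find_conflicts(store: List[Triple], new: Triple) -> List[Triple]:
    """Conflict check: open triples with the same subject and single-valued
    predicate but a different object."""
    if new.predicate not in FUNCTIONAL_PREDICATES:
        return []
    return [
        t
        for t in store
        if t.valid_to is None
        and t.subject == new.subject
        and t.predicate == new.predicate
        and t.obj != new.obj
    ]


store = [Triple("alice", "married_to", "bob")]
new = Triple("alice", "married_to", "carol")

print(is_duplicate(store, new))    # False: dedup alone lets the new fact through
print(find_conflicts(store, new))  # contains the existing alice/bob triple
```

With only the first check in place, both `married_to` facts end up stored side by side, which is the silent accumulation described in the table.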

There's a lot to like conceptually, but this, combined with the benchmark issues, is concerning: LongMemEval is run against raw ChromaDB, which just measures its embeddings and does not use the palace structure at all; both AAAK and room-boosting decrease the score; and ConvoMem is extremely truncated.
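As an illustration of why I describe the wing/room narrowing as metadata filtering, here is a small pure-Python sketch. It uses neither MemPalace's code nor ChromaDB's API: the `search` function, the `drawers` layout, and the toy 2-D vectors are all assumptions made up for this example. The ranking step is the same nearest-neighbor sort in every call; only the candidate set changes.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(drawers, query_vec, k=5, where=None):
    """Filter candidates by exact metadata match, then rank by similarity."""
    where = where or {}
    candidates = [
        d for d in drawers
        if all(d["meta"].get(key) == value for key, value in where.items())
    ]
    ranked = sorted(candidates, key=lambda d: cosine(d["vec"], query_vec), reverse=True)
    return [d["id"] for d in ranked[:k]]


# Hypothetical drawers with made-up embeddings and wing/room metadata.
drawers = [
    {"id": "d1", "vec": [1.0, 0.0], "meta": {"wing": "work", "room": "auth"}},
    {"id": "d2", "vec": [0.9, 0.1], "meta": {"wing": "home", "room": "garden"}},
    {"id": "d3", "vec": [0.0, 1.0], "meta": {"wing": "work", "room": "billing"}},
]
query = [1.0, 0.0]

all_hits = search(drawers, query, k=2)                          # all drawers
wing_hits = search(drawers, query, k=2, where={"wing": "work"})  # wing only
room_hits = search(drawers, query, k=2, where={"wing": "work", "room": "auth"})

print(all_hits)   # ['d1', 'd2']
print(wing_hits)  # ['d1', 'd3']
print(room_hits)  # ['d1']
```

Any gain from the narrower scopes depends on knowing the right wing or room up front, so it says more about the filter than about the retrieval mechanism.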

Full analysis for review: https://github.com/lhl/agentic-memory/blob/main/ANALYSIS-mempalace.md


UPDATE: @milla-jovovich has acknowledged our findings and has been actively pushing fixes. 🥳

For those interested in the remediations, here are links to the relevant comments in this issue:


Labels: bug, documentation
