Hey, great work on mempalace. The core insight — raw verbatim text + good embeddings beats LLM extraction — is genuinely valuable and your benchmark scripts are refreshingly reproducible.
I maintain agentmemory, a persistent memory system for AI coding agents. We recently ran LongMemEval-S ourselves and got 95.2% R@5 (BM25+vector hybrid) using the same `all-MiniLM-L6-v2` embedding model. While comparing approaches, we dug into the benchmark methodology and wanted to share some findings that might help strengthen your claims.
## What we verified
- 96.6% R@5 is reproducible. We confirmed this independently. The number is real and deterministic.
- 88.9% R@10 on LoCoMo (no rerank) appears to be the honest, clean score. Good result.
- Raw verbatim > LLM extraction — this insight holds up. Our BM25-only baseline gets 86.2% R@5 on the same dataset, confirming that good retrieval beats lossy compression.
## What concerns us
### 1. Metric category error on LongMemEval
LongMemEval is an end-to-end QA benchmark. Every score on the published leaderboard is QA accuracy (retrieve + generate answer + GPT-4o judge). The 96.6% is `recall_any@5` — a retrieval-only metric that never generates an answer or invokes a judge.
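To make the distinction concrete, here's a minimal sketch (our own illustrative Python, not your eval script) of what a `recall_any@k` check amounts to versus what a QA-accuracy pipeline has to do:

```python
# Hypothetical illustration: recall_any@k only asks whether ANY gold evidence
# id appears among the top-k retrieved ids. Nothing is generated, no judge runs.
def recall_any_at_k(retrieved_ids, gold_ids, k=5):
    return 1.0 if set(retrieved_ids[:k]) & set(gold_ids) else 0.0

# QA accuracy instead scores a generated answer with a judge model, so both
# retrieval misses and generation/judging errors count against the system.
```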
This makes the number incomparable to anything on the leaderboard:
| System | Metric | Score |
|---|---|---|
| OMEGA | QA accuracy | 95.4% |
| Mastra | QA accuracy | 84.2% (gpt-4o) |
| EmergenceMem | QA accuracy | 86% |
| Oracle GPT-4o | QA accuracy | ~82.4% |
| mempalace | retrieval recall | 96.6% |
An independent tester (Issue #39) ran the full pipeline and got 82.6% QA accuracy — competitive but substantially different from 96.6%.
We label our own 95.2% explicitly as "retrieval recall, not end-to-end QA accuracy" in LONGMEMEVAL.md. We'd suggest the same clarity here; it would help the community compare fairly.
### 2. The 100% R@5 score and the 3 targeted patches
The path from 96.6% → 100% involved 3 hand-coded patches for 3 specific failing questions (quoted-phrase boost, person-name boost, nostalgia pattern). Your own BENCHMARKS.md acknowledges this is "teaching to the test." The held-out split was created after the patches, not before.
The 98.4% held-out score is more credible, but the split is post-hoc. A pre-registered dev/test split would make this bulletproof.
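For what it's worth, pre-registering can be as simple as fixing a seed and committing the split before any patch is written; a generic sketch (names are ours, not from either repo):

```python
import random

def preregistered_split(question_ids, dev_frac=0.5, seed=0):
    """Deterministic dev/held-out split; commit the result before any tuning."""
    ids = sorted(question_ids)          # stable order, independent of input order
    random.Random(seed).shuffle(ids)    # fixed-seed shuffle
    cut = int(len(ids) * dev_frac)
    return ids[:cut], ids[cut:]         # (dev set for patches, untouched test set)
```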
### 3. LoCoMo 100% with `top_k=50`
The 100% LoCoMo claim uses `top_k=50` against conversations with at most 32 sessions — retrieving the entire conversation. Your BENCHMARKS.md correctly notes "the embedding retrieval step is bypassed entirely." The honest number (88.9% R@10 at `top_k=10`) should lead.
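The arithmetic is mechanical; a toy illustration with the numbers above:

```python
# When top_k >= the number of candidate sessions, every session is returned,
# so recall@k is 1.0 by construction and the ranking is never actually tested.
n_sessions = 32   # LoCoMo conversations have at most 32 sessions
top_k = 50
print(top_k >= n_sessions)   # True -> recall@50 is trivially 100%
```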
### 4. ConvoMem "2x Mem0"
The 92.9% is retrieval recall; Mem0's published numbers are QA accuracy. Different metrics on the same dataset.
### 5. `--mode raw` benchmarks ChromaDB, not mempalace
In raw mode, zero mempalace code executes — no palace, no wings, no rooms, no AAAK. The 96.6% is really a benchmark of ChromaDB + MiniLM-L6-v2. This is useful information but shouldn't be attributed to the palace architecture.
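To make that concrete, our understanding is that the raw-mode setup reduces to something like the following plain ChromaDB baseline (our own sketch with placeholder data, not your benchmark code):

```python
import chromadb
from chromadb.utils import embedding_functions

# Plain ChromaDB + all-MiniLM-L6-v2; no palace/wing/room structure involved.
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2")
client = chromadb.Client()
collection = client.create_collection("raw_sessions", embedding_function=ef)

collection.add(ids=["s1", "s2"],
               documents=["session one text", "session two text"])
hits = collection.query(query_texts=["example question"], n_results=5)
```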
## What we think is genuinely strong
- Zero-API retrieval quality — 96.6% R@5 with no external calls is legitimately the highest published zero-API retrieval score on LongMemEval-S
- Honest disclosure in BENCHMARKS.md — the caveats are documented, even if they don't make it to the README
- Reproducible scripts — anyone can verify the numbers
- Temporal knowledge graph with validity windows — this is architecturally interesting and something we don't have
- Minimal dependencies — ChromaDB + PyYAML is genuinely simpler than our setup
## How agentmemory approaches this differently
We took the opposite bet: compress observations into structured facts/narratives to keep context injection under a token budget (~2K tokens vs raw text). Our search uses triple-stream retrieval (BM25 + vector + knowledge graph) with RRF fusion.
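For reference, the fusion step is standard reciprocal rank fusion; a minimal generic sketch (the k=60 constant is the common default, not necessarily our production value):

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Fuse several ranked id lists (e.g. BM25, vector, graph) by reciprocal rank."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# usage: fused_ids = rrf_fuse([bm25_ids, vector_ids, graph_ids])
```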
On LongMemEval-S with the same embedding model:
| System | R@5 | R@10 | NDCG@10 |
|---|---|---|---|
| agentmemory BM25+Vector | 95.2% | 98.6% | 87.9% |
| agentmemory BM25-only | 86.2% | 94.6% | 73.0% |
| mempalace raw vector | 96.6% | ~97.6% | — |
We're 1.4 pp behind on R@5 but ahead on R@10 (98.6% vs ~97.6%). BM25 adds recall depth that pure vector search misses.
## Suggestion: the two approaches are complementary
mempalace excels at "what does this codebase contain?" (static corpus search).
agentmemory excels at "what did I do across sessions?" (temporal memory).
A developer could use both:
- mempalace maps the territory (structure, relationships, communities)
- agentmemory remembers the journey (decisions, bugs, patterns learned)
Would be interesting to explore integration points — mempalace's knowledge graph feeding into agentmemory's context injection, for example.
Not trying to start a flame war — genuinely think both projects push the space forward. The benchmark methodology feedback is meant to strengthen your claims, not undermine them. Happy to discuss any of this.