Hey, great work on mempalace. The core insight — raw verbatim text + good embeddings beats LLM extraction — is genuinely valuable and your benchmark scripts are refreshingly reproducible.
I maintain agentmemory, a persistent memory system for AI coding agents. We recently ran LongMemEval-S ourselves and got 95.2% R@5 (BM25+vector hybrid) using the same `all-MiniLM-L6-v2` embedding model. While comparing approaches, we dug into the benchmark methodology and wanted to share some findings that might help strengthen your claims.
## What we verified
- 96.6% R@5 is reproducible. We confirmed this independently. The number is real and deterministic.
- 88.9% R@10 on LoCoMo (no rerank) appears to be the honest, clean score. Good result.
- Raw verbatim > LLM extraction — this insight holds up. Our BM25-only baseline gets 86.2% R@5 on the same dataset, confirming that good retrieval beats lossy compression.
## What concerns us
### 1. Metric category error on LongMemEval
LongMemEval is an end-to-end QA benchmark. Every score on the published leaderboard is QA accuracy (retrieve + generate answer + GPT-4o judge). The 96.6% is `recall_any@5` — a retrieval-only metric that never generates an answer or invokes a judge.
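To make the distinction concrete, here's a minimal sketch (our own illustrative Python, not your eval script) of what a `recall_any@k` check amounts to versus what a QA-accuracy pipeline has to do:

```python
# Hypothetical illustration: recall_any@k only asks whether ANY gold evidence
# id appears among the top-k retrieved ids. Nothing is generated, no judge runs.
def recall_any_at_k(retrieved_ids, gold_ids, k=5):
    return 1.0 if set(retrieved_ids[:k]) & set(gold_ids) else 0.0

# QA accuracy instead scores a generated answer with a judge model, so both
# retrieval misses and generation/judging errors count against the system.
```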
This makes the number incomparable to anything on the leaderboard:
| System | Metric | Score |
|---|---|---|
| OMEGA | QA accuracy | 95.4% |
| Mastra | QA accuracy | 84.2% (gpt-4o) |
| EmergenceMem | QA accuracy | 86% |
| Oracle GPT-4o | QA accuracy | ~82.4% |
| mempalace | retrieval recall | 96.6% |
An independent tester (Issue #39) ran the full pipeline and got 82.6% QA accuracy — competitive but substantially different from 96.6%.
We label our own 95.2% explicitly as "retrieval recall, not end-to-end QA accuracy" in LONGMEMEVAL.md. We'd suggest the same clarity here; it would help the community compare fairly.
### 2. The 100% R@5 score and the 3 targeted patches
The path from 96.6% → 100% involved 3 hand-coded patches for 3 specific failing questions (quoted-phrase boost, person-name boost, nostalgia pattern). Your own BENCHMARKS.md acknowledges this is "teaching to the test." The held-out split was created after the patches, not before.
The 98.4% held-out score is more credible, but the split is post-hoc. A pre-registered dev/test split would make this bulletproof.
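For what it's worth, pre-registering can be as simple as fixing a seed and committing the split before any patch is written; a generic sketch (names are ours, not from either repo):

```python
import random

def preregistered_split(question_ids, dev_frac=0.5, seed=0):
    """Deterministic dev/held-out split; commit the result before any tuning."""
    ids = sorted(question_ids)          # stable order, independent of input order
    random.Random(seed).shuffle(ids)    # fixed-seed shuffle
    cut = int(len(ids) * dev_frac)
    return ids[:cut], ids[cut:]         # (dev set for patches, untouched test set)
```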
### 3. LoCoMo 100% with `top_k=50`
The 100% LoCoMo claim uses `top_k=50` against conversations with at most 32 sessions — retrieving the entire conversation. Your BENCHMARKS.md correctly notes "the embedding retrieval step is bypassed entirely." The honest number (88.9% R@10 at `top_k=10`) should lead.
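The arithmetic is mechanical; a toy illustration with the numbers above:

```python
# When top_k >= the number of candidate sessions, every session is returned,
# so recall@k is 1.0 by construction and the ranking is never actually tested.
n_sessions = 32   # LoCoMo conversations have at most 32 sessions
top_k = 50
print(top_k >= n_sessions)   # True -> recall@50 is trivially 100%
```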
### 4. ConvoMem "2x Mem0"
The 92.9% is retrieval recall; Mem0's published numbers are QA accuracy. Different metrics on the same dataset.
### 5. `--mode raw` benchmarks ChromaDB, not mempalace
In raw mode, zero mempalace code executes — no palace, no wings, no rooms, no AAAK. The 96.6% is really a benchmark of ChromaDB + MiniLM-L6-v2. This is useful information but shouldn't be attributed to the palace architecture.
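To make that concrete, our understanding is that the raw-mode setup reduces to something like the following plain ChromaDB baseline (our own sketch with placeholder data, not your benchmark code):

```python
import chromadb
from chromadb.utils import embedding_functions

# Plain ChromaDB + all-MiniLM-L6-v2; no palace/wing/room structure involved.
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2")
client = chromadb.Client()
collection = client.create_collection("raw_sessions", embedding_function=ef)

collection.add(ids=["s1", "s2"],
               documents=["session one text", "session two text"])
hits = collection.query(query_texts=["example question"], n_results=5)
```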
## What we think is genuinely strong
- Zero-API retrieval quality — 96.6% R@5 with no external calls is legitimately the highest published zero-API retrieval score on LongMemEval-S
- Honest disclosure in BENCHMARKS.md — the caveats are documented, even if they don't make it to the README
- Reproducible scripts — anyone can verify the numbers
- Temporal knowledge graph with validity windows — this is architecturally interesting and something we don't have
- Minimal dependencies — ChromaDB + PyYAML is genuinely simpler than our setup
## How agentmemory approaches this differently
We took the opposite bet: compress observations into structured facts/narratives to keep context injection under a token budget (~2K tokens vs raw text). Our search uses triple-stream retrieval (BM25 + vector + knowledge graph) with RRF fusion.
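For reference, the fusion step is standard reciprocal rank fusion; a minimal generic sketch (the k=60 constant is the common default, not necessarily our production value):

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Fuse several ranked id lists (e.g. BM25, vector, graph) by reciprocal rank."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# usage: fused_ids = rrf_fuse([bm25_ids, vector_ids, graph_ids])
```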
On LongMemEval-S with the same embedding model:
| System | R@5 | R@10 | NDCG@10 |
|---|---|---|---|
| agentmemory BM25+Vector | 95.2% | 98.6% | 87.9% |
| agentmemory BM25-only | 86.2% | 94.6% | 73.0% |
| mempalace raw vector | 96.6% | ~97.6% | — |
We're 1.4 pp behind on R@5 but ahead on R@10 (98.6% vs ~97.6%). BM25 adds recall depth that pure vector search misses.
## Suggestion: the two approaches are complementary
mempalace excels at "what does this codebase contain?" (static corpus search).
agentmemory excels at "what did I do across sessions?" (temporal memory).
A developer could use both:
- mempalace maps the territory (structure, relationships, communities)
- agentmemory remembers the journey (decisions, bugs, patterns learned)
Would be interesting to explore integration points — mempalace's knowledge graph feeding into agentmemory's context injection, for example.
Not trying to start a flame war — genuinely think both projects push the space forward. The benchmark methodology feedback is meant to strengthen your claims, not undermine them. Happy to discuss any of this.