Disclosure: I'm working on a different AI Memory project and an author of a public LoCoMo ground-truth audit: https://github.com/dial481/locomo-audit
The 100% LoCoMo claim is what brought this repository to my attention.
1. 100% on LoCoMo should not be achievable. The ground truth is broken.
Our audit documents ~99 wrong, hallucinated, misattributed, or ambiguous answers in the LoCoMo ground truth across all ten conversations (`errors_conv_0.json` through `errors_conv_9.json`). Examples include hallucinated objects ("symbols," "bowl") and speaker-attribution errors where the evidence dialog is spoken by the wrong character. The honest ceiling on LoCoMo as published is roughly 93–94%, not 100%.
A reported 100% therefore implies one of two things: the system is wrong in the same ways the ground truth is wrong, or the metric being reported is not reliably measuring answer correctness.
With respect to the second case, the audit documents that LoCoMo's LLM judge accepts up to ~63% of intentionally wrong answers.
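The shape of that leniency test is easy to picture. A minimal sketch, assuming an OpenAI-style judge call; the model name, prompt, and perturbation function are illustrative, not the audit's exact harness:

```python
from openai import OpenAI

# Sketch of a judge-leniency probe in the audit's spirit: feed the judge
# deliberately wrong answers and count how many it accepts. The model name
# and prompt are illustrative, not the audit's exact harness.
client = OpenAI()

def judge_accepts(question: str, gold: str, candidate: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": (
            f"Question: {question}\nGold answer: {gold}\n"
            f"Candidate answer: {candidate}\n"
            "Is the candidate answer correct? Reply yes or no."
        )}],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")

def leniency_rate(qa_pairs, perturb) -> float:
    """Fraction of intentionally wrong answers the judge still accepts.
    `perturb` maps a gold answer to a deliberately wrong one, e.g. by
    swapping the speaker or shifting a date."""
    accepted = sum(judge_accepts(q, gold, perturb(gold)) for q, gold in qa_pairs)
    return accepted / len(qa_pairs)
```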
2. The 100% is a retrieval bypass. Disclosed in this repo, stripped from the launch tweet.
`benchmarks/BENCHMARKS.md`, verbatim:
"The LoCoMo 100% result with top-k=50 has a structural issue: each of the 10 conversations has 19–32 sessions, but top-k=50 exceeds that count. This means the ground-truth session is always in the candidate pool regardless of the embedding model's ranking. The Sonnet rerank is essentially doing reading comprehension over all sessions — the embedding retrieval step is bypassed entirely."
Verified against the dataset. Session counts per conversation: 19, 19, 32, 29, 29, 28, 31, 30, 25, 30. Every conversation has fewer than 50 sessions. Setting `top_k=50` retrieves the entire conversation, so the "memory system" contributes nothing at this setting. The pipeline reduces to: dump every session into Claude Sonnet and ask Sonnet which one matches. That is `cat *.txt | claude`. It is not retrieval, and it is not memory.
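The bypass can be confirmed with arithmetic alone, using the session counts above:

```python
# Arithmetic check of the bypass: a top-k query cannot return more documents
# than exist, so top_k=50 over fewer than 50 sessions returns the whole
# conversation, and gold-session recall is trivially 100%.
session_counts = [19, 19, 32, 29, 29, 28, 31, 30, 25, 30]  # per conversation, from the dataset
TOP_K = 50

for conv, n_sessions in enumerate(session_counts):
    returned = min(TOP_K, n_sessions)
    assert returned == n_sessions  # holds for every conversation
    print(f"conv {conv}: top_k={TOP_K} returns all {n_sessions} sessions")
```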
Honest LoCoMo numbers from the same file: 60.3% R@10 with no rerank, 88.9% R@10 with hybrid v5 and no LLM. 100% should not be cited at all.
3. The LongMemEval 100% is hand-coded against three specific questions. Also disclosed in this repo.
`benchmarks/BENCHMARKS.md`, verbatim:
"This is teaching to the test. The fixes were designed around the exact failure cases, not discovered by analyzing general failure patterns."
The hybrid v4 changes were written by inspecting the three remaining wrong answers in the dev set and adding code targeted at those exact questions: a quoted-phrase boost for one question containing 'sexual compulsions' in single quotes, a person-name boost for a question about "Rachel," and "I still remember" / "when I was in high school" patterns for one question about a high school reunion. Three patches for three questions. Then the result is reported as "first perfect score on LongMemEval, 500/500."
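To make the mechanism concrete, here is a hypothetical sketch of what such question-targeted boosts look like. The function name, regexes, and weights are ours for illustration; consult the repo's hybrid v4 diff for the real logic:

```python
import re

# Hypothetical sketch of question-targeted scoring boosts in the spirit
# described above. Names, regexes, and weights are illustrative only.
def patched_score(question: str, session_text: str, base_score: float) -> float:
    score = base_score
    # Patch 1: boost sessions containing a phrase the question single-quotes.
    for phrase in re.findall(r"'([^']+)'", question):
        if phrase.lower() in session_text.lower():
            score += 2.0
    # Patch 2: boost sessions mentioning a capitalized name from the question
    # (a crude person-name detector).
    for name in re.findall(r"\b[A-Z][a-z]+\b", question):
        if name in session_text:
            score += 0.5
    # Patch 3: boost reminiscence cues taken from one specific failure case.
    for cue in ("I still remember", "when I was in high school"):
        if cue in session_text:
            score += 1.0
    return score
```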
The score on the held-out 450 questions in the same document is 98.4%. LongMemEval has no official train/dev/test split, and the 50-question dev set was carved out after the patches were written rather than before. The 98.4% is less contaminated than the 100%, but as section 4 covers, neither number is comparable to the published LongMemEval leaderboard.
4. What the 96.6% raw number actually measures, and why it is not comparable to the published LongMemEval leaderboard.
LongMemEval as published is an end-to-end QA benchmark. A system has to (a) retrieve from the haystack, (b) generate an answer, and (c) have that answer marked correct by a GPT-4 judge. Every score on the published LongMemEval leaderboard is the percentage of questions where the generated answer was judged correct.
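In code, the published task is roughly the following three-step loop (a sketch, not the official harness; the model names, prompts, and the `retrieve()` callable are placeholders):

```python
from openai import OpenAI

# Sketch of the published end-to-end task: retrieve, generate, judge.
# Only the judge's verdict on the generated answer counts toward the
# leaderboard score.
client = OpenAI()

def end_to_end_score(question: str, haystack_sessions: list[str],
                     gold_answer: str, retrieve) -> float:
    # (a) retrieve candidate sessions from the haystack
    context = "\n\n".join(retrieve(question, haystack_sessions, k=5))
    # (b) generate an answer from the retrieved context
    answer = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content":
                   f"Context:\n{context}\n\nQuestion: {question}\nAnswer briefly."}],
    ).choices[0].message.content
    # (c) a GPT-4-class judge marks the generated answer right or wrong
    verdict = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content":
                   f"Question: {question}\nReference: {gold_answer}\n"
                   f"Answer: {answer}\nIs the answer correct? Reply yes or no."}],
    ).choices[0].message.content
    return 1.0 if verdict.strip().lower().startswith("yes") else 0.0
```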
This repository's runner (`benchmarks/longmemeval_bench.py`) does step (a) only. It never generates an answer and never invokes a judge. For each of the 500 questions in the "s" variant of LongMemEval (~50 prior sessions per haystack, ~115K tokens), the runner:
- Builds one document per session by concatenating only the user turns in that session. Assistant turns are not indexed (lines 189–190).
- Embeds with default ChromaDB embeddings (`all-MiniLM-L6-v2`, 384-dim).
- Returns the top 5 sessions by cosine distance to the question text.
- Reads `answer_session_ids` from the dataset (the gold session IDs labeled by the LongMemEval authors) and checks set membership: if any one of the gold IDs appears in the top 5, the question scores 1.0. This is `recall_any@5` (line 77).
- Averages across all 500 questions.
That is the entire pipeline. The system never reads what is in the retrieved sessions. It never produces an answer. It never demonstrates that the sessions it returned actually answer the question. The dataset author labeled them, the runner checks the labels, and credit is awarded on label-set overlap.
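Condensed, that scoring loop amounts to something like the following. This is reconstructed from the description above: the dataset field names follow the LongMemEval release, and everything else is our paraphrase, not the repo's exact code:

```python
import chromadb

# Sketch of the runner's scoring loop, reconstructed from the description
# above. Dataset field names ("haystack_sessions", "answer_session_ids",
# "question_id") are assumed from the LongMemEval release.
def score_question(client, q: dict) -> tuple[float, float]:
    col = client.create_collection(f"s_{q['question_id']}")  # default all-MiniLM-L6-v2 embeddings
    for sid, session in zip(q["haystack_session_ids"], q["haystack_sessions"]):
        # one document per session, user turns only; assistant turns dropped
        user_text = " ".join(t["content"] for t in session if t["role"] == "user")
        col.add(documents=[user_text], ids=[sid])
    top5 = set(col.query(query_texts=[q["question"]], n_results=5)["ids"][0])
    gold = set(q["answer_session_ids"])
    recall_any = 1.0 if gold & top5 else 0.0   # credit for finding any one gold session
    recall_all = 1.0 if gold <= top5 else 0.0  # credit only if every gold session is in the top 5
    return recall_any, recall_all
```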
Two things make `recall_any@5` materially easier than the published LongMemEval task:
- It substitutes retrieval recall for answer correctness. The leaderboard measures generated answers judged correct; `recall_any@5` measures whether a labeled session ID appears in a top-5 list. These are different tasks. A system that retrieves perfectly and then answers wrong scores 100% under `recall_any@5` and 0% on the leaderboard. The two numbers should not be placed in the same table.
- `recall_any` is the softer of the two retrieval metrics the runner computes. The runner also computes `recall_all@5` (line 78), which requires every labeled gold session to appear in the top 5. For questions with multiple required sessions, `recall_any` gives full credit for finding one of N; `recall_all` gives credit only for finding all N. The reported number is `recall_any`.
Stripping assistant turns from the index does not reduce the number of candidate sessions (the runner still indexes one blob per session), but it does reduce noise per blob and removes a category of content from search. Whether this helps or hurts depends on the question type. The bigger lever on the score is the choice of `recall_any` over `recall_all` on multi-session questions, which roughly multiplies the odds of full credit by the number of gold sessions per question.
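A toy calculation shows how much softer `recall_any` is on multi-session questions, assuming an independent per-session hit probability, which real retrieval will not satisfy exactly:

```python
# Toy model of the gap: with an independent per-session hit probability p
# (an idealization), the two metrics diverge quickly as the number of gold
# sessions grows.
def p_recall_any(p: float, n_gold: int) -> float:
    return 1 - (1 - p) ** n_gold   # at least one gold session retrieved

def p_recall_all(p: float, n_gold: int) -> float:
    return p ** n_gold             # every gold session retrieved

for n in (1, 2, 3):
    print(f"n_gold={n}: any={p_recall_any(0.8, n):.2f}, all={p_recall_all(0.8, n):.2f}")
# n_gold=1: any=0.80, all=0.80
# n_gold=3: any=0.99, all=0.51 -- the reported metric rounds sharply upward
```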
So is the 96.6% reproducible? Yes. Is it cheating? Not in the sense of touching the test set: it is what the runner deterministically produces. Is it a perfect score on LongMemEval? No. It is `recall_any@5` over user-turn-only embeddings on the small variant, which is a substantially easier task than the published LongMemEval QA leaderboard.
Stepping back: none of the LongMemEval numbers in this repository are publishable as LongMemEval scores. LongMemEval is an end-to-end QA benchmark; this runner never generates an answer and never invokes a judge. The 100%, the 98.4%, and the 96.6% are all `recall_any@5` retrieval numbers on the LongMemEval-s haystacks. The 96.6% is reproducible and is an interesting internal retrieval result on its own terms (raw default embeddings put a gold session ID in the top 5 96.6% of the time, with no LLM in the loop), but it is not comparable to anything on the published LongMemEval QA leaderboard. Calling any of these numbers a "LongMemEval score" is a metric category error, regardless of which configuration produced them.
There is also a more fundamental question about whether session-level R@5 is a meaningful memory benchmark in 2026. Each LongMemEval-s haystack is ~115K tokens across ~50 candidate sessions. Sonnet has a 200K context window, and multiple SOTA models now have 1M-token context windows, so the haystack fits in-context with room to spare. The retrieval framing tests a constraint that may no longer exist in many deployments.
5. ConvoMem "2× Mem0" is a metric mismatch.
The 92.9% is retrieval-based ("whether retrieved context enables correct answers"). Mem0's published ConvoMem numbers are end-to-end QA accuracy. The "more than 2× Mem0" claim is comparing two different metrics on the same dataset. A like-for-like comparison would either run Mem0 through the same retrieval-recall harness or run this system through end-to-end QA with the same judge.
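One half of that like-for-like comparison could look like the sketch below. `Memory.add` and `Memory.search` exist in mem0's open-source client, but exact signatures and return shapes vary by version, and the harness-side question layout here is our assumption:

```python
from mem0 import Memory  # mem0's open-source client

# Sketch: score Mem0 under the same retrieval-recall harness. Return shapes
# vary across mem0 versions, so this is an outline to adapt, not a drop-in.
def mem0_recall_at_k(questions: list[dict], k: int = 5) -> float:
    m = Memory()
    hits = 0
    for q in questions:
        for sid, text in q["sessions"]:  # assumed harness-side layout
            m.add(text, user_id=q["conv_id"], metadata={"session_id": sid})
        results = m.search(q["question"], user_id=q["conv_id"], limit=k)
        items = results["results"] if isinstance(results, dict) else results
        retrieved = {r.get("metadata", {}).get("session_id") for r in items}
        hits += bool(retrieved & set(q["gold_session_ids"]))
    return hits / len(questions)
```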
6. Question count: 1,986 includes 446 category-5 adversarial questions that are rarely evaluated.
LoCoMo ships 1,986 QA pairs. 446 of those are category 5 (adversarial / unanswerable / often malformed). The standard convention in published LoCoMo evaluations is to report on the 1,540-question non-adversarial subset, both because category 5 has documented ground-truth problems of its own and because the expected behavior there is "the model should refuse," not memory recall. Reporting "1,986 multi-hop questions" without that disclosure departs from the convention silently and inflates the apparent dataset size.
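The split is one filter away from the shipped file. A sketch, assuming the standard `locomo10.json` layout (a list of samples, each with a `qa` list whose items carry an integer `category` field):

```python
import json

# Sketch: reproduce the 1,986 vs 1,540 split. Layout assumed from the
# shipped LoCoMo release; category 5 = adversarial/unanswerable.
with open("locomo10.json") as f:
    samples = json.load(f)

all_qa = [qa for sample in samples for qa in sample["qa"]]
cat5 = [qa for qa in all_qa if qa.get("category") == 5]
print(f"total QA pairs:         {len(all_qa)}")              # 1,986 as shipped
print(f"category-5 adversarial: {len(cat5)}")                # 446
print(f"standard eval subset:   {len(all_qa) - len(cat5)}")  # 1,540
```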
7. "100% beats every product, no API key, no cloud" is two configurations welded together.
Both 100% scores in this repository require a paid Claude API call (Haiku for LongMemEval, Sonnet for LoCoMo). The "no API key, no cloud, runs locally" description belongs to the no-LLM mode, whose actual numbers are 96.6% R@5 on LongMemEval and 60.3% R@10 on LoCoMo. The launch post takes the score from the LLM mode and the deployment story from the no-LLM mode.
8. AAAK "30× lossless compression, any LLM reads natively" has no eval in this repo.
We could not find a round-trip evaluation (compress → decompress → measure information retention) anywhere in `benchmarks/` or `tests/`. Lossless 30× compression of natural language runs up against information-theoretic limits for English text. "Any LLM reads it natively" is a cross-model generalization claim that would require evaluation on multiple model families. If those evals exist, please link them. Otherwise a more appropriate claim might be "structured shorthand," not "lossless compression."
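The missing eval is small. A sketch of one reasonable design; `aaak_compress`, `aaak_decompress`, and `answer_fn` are hypothetical stand-ins, since we could not locate the real AAAK entry points:

```python
# Sketch of a round-trip evaluation. QA over the decompressed text is one
# reasonable retention probe, not the only one; all callables are
# hypothetical stand-ins for the repo's actual AAAK functions.
def round_trip_eval(qa_pairs, aaak_compress, aaak_decompress, answer_fn):
    ratios, retained = [], 0
    for question, gold_answer, source_text in qa_pairs:
        compressed = aaak_compress(source_text)
        restored = aaak_decompress(compressed)
        ratios.append(len(source_text) / max(len(compressed), 1))  # crude char-level ratio
        retained += answer_fn(question, restored) == gold_answer   # info survives iff QA still passes
    print(f"mean compression ratio: {sum(ratios) / len(ratios):.1f}x")
    print(f"information retention:  {retained / len(qa_pairs):.1%}")
    # A genuinely lossless claim would additionally require restored == source_text
    # (exact reconstruction), which is the bar the 30x figure has to clear.
```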
Summary
| Claim in launch post | What the repo actually shows |
| --- | --- |
| 100% on LoCoMo, every category | 60.3% no rerank / 88.9% no-LLM hybrid; the 100% is top-k > session count, retrieval bypassed (disclosed line 498 of BENCHMARKS.md) |
| First perfect score on LongMemEval | None of the LongMemEval numbers in this repo are LongMemEval scores: the runner measures `recall_any@5` retrieval and never generates or judges an answer. The 100% is also three hand-coded patches for three specific questions (disclosed line 461) |
| Beats every product, no API key | Both 100% scores require paid Claude API calls |
| 2× Mem0 on ConvoMem | Retrieval recall vs. Mem0's end-to-end QA accuracy; different metrics |
| 1,986 multi-hop questions | 1,540 non-adversarial + 446 cat-5 adversarial that other evaluators exclude |
| 30× lossless compression | No round-trip eval in the repository |
| 100% on LoCoMo (separately) | Should not be achievable: the ground truth has ~99 documented errors per our audit; honest ceiling ~93–94% |
For reference, here is the launch post in full, quoted verbatim:

> Ben Sigman (@bensig):
>
> My friend Milla Jovovich and I spent months creating an AI memory system with Claude. It just posted a perfect score on the standard benchmark - beating every product in the space, free or paid.
>
> It's called MemPalace, and it works nothing like anything else out there.
>
> Instead of sending your data to a background agent in the cloud, it mines your conversations locally and organizes them into a palace - a structured architecture with wings, halls, and rooms that mirrors how human memory actually works.
>
> Here is what that gets you:
>
> → Your AI knows who you are before you type a single word - family, projects, preferences, loaded in ~120 tokens
> → Palace architecture organizes memories by domain and type - not a flat list of facts, a navigable structure
> → Semantic search across months of conversations finds the answer in position 1 or 2
> → AAAK compression fits your entire life context into 120 tokens - 30x lossless compression any LLM reads natively
> → Contradiction detection catches wrong names, wrong pronouns, wrong ages before you ever see them
>
> The benchmarks:
>
> 100% recall on LongMemEval — first perfect score ever recorded. 500/500 questions. Every question type at 100%.
> 92.9% on ConvoMem — more than 2x Mem0's score.
> 100% on LoCoMo — every multi-hop reasoning category, including temporal inference which stumps most systems.
>
> No API key. No cloud. No subscription. One dependency. Runs on your machine. Your memories never leave.
>
> MIT License. 100% Open Source.