
bench: add BEAM 100K benchmark (end-to-end answer quality)#168

Open
rohithzr wants to merge 7 commits into MemPalace:develop from rohithzr:bench/beam-100k

Conversation


rohithzr commented Apr 7, 2026

What this adds

BEAM 100K benchmark runner (benchmarks/beam_100k_bench.py) and a standalone data conversion script (benchmarks/convert_beam.py).

This is the first benchmark in this repo that measures end-to-end answer quality, not retrieval recall.

How it works

Standard RAG evaluation using MemPalace's own patterns:

  1. Auto-downloads BEAM parquet from HuggingFace on first run, converts to JSON, caches locally
  2. Ingests all user messages into chromadb.EphemeralClient() (same as longmemeval_bench.py)
  3. Top-K retrieval per question via ChromaDB semantic search
  4. LLM synthesizes an answer from retrieved chunks
  5. Scores each rubric item using the official BEAM judge prompt (3-tier: 1.0 / 0.5 / 0.0)

Follows the same code patterns as existing benchmarks: _fresh_collection(), _make_embed_fn(), _bench_client, argparse conventions, JSONL debug output.
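
As a rough illustration of steps 2-5, here is a minimal, self-contained sketch of the per-question loop. The call_llm and judge_rubric_item callables and the data shapes are placeholders, not the exact helpers in beam_100k_bench.py.

    # Minimal sketch of the ingest -> retrieve -> synthesize -> judge loop.
    # call_llm / judge_rubric_item are illustrative placeholders.
    from typing import Callable

    import chromadb

    def run_question(
        user_messages: list[str],
        question: str,
        rubric_items: list[str],
        call_llm: Callable[[str], str],
        judge_rubric_item: Callable[[str, str, str], float],
        top_k: int = 10,
    ) -> list[float]:
        client = chromadb.EphemeralClient()
        collection = client.create_collection("beam_conversation")

        # Ingest all user messages from the conversation history.
        collection.add(
            ids=[f"msg-{i}" for i in range(len(user_messages))],
            documents=user_messages,
        )

        # Top-K retrieval for the question via semantic search.
        hits = collection.query(query_texts=[question], n_results=top_k)
        context = "\n".join(hits["documents"][0])

        # LLM synthesizes an answer from the retrieved chunks only.
        answer = call_llm(
            f"Answer using only this context:\n{context}\n\nQuestion: {question}"
        )

        # Score each rubric item on the 3-tier scale (1.0 / 0.5 / 0.0).
        return [judge_rubric_item(question, answer, item) for item in rubric_items]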

What BEAM tests that existing benchmarks don't

BEAM evaluates 10 distinct memory abilities across 20 conversations and 400 questions:

Ability                    What it tests
Information Extraction     Factual recall from conversation history
Multi-Session Reasoning    Combining information across sessions
Temporal Reasoning         Date/duration calculations
Contradiction Resolution   Detecting conflicting statements
Knowledge Update           Tracking changed information
Preference Following       Respecting stated preferences
Instruction Following      Format/style adherence
Event Ordering             Chronological reconstruction
Summarization              Comprehensive synthesis
Abstention                 Refusing unanswerable questions

Contradiction resolution, summarization, event ordering, and abstention are not covered by any existing benchmark in this repo.

Running it

pip install openai pandas pyarrow

# Set LLM credentials
export OPENAI_API_KEY=...
# or Azure:
export AZURE_OPENAI_API_KEY=...
export AZURE_OPENAI_ENDPOINT=...

# Single conversation (quick test, ~2 min, auto-downloads dataset)
python benchmarks/beam_100k_bench.py

# Full 20-conversation run (~40 min)
python benchmarks/beam_100k_bench.py --full

# With pre-downloaded dataset
python benchmarks/beam_100k_bench.py data/beam-100k.json --full

# With alternative embeddings
python benchmarks/beam_100k_bench.py --embed-model bge-large --full

Results (full run, 20 conversations)

Config: ChromaDB default embeddings (all-MiniLM-L6-v2), top-10 retrieval, GPT-5.4-mini synthesis + judge, temperature 0.0.

Overall: 515/1051 rubric checks passed (49.0%)

Ability                    Passed   Total   Accuracy
Preference Following           59      74       80%
Abstention                     28      40       70%
Temporal Reasoning             52      75       69%
Instruction Following          39      59       66%
Multi-Session Reasoning        63     101       62%
Information Extraction         53      92       58%
Knowledge Update               19      42       45%
Contradiction Resolution       64     160       40%
Summarization                  69     195       35%
Event Ordering                 69     213       32%

Abilities that reduce to "find the right chunk" score well. Abilities requiring cross-chunk reasoning (contradiction, summarization, event ordering) score lower, which is expected for a retrieval-only architecture.

Files

  • benchmarks/beam_100k_bench.py (728 lines) - benchmark runner with auto-download
  • benchmarks/convert_beam.py (129 lines) - standalone parquet-to-JSON converter
  • .gitignore - added benchmarks/.beam_cache/

Context: #125

End-to-end answer quality evaluation using the BEAM benchmark
(Tavakoli et al., 2024). Tests 10 memory abilities across 20
conversations and 400 questions with the official 3-tier rubric
judge (1.0 / 0.5 / 0.0).

Pipeline: ChromaDB retrieval -> LLM synthesis -> BEAM rubric judge.
Follows the same patterns as existing LongMemEval and LoCoMo benchmarks
(EphemeralClient, _fresh_collection, _make_embed_fn, argparse conventions).

Auto-downloads the BEAM parquet from HuggingFace on first run,
converts to JSON, and caches locally. Also includes convert_beam.py
for standalone manual conversion.

Requires: pip install openai pandas pyarrow
rohithzr added 2 commits April 8, 2026 07:32
Adds --mode hybrid (keyword overlap re-ranking, same logic as
longmemeval_bench.py build_palace_and_retrieve_hybrid) and
--llm-rerank (Claude Haiku reranking, same logic as
longmemeval_bench.py llm_rerank).

Results across all three modes on full 20-conversation run:
  raw:                  49.0% (515/1051)
  hybrid:               43.0% (452/1051)
  hybrid + haiku rerank: 43.6% (458/1051)
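
(For context, keyword-overlap re-ranking of this kind is a simple lexical score layered over the vector hits; a rough sketch follows, not the exact logic from longmemeval_bench.py.)

    # Rough sketch of keyword-overlap re-ranking over vector-search hits;
    # illustrative only, not the exact build_palace_and_retrieve_hybrid logic.
    import re

    def rerank_by_keyword_overlap(query: str, documents: list[str], top_k: int = 10) -> list[str]:
        query_terms = set(re.findall(r"[a-z0-9]+", query.lower()))

        def overlap(doc: str) -> int:
            return len(query_terms & set(re.findall(r"[a-z0-9]+", doc.lower())))

        # Stable sort keeps the original (vector-similarity) order for ties.
        return sorted(documents, key=overlap, reverse=True)[:top_k]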
Adds --mode aaak (compress messages through mempalace.dialect.Dialect
before indexing) and --aaak-spec (include AAAK dialect specification
from mcp_server.py in the synthesis prompt).

AAAK is a novel approach to context compression. This implementation
is based on the code in dialect.py and the AAAK_SPEC in mcp_server.py.
If there are additional prompting strategies for helping the LLM read
AAAK format, those could improve these numbers.

Results across all five modes on full 20-conversation run:
  raw (k=10):                   49.0% (515/1051)
  hybrid (k=10):                43.0% (452/1051)
  hybrid + haiku rerank (k=10): 43.6% (458/1051)
  aaak (k=50):                  26.2% (275/1051)
  aaak + spec (k=50):           27.9% (293/1051)

bgauryy commented Apr 8, 2026

PR Review: bench: add BEAM 100K benchmark (end-to-end answer quality)

Executive Summary

Aspect          Value
PR Goal         Add BEAM 100K benchmark runner evaluating MemPalace as a RAG memory backend for end-to-end answer quality
Files Changed   3 (+1060 / -0)
Risk Level      🟢 LOW - New benchmark files, no production code touched
Review Effort   3 - Moderate (930-line benchmark + 129-line converter, heavy code duplication)
Recommendation  🔄 REQUEST_CHANGES

Affected Areas: benchmarks/ (new files), .gitignore (cache path)

Business Impact: Adds the first end-to-end answer quality benchmark (vs. retrieval-only). Enables measuring MemPalace RAG quality against BEAM's official rubric judge.

Flow Changes: None — purely additive. No changes to core MemPalace code.

Ratings

Aspect           Score
Correctness      4/5
Security         3/5
Performance      3/5
Maintainability  2/5

PR Health

  • Has clear description
  • References ticket/issue (if applicable)
  • Appropriate size (or justified if large)
  • Has relevant tests (if applicable)

High Priority Issues

🔗 #1: Massive code duplication between bench and converter

Location: benchmarks/convert_beam.py:1-129 vs benchmarks/beam_100k_bench.py:63-175 | Confidence: ✅ HIGH

The entire convert_beam.py (129 lines) is a near-verbatim copy of _convert_parquet_to_json() + surrounding helpers from beam_100k_bench.py. The chat-turn parsing, time-anchor cleaning, question extraction, and output schema are duplicated line-for-line. Any future bug fix in parsing logic must be applied in two places.

Fix: Have convert_beam.py import and call the functions from beam_100k_bench.py, or extract the shared conversion logic into a benchmarks/_beam_utils.py module that both files import.

- # convert_beam.py: 100+ lines of duplicated parsing logic
+ import sys
+
+ from beam_100k_bench import _convert_parquet_to_json, _download_beam_parquet
+
+ def main():
+     input_file = sys.argv[1] if len(sys.argv) > 1 else "data/beam-100k.parquet"
+     output_file = sys.argv[2] if len(sys.argv) > 2 else "data/beam-100k.json"
+     _convert_parquet_to_json(input_file, output_file)

Medium Priority Issues

🐛 #2: Judge score double-quantization may mask LLM output issues

Location: benchmarks/beam_100k_bench.py:358-366 | Confidence: ⚠️ MED

The BEAM judge prompt instructs the LLM to return exactly 1.0, 0.5, or 0.0. But judge_rubric() re-thresholds the parsed score (>=0.75 → 1.0, >=0.25 → 0.5, else 0.0). If the LLM correctly returns 0.5, this survives. But if it returns an unexpected value like 0.7 (partial-ish), the double-quantization silently rounds it to 1.0 — inflating scores. The BEAM paper expects the judge to return only the three canonical values.

  try:
      parsed = json.loads(response)
      score = float(parsed.get("score", 0))
-     if score >= 0.75:
-         return 1.0
-     elif score >= 0.25:
-         return 0.5
-     else:
-         return 0.0
+     # BEAM rubric judge should return exactly 1.0, 0.5, or 0.0
+     if score in (1.0, 0.5, 0.0):
+         return score
+     # Fallback for unexpected values: snap to nearest tier
+     if score >= 0.75:
+         return 1.0
+     elif score >= 0.25:
+         return 0.5
+     return 0.0

🚨 #3: SSL verification globally disabled at import time

Location: benchmarks/beam_100k_bench.py:52 | Confidence: ⚠️ MED

ssl._create_default_https_context = ssl._create_unverified_context disables SSL certificate verification process-wide, not just for the BEAM download. Any subsequent HTTPS call (OpenAI API, Anthropic API) in the same process will skip cert verification. This is a known pattern from convomem_bench.py, but the LLM API calls in this benchmark make the attack surface wider.

Consider scoping the bypass to just the download function:

- ssl._create_default_https_context = ssl._create_unverified_context
  ...
  def _download_beam_parquet(cache_dir):
+     ctx = ssl.create_default_context()
+     ctx.check_hostname = False
+     ctx.verify_mode = ssl.CERT_NONE
      ...
-     urllib.request.urlretrieve(HF_BEAM_URL, parquet_path)
+     with urllib.request.urlopen(HF_BEAM_URL, context=ctx) as resp:
+         with open(parquet_path, "wb") as f:
+             f.write(resp.read())

#4: Module-level chromadb.EphemeralClient() executes on import

Location: benchmarks/beam_100k_bench.py:249 | Confidence: ⚠️ MED

_bench_client = chromadb.EphemeralClient() runs at import time, creating a ChromaDB client even if you only import a helper function. This matches longmemeval_bench.py's pattern, so it's consistent — but worth noting that it means import beam_100k_bench has side effects (allocates an in-memory DB).

Consider lazy initialization:

- _bench_client = chromadb.EphemeralClient()
+ _bench_client = None
+
+ def _get_client():
+     global _bench_client
+     if _bench_client is None:
+         _bench_client = chromadb.EphemeralClient()
+     return _bench_client

Low Priority Issues

🎨 #5: Hardcoded model version strings will go stale

Location: benchmarks/beam_100k_bench.py:36,211,395 | Confidence: ⚠️ MED

The default models are gpt-5.4-mini (Azure) and claude-haiku-4-5-20251001 (reranker). These will go stale as new model versions are released. Other benchmarks in the repo have the same pattern, but centralizing model defaults (or documenting them in BENCHMARKS.md) would help maintainability.
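
One lightweight option, sketched below, is to read the defaults from the env vars the benchmark already honors, in a single shared module (the module name and the RERANK_MODEL variable are hypothetical).

    # Sketch of a shared defaults module (e.g. benchmarks/_model_defaults.py);
    # the module name and RERANK_MODEL env var are hypothetical.
    import os

    DEFAULT_CHAT_MODEL = os.environ.get(
        "AZURE_OPENAI_CHAT_MODEL",
        os.environ.get("OPENAI_CHAT_MODEL", "gpt-5.4-mini"),
    )
    DEFAULT_RERANK_MODEL = os.environ.get("RERANK_MODEL", "claude-haiku-4-5-20251001")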


🎨 #6: No-rubric fallback scoring is coarse

Location: benchmarks/beam_100k_bench.py:447-454 | Confidence: ⚠️ MED

When a question has no rubric items, the code scores pass/fail based on len(answer) > 50. This arbitrary character threshold means a 51-character hallucination scores identically to a perfect answer. Consider at minimum logging a warning that unscored questions were encountered, or using the reference answer for a simple string comparison.
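
A small sketch of that suggestion (function and parameter names are hypothetical): warn on the empty-rubric case and return None so the caller can exclude the question from totals.

    # Sketch of a safer no-rubric path: warn and exclude from totals instead
    # of passing on answer length. Names here are hypothetical.
    import logging
    from typing import Callable, Optional

    logger = logging.getLogger("beam_100k_bench")

    def score_question(
        answer: str,
        rubric_items: list[dict],
        judge: Callable[[str, dict], float],
    ) -> Optional[float]:
        if not rubric_items:
            logger.warning("question has no rubric items; excluding from totals")
            return None
        scores = [judge(answer, item) for item in rubric_items]
        return sum(scores) / len(scores)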


Flow Impact Analysis

No existing code paths are modified. The PR is purely additive:

benchmarks/beam_100k_bench.py (NEW)
  └── uses: chromadb.EphemeralClient (direct, not palace_db.py — consistent with other benchmarks)
  └── uses: mempalace.dialect.Dialect (optional AAAK mode)
  └── uses: openai.AzureOpenAI / openai.OpenAI (synthesis + judging)
  └── uses: Anthropic API via urllib (optional LLM reranking)

benchmarks/convert_beam.py (NEW)
  └── standalone CLI script, no MemPalace imports

.gitignore
  └── adds: benchmarks/.beam_cache/

Created by Octocode MCP https://octocode.ai 🔍🐙

rohithzr added 2 commits April 9, 2026 11:57
Three issues from the PR review have been fixed. Three were skipped
as false positives or matching existing repo patterns.

Fixed:

1. Code duplication between bench and converter (HIGH).
   Extracted shared parsing logic into benchmarks/_beam_utils.py.
   - convert_beam.py is now a 38-line wrapper.
   - beam_100k_bench.py imports from _beam_utils.
   - Net 54 lines removed, parsing logic exists in one place.

2. SSL verification scoped to BEAM download only (MED).
   Removed module-level ssl._create_default_https_context bypass that
   disabled cert verification process-wide. SSL is now skipped only
   inside download_beam_parquet() via a per-request SSLContext, so
   LLM API calls (OpenAI, Anthropic) keep default cert verification.

3. Removed misleading no-rubric fallback (LOW).
   The previous fallback scored answers as PASS based on len(answer) > 50,
   which would have inflated scores on hallucinated answers if a question
   ever shipped without a rubric. BEAM 100K has 0/400 such questions, but
   the new behavior prints [SKIP] and excludes the question from totals.

Skipped (with rationale):

- Judge double-quantization: intentional, matches canonical BEAM Rust
  implementation. LLMs occasionally return non-canonical scores like 0.7
  and snapping to nearest tier is correct BEAM methodology.
- Lazy ChromaDB init: matches existing pattern in longmemeval_bench.py
  and convomem_bench.py. Changing one file breaks consistency.
- Hardcoded model strings: already overridable via AZURE_OPENAI_CHAT_MODEL
  env var and --llm-model CLI flag.

rohithzr commented Apr 9, 2026

Thanks for the review @bgauryy. Pushed a follow-up commit (8dfae0d) addressing the issues that hold up under inspection. Going through each finding:

Fixed

Code duplication (HIGH). Real issue. Extracted shared parsing logic into benchmarks/_beam_utils.py. convert_beam.py is now a 38-line wrapper that calls convert_parquet_to_json(). beam_100k_bench.py imports ensure_beam_dataset() from the same module. Net 54 lines removed, parsing logic exists in one place. Future bug fixes only need to land in _beam_utils.py.

SSL verification scoped (MED). Real issue. Removed the module-level ssl._create_default_https_context = ssl._create_unverified_context bypass that disabled certificate verification process-wide. SSL is now skipped only inside download_beam_parquet() via a per-request SSLContext, so LLM API calls (OpenAI, Anthropic) keep default certificate verification. This is a tighter scope than the existing convomem_bench.py pattern, intentionally.

No-rubric fallback (LOW). Real issue. The previous fallback scored answers as PASS based on len(answer) > 50, which would have inflated scores on hallucinated 51-character answers if a question shipped without a rubric. Verified 0/400 BEAM questions trigger this branch, but the new behavior prints [SKIP] and excludes the question from totals. Safer if a future dataset variant ever ships unscored questions.

Skipped (with rationale)

Judge double-quantization. This one looks like a bug but isn't. The threshold snapping is intentional and matches the canonical BEAM Rust reference implementation in Karta's beam_100k.rs. LLMs occasionally return non-canonical scores like 0.7 despite the prompt asking for exactly 1.0/0.5/0.0, and snapping to the nearest tier is the standard BEAM normalization. The proposed "fix" produces identical behavior in the well-behaved case and the same fallback in the misbehaved case, so it would just add code without changing scores.

Lazy ChromaDB initialization. Matches the existing pattern in longmemeval_bench.py and convomem_bench.py, both of which create the EphemeralClient at module load. Changing it for one file breaks consistency without solving a real problem (the benchmark scripts are entry points, not libraries that get imported as helpers).

Hardcoded model strings. Already overridable via AZURE_OPENAI_CHAT_MODEL env var, OPENAI_CHAT_MODEL env var, and --llm-model CLI flag. The hardcoded values are documented defaults, not constraints.

The branch is also synced with origin/main (75 commits, no conflicts).


bgauryy commented Apr 9, 2026

Thanks for the follow-up @rohithzr
I verified each claim against the code on the branch. The three fixes are solid and the commit is clean. Quick note on one point:

ChromaDB lazy init

You mention this matches longmemeval_bench.py and convomem_bench.py, both creating EphemeralClient at module load. That's accurate for longmemeval_bench.py, but convomem_bench.py actually uses chromadb.PersistentClient(path=...) inside a function (~line 181) -- which is the lazy pattern I originally suggested.

Not a blocker -- these scripts are entry points, not importable libraries, so the practical risk is low. Just wanted to correct the record since the consistency argument only holds for one of the two benchmarks cited.

Everything else checks out. The Karta reference impl confirms the judge snapping, SSL scoping is clean, and the _beam_utils.py extraction is well done. LGTM.

Module-level chromadb.EphemeralClient() created the client at import
time, which gave `import beam_100k_bench` an unwanted side effect.
Wrapped it in _get_bench_client() so the client is constructed on
first use instead.

Earlier comment claimed both longmemeval_bench.py and convomem_bench.py
created their client at module load. Only longmemeval_bench.py does that.
convomem_bench.py creates a fresh PersistentClient inside its evaluation
function (the lazy pattern), so the consistency argument was wrong.

rohithzr commented Apr 9, 2026

You're right and I was wrong. I checked the file: convomem_bench.py has used chromadb.PersistentClient(path=palace_path) inside its per-evaluation function since the original benchmark commit (0f8fa8c). It was never module-level. I pattern-matched from my local experiments instead of checking the actual code, and the consistency claim only ever held for longmemeval_bench.py.

Pushed 475d3fb with the lazy init pattern. _bench_client is now None at module load and gets constructed on first call to _get_bench_client(). Verified import beam_100k_bench has no side effects:

_bench_client at import time: None
_bench_client after _get_bench_client(): Client
cached on second call: True

Thanks for the correction. This is a better pattern, the only reason I argued against it was a wrong factual premise.


web3guru888 left a comment


📊 Review of #168 - bench: add BEAM 100K benchmark (end-to-end answer quality)

Scope: +1020/−0 · 4 file(s)

  • .gitignore (modified: +1/−0)
  • benchmarks/_beam_utils.py (added: +190/−0)
  • benchmarks/beam_100k_bench.py (added: +791/−0)
  • benchmarks/convert_beam.py (added: +38/−0)

Technical Analysis

  • 🔤 Embedding model configuration — verify dimensionality compatibility with existing ChromaDB collections
  • 🪟 Windows compatibility — verify path handling works cross-platform

Issues

  • 🔒 eval() — arbitrary code execution; use ast.literal_eval() or json.loads()

🔴 Changes requested — security concern(s) must be addressed before merge.


🏛️ Reviewed by MemPalace-AGI · Autonomous research system with perfect memory · Showcase: Truth Palace of Atlantis

bensig changed the base branch from main to develop on April 11, 2026, 22:23
igorls added the area/ci (CI/CD and workflows) and performance (Performance improvements) labels on Apr 14, 2026