
bench: add BEAM 100K benchmark (end-to-end answer quality)#168

Open
rohithzr wants to merge 7 commits into MemPalace:develop from rohithzr:bench/beam-100k

Conversation


rohithzr commented Apr 7, 2026

What this adds

BEAM 100K benchmark runner (benchmarks/beam_100k_bench.py) and a standalone data conversion script (benchmarks/convert_beam.py).

This is the first benchmark in this repo that measures end-to-end answer quality, not retrieval recall.

How it works

Standard RAG evaluation using MemPalace's own patterns:

  1. Auto-downloads BEAM parquet from HuggingFace on first run, converts to JSON, caches locally
  2. Ingests all user messages into chromadb.EphemeralClient() (same as longmemeval_bench.py)
  3. Top-K retrieval per question via ChromaDB semantic search
  4. LLM synthesizes an answer from retrieved chunks
  5. Scores each rubric item using the official BEAM judge prompt (3-tier: 1.0 / 0.5 / 0.0)

Follows the same code patterns as existing benchmarks: _fresh_collection(), _make_embed_fn(), _bench_client, argparse conventions, JSONL debug output.
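
As a rough illustration of steps 2-5, here is a minimal, self-contained sketch of the per-question loop. The call_llm and judge_rubric_item callables and the data shapes are placeholders, not the exact helpers in beam_100k_bench.py.

    # Minimal sketch of the ingest -> retrieve -> synthesize -> judge loop.
    # call_llm / judge_rubric_item are illustrative placeholders.
    from typing import Callable

    import chromadb

    def run_question(
        user_messages: list[str],
        question: str,
        rubric_items: list[str],
        call_llm: Callable[[str], str],
        judge_rubric_item: Callable[[str, str, str], float],
        top_k: int = 10,
    ) -> list[float]:
        client = chromadb.EphemeralClient()
        collection = client.create_collection("beam_conversation")

        # Ingest all user messages from the conversation history.
        collection.add(
            ids=[f"msg-{i}" for i in range(len(user_messages))],
            documents=user_messages,
        )

        # Top-K retrieval for the question via semantic search.
        hits = collection.query(query_texts=[question], n_results=top_k)
        context = "\n".join(hits["documents"][0])

        # LLM synthesizes an answer from the retrieved chunks only.
        answer = call_llm(
            f"Answer using only this context:\n{context}\n\nQuestion: {question}"
        )

        # Score each rubric item on the 3-tier scale (1.0 / 0.5 / 0.0).
        return [judge_rubric_item(question, answer, item) for item in rubric_items]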

What BEAM tests that existing benchmarks don't

BEAM evaluates 10 distinct memory abilities across 20 conversations and 400 questions:

Ability                    What it tests
Information Extraction     Factual recall from conversation history
Multi-Session Reasoning    Combining information across sessions
Temporal Reasoning         Date/duration calculations
Contradiction Resolution   Detecting conflicting statements
Knowledge Update           Tracking changed information
Preference Following       Respecting stated preferences
Instruction Following      Format/style adherence
Event Ordering             Chronological reconstruction
Summarization              Comprehensive synthesis
Abstention                 Refusing unanswerable questions

Contradiction resolution, summarization, event ordering, and abstention are not covered by any existing benchmark in this repo.

Running it

pip install openai pandas pyarrow

# Set LLM credentials
export OPENAI_API_KEY=...
# or Azure:
export AZURE_OPENAI_API_KEY=...
export AZURE_OPENAI_ENDPOINT=...

# Single conversation (quick test, ~2 min, auto-downloads dataset)
python benchmarks/beam_100k_bench.py

# Full 20-conversation run (~40 min)
python benchmarks/beam_100k_bench.py --full

# With pre-downloaded dataset
python benchmarks/beam_100k_bench.py data/beam-100k.json --full

# With alternative embeddings
python benchmarks/beam_100k_bench.py --embed-model bge-large --full

Results (full run, 20 conversations)

Config: ChromaDB default embeddings (all-MiniLM-L6-v2), top-10 retrieval, GPT-5.4-mini synthesis + judge, temperature 0.0.

Overall: 515/1051 rubric checks passed (49.0%)

Ability                    Passed   Total   Accuracy
Preference Following           59      74       80%
Abstention                     28      40       70%
Temporal Reasoning             52      75       69%
Instruction Following          39      59       66%
Multi-Session Reasoning        63     101       62%
Information Extraction         53      92       58%
Knowledge Update               19      42       45%
Contradiction Resolution       64     160       40%
Summarization                  69     195       35%
Event Ordering                 69     213       32%

Abilities that reduce to "find the right chunk" score well. Abilities requiring cross-chunk reasoning (contradiction, summarization, event ordering) score lower, which is expected for a retrieval-only architecture.

Files

  • benchmarks/beam_100k_bench.py (728 lines) - benchmark runner with auto-download
  • benchmarks/convert_beam.py (129 lines) - standalone parquet-to-JSON converter
  • .gitignore - added benchmarks/.beam_cache/

Context: #125

End-to-end answer quality evaluation using the BEAM benchmark
(Tavakoli et al., 2024). Tests 10 memory abilities across 20
conversations and 400 questions with the official 3-tier rubric
judge (1.0 / 0.5 / 0.0).

Pipeline: ChromaDB retrieval -> LLM synthesis -> BEAM rubric judge.
Follows the same patterns as existing LongMemEval and LoCoMo benchmarks
(EphemeralClient, _fresh_collection, _make_embed_fn, argparse conventions).

Auto-downloads the BEAM parquet from HuggingFace on first run,
converts to JSON, and caches locally. Also includes convert_beam.py
for standalone manual conversion.

Requires: pip install openai pandas pyarrow
rohithzr added 2 commits April 8, 2026 07:32
Adds --mode hybrid (keyword overlap re-ranking, same logic as
longmemeval_bench.py build_palace_and_retrieve_hybrid) and
--llm-rerank (Claude Haiku reranking, same logic as
longmemeval_bench.py llm_rerank).

Results across all three modes on full 20-conversation run:
  raw:                  49.0% (515/1051)
  hybrid:               43.0% (452/1051)
  hybrid + haiku rerank: 43.6% (458/1051)
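
(For context, keyword-overlap re-ranking of this kind is a simple lexical score layered over the vector hits; a rough sketch follows, not the exact logic from longmemeval_bench.py.)

    # Rough sketch of keyword-overlap re-ranking over vector-search hits;
    # illustrative only, not the exact build_palace_and_retrieve_hybrid logic.
    import re

    def rerank_by_keyword_overlap(query: str, documents: list[str], top_k: int = 10) -> list[str]:
        query_terms = set(re.findall(r"[a-z0-9]+", query.lower()))

        def overlap(doc: str) -> int:
            return len(query_terms & set(re.findall(r"[a-z0-9]+", doc.lower())))

        # Stable sort keeps the original (vector-similarity) order for ties.
        return sorted(documents, key=overlap, reverse=True)[:top_k]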
Adds --mode aaak (compress messages through mempalace.dialect.Dialect
before indexing) and --aaak-spec (include AAAK dialect specification
from mcp_server.py in the synthesis prompt).

AAAK is a novel approach to context compression. This implementation
is based on the code in dialect.py and the AAAK_SPEC in mcp_server.py.
If there are additional prompting strategies for helping the LLM read
AAAK format, those could improve these numbers.

Results across all five modes on full 20-conversation run:
  raw (k=10):                   49.0% (515/1051)
  hybrid (k=10):                43.0% (452/1051)
  hybrid + haiku rerank (k=10): 43.6% (458/1051)
  aaak (k=50):                  26.2% (275/1051)
  aaak + spec (k=50):           27.9% (293/1051)

bgauryy commented Apr 8, 2026

PR Review: bench: add BEAM 100K benchmark (end-to-end answer quality)

Executive Summary

Aspect          Value
PR Goal         Add BEAM 100K benchmark runner evaluating MemPalace as a RAG memory backend for end-to-end answer quality
Files Changed   3 (+1060 / -0)
Risk Level      🟢 LOW - New benchmark files, no production code touched
Review Effort   3 - Moderate (930-line benchmark + 129-line converter, heavy code duplication)
Recommendation  🔄 REQUEST_CHANGES

Affected Areas: benchmarks/ (new files), .gitignore (cache path)

Business Impact: Adds the first end-to-end answer quality benchmark (vs. retrieval-only). Enables measuring MemPalace RAG quality against BEAM's official rubric judge.

Flow Changes: None — purely additive. No changes to core MemPalace code.

Ratings

Aspect           Score
Correctness      4/5
Security         3/5
Performance      3/5
Maintainability  2/5

PR Health

  • Has clear description
  • References ticket/issue (if applicable)
  • Appropriate size (or justified if large)
  • Has relevant tests (if applicable)

High Priority Issues

🔗 #1: Massive code duplication between bench and converter

Location: benchmarks/convert_beam.py:1-129 vs benchmarks/beam_100k_bench.py:63-175 | Confidence: ✅ HIGH

The entire convert_beam.py (129 lines) is a near-verbatim copy of _convert_parquet_to_json() + surrounding helpers from beam_100k_bench.py. The chat-turn parsing, time-anchor cleaning, question extraction, and output schema are duplicated line-for-line. Any future bug fix in parsing logic must be applied in two places.

Fix: Have convert_beam.py import and call the functions from beam_100k_bench.py, or extract the shared conversion logic into a benchmarks/_beam_utils.py module that both files import.

- # convert_beam.py: 100+ lines of duplicated parsing logic
+ import sys
+
+ from beam_100k_bench import _convert_parquet_to_json, _download_beam_parquet
+
+ def main():
+     input_file = sys.argv[1] if len(sys.argv) > 1 else "data/beam-100k.parquet"
+     output_file = sys.argv[2] if len(sys.argv) > 2 else "data/beam-100k.json"
+     _convert_parquet_to_json(input_file, output_file)

Medium Priority Issues

🐛 #2: Judge score double-quantization may mask LLM output issues

Location: benchmarks/beam_100k_bench.py:358-366 | Confidence: ⚠️ MED

The BEAM judge prompt instructs the LLM to return exactly 1.0, 0.5, or 0.0. But judge_rubric() re-thresholds the parsed score (>=0.75 → 1.0, >=0.25 → 0.5, else 0.0). If the LLM correctly returns 0.5, this survives. But if it returns an unexpected value like 0.7 (partial-ish), the double-quantization silently rounds it to 1.0 — inflating scores. The BEAM paper expects the judge to return only the three canonical values.

  try:
      parsed = json.loads(response)
      score = float(parsed.get("score", 0))
-     if score >= 0.75:
-         return 1.0
-     elif score >= 0.25:
-         return 0.5
-     else:
-         return 0.0
+     # BEAM rubric judge should return exactly 1.0, 0.5, or 0.0
+     if score in (1.0, 0.5, 0.0):
+         return score
+     # Fallback for unexpected values: snap to nearest tier
+     if score >= 0.75:
+         return 1.0
+     elif score >= 0.25:
+         return 0.5
+     return 0.0

🚨 #3: SSL verification globally disabled at import time

Location: benchmarks/beam_100k_bench.py:52 | Confidence: ⚠️ MED

ssl._create_default_https_context = ssl._create_unverified_context disables SSL certificate verification process-wide, not just for the BEAM download. Any subsequent HTTPS call (OpenAI API, Anthropic API) in the same process will skip cert verification. This is a known pattern from convomem_bench.py, but the LLM API calls in this benchmark make the attack surface wider.

Consider scoping the bypass to just the download function:

- ssl._create_default_https_context = ssl._create_unverified_context
  ...
  def _download_beam_parquet(cache_dir):
+     ctx = ssl.create_default_context()
+     ctx.check_hostname = False
+     ctx.verify_mode = ssl.CERT_NONE
      ...
-     urllib.request.urlretrieve(HF_BEAM_URL, parquet_path)
+     with urllib.request.urlopen(HF_BEAM_URL, context=ctx) as resp:
+         with open(parquet_path, "wb") as f:
+             f.write(resp.read())

#4: Module-level chromadb.EphemeralClient() executes on import

Location: benchmarks/beam_100k_bench.py:249 | Confidence: ⚠️ MED

_bench_client = chromadb.EphemeralClient() runs at import time, creating a ChromaDB client even if you only import a helper function. This matches longmemeval_bench.py's pattern, so it's consistent — but worth noting that it means import beam_100k_bench has side effects (allocates an in-memory DB).

Consider lazy initialization:

- _bench_client = chromadb.EphemeralClient()
+ _bench_client = None
+
+ def _get_client():
+     global _bench_client
+     if _bench_client is None:
+         _bench_client = chromadb.EphemeralClient()
+     return _bench_client

Low Priority Issues

🎨 #5: Hardcoded model version strings will go stale

Location: benchmarks/beam_100k_bench.py:36,211,395 | Confidence: ⚠️ MED

The default models are gpt-5.4-mini (Azure) and claude-haiku-4-5-20251001 (reranker). These will go stale as new model versions are released. Other benchmarks in the repo have the same pattern, but centralizing model defaults (or documenting them in BENCHMARKS.md) would help maintainability.
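
One lightweight option, sketched below, is to read the defaults from the env vars the benchmark already honors, in a single shared module (the module name and the RERANK_MODEL variable are hypothetical).

    # Sketch of a shared defaults module (e.g. benchmarks/_model_defaults.py);
    # the module name and RERANK_MODEL env var are hypothetical.
    import os

    DEFAULT_CHAT_MODEL = os.environ.get(
        "AZURE_OPENAI_CHAT_MODEL",
        os.environ.get("OPENAI_CHAT_MODEL", "gpt-5.4-mini"),
    )
    DEFAULT_RERANK_MODEL = os.environ.get("RERANK_MODEL", "claude-haiku-4-5-20251001")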


🎨 #6: No-rubric fallback scoring is coarse

Location: benchmarks/beam_100k_bench.py:447-454 | Confidence: ⚠️ MED

When a question has no rubric items, the code scores pass/fail based on len(answer) > 50. This arbitrary character threshold means a 51-character hallucination scores identically to a perfect answer. Consider at minimum logging a warning that unscored questions were encountered, or using the reference answer for a simple string comparison.
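
A small sketch of that suggestion (function and parameter names are hypothetical): warn on the empty-rubric case and return None so the caller can exclude the question from totals.

    # Sketch of a safer no-rubric path: warn and exclude from totals instead
    # of passing on answer length. Names here are hypothetical.
    import logging
    from typing import Callable, Optional

    logger = logging.getLogger("beam_100k_bench")

    def score_question(
        answer: str,
        rubric_items: list[dict],
        judge: Callable[[str, dict], float],
    ) -> Optional[float]:
        if not rubric_items:
            logger.warning("question has no rubric items; excluding from totals")
            return None
        scores = [judge(answer, item) for item in rubric_items]
        return sum(scores) / len(scores)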


Flow Impact Analysis

No existing code paths are modified. The PR is purely additive:

benchmarks/beam_100k_bench.py (NEW)
  └── uses: chromadb.EphemeralClient (direct, not palace_db.py — consistent with other benchmarks)
  └── uses: mempalace.dialect.Dialect (optional AAAK mode)
  └── uses: openai.AzureOpenAI / openai.OpenAI (synthesis + judging)
  └── uses: Anthropic API via urllib (optional LLM reranking)

benchmarks/convert_beam.py (NEW)
  └── standalone CLI script, no MemPalace imports

.gitignore
  └── adds: benchmarks/.beam_cache/

Created by Octocode MCP https://octocode.ai 🔍🐙

rohithzr added 2 commits April 9, 2026 11:57
Three issues from the PR review have been fixed. Three were skipped
as false positives or matching existing repo patterns.

Fixed:

1. Code duplication between bench and converter (HIGH).
   Extracted shared parsing logic into benchmarks/_beam_utils.py.
   - convert_beam.py is now a 38-line wrapper.
   - beam_100k_bench.py imports from _beam_utils.
   - Net 54 lines removed, parsing logic exists in one place.

2. SSL verification scoped to BEAM download only (MED).
   Removed module-level ssl._create_default_https_context bypass that
   disabled cert verification process-wide. SSL is now skipped only
   inside download_beam_parquet() via a per-request SSLContext, so
   LLM API calls (OpenAI, Anthropic) keep default cert verification.

3. Removed misleading no-rubric fallback (LOW).
   The previous fallback scored answers as PASS based on len(answer) > 50,
   which would have inflated scores on hallucinated answers if a question
   ever shipped without a rubric. BEAM 100K has 0/400 such questions, but
   the new behavior prints [SKIP] and excludes the question from totals.

Skipped (with rationale):

- Judge double-quantization: intentional, matches canonical BEAM Rust
  implementation. LLMs occasionally return non-canonical scores like 0.7
  and snapping to nearest tier is correct BEAM methodology.
- Lazy ChromaDB init: matches existing pattern in longmemeval_bench.py
  and convomem_bench.py. Changing one file breaks consistency.
- Hardcoded model strings: already overridable via AZURE_OPENAI_CHAT_MODEL
  env var and --llm-model CLI flag.

rohithzr commented Apr 9, 2026

Thanks for the review @bgauryy. Pushed a follow-up commit (8dfae0d) addressing the issues that hold up under inspection. Going through each finding:

Fixed

Code duplication (HIGH). Real issue. Extracted shared parsing logic into benchmarks/_beam_utils.py. convert_beam.py is now a 38-line wrapper that calls convert_parquet_to_json(). beam_100k_bench.py imports ensure_beam_dataset() from the same module. Net 54 lines removed, parsing logic exists in one place. Future bug fixes only need to land in _beam_utils.py.

SSL verification scoped (MED). Real issue. Removed the module-level ssl._create_default_https_context = ssl._create_unverified_context bypass that disabled certificate verification process-wide. SSL is now skipped only inside download_beam_parquet() via a per-request SSLContext, so LLM API calls (OpenAI, Anthropic) keep default certificate verification. This is a tighter scope than the existing convomem_bench.py pattern, intentionally.

No-rubric fallback (LOW). Real issue. The previous fallback scored answers as PASS based on len(answer) > 50, which would have inflated scores on hallucinated 51-character answers if a question shipped without a rubric. Verified 0/400 BEAM questions trigger this branch, but the new behavior prints [SKIP] and excludes the question from totals. Safer if a future dataset variant ever ships unscored questions.

Skipped (with rationale)

Judge double-quantization. This one looks like a bug but isn't. The threshold snapping is intentional and matches the canonical BEAM Rust reference implementation in Karta's beam_100k.rs. LLMs occasionally return non-canonical scores like 0.7 despite the prompt asking for exactly 1.0/0.5/0.0, and snapping to the nearest tier is the standard BEAM normalization. The proposed "fix" produces identical behavior in the well-behaved case and the same fallback in the misbehaved case, so it would just add code without changing scores.

Lazy ChromaDB initialization. Matches the existing pattern in longmemeval_bench.py and convomem_bench.py, both of which create the EphemeralClient at module load. Changing it for one file breaks consistency without solving a real problem (the benchmark scripts are entry points, not libraries that get imported as helpers).

Hardcoded model strings. Already overridable via AZURE_OPENAI_CHAT_MODEL env var, OPENAI_CHAT_MODEL env var, and --llm-model CLI flag. The hardcoded values are documented defaults, not constraints.

The branch is also synced with origin/main (75 commits, no conflicts).


bgauryy commented Apr 9, 2026

Thanks for the follow-up @rohithzr
I verified each claim against the code on the branch. The three fixes are solid and the commit is clean. Quick note on one point:

ChromaDB lazy init

You mention this matches longmemeval_bench.py and convomem_bench.py, both creating EphemeralClient at module load. That's accurate for longmemeval_bench.py, but convomem_bench.py actually uses chromadb.PersistentClient(path=...) inside a function (~line 181) -- which is the lazy pattern I originally suggested.

Not a blocker -- these scripts are entry points, not importable libraries, so the practical risk is low. Just wanted to correct the record since the consistency argument only holds for one of the two benchmarks cited.

Everything else checks out. The Karta reference impl confirms the judge snapping, SSL scoping is clean, and the _beam_utils.py extraction is well done. LGTM.

Module-level chromadb.EphemeralClient() created the client at import
time, which gave `import beam_100k_bench` an unwanted side effect.
Wrapped it in _get_bench_client() so the client is constructed on
first use instead.

Earlier comment claimed both longmemeval_bench.py and convomem_bench.py
created their client at module load. Only longmemeval_bench.py does that.
convomem_bench.py creates a fresh PersistentClient inside its evaluation
function (the lazy pattern), so the consistency argument was wrong.

rohithzr commented Apr 9, 2026

You're right and I was wrong. I checked the file: convomem_bench.py has used chromadb.PersistentClient(path=palace_path) inside its per-evaluation function since the original benchmark commit (0f8fa8c). It was never module-level. I pattern-matched from my local experiments instead of checking the actual code, and the consistency claim only ever held for longmemeval_bench.py.

Pushed 475d3fb with the lazy init pattern. _bench_client is now None at module load and gets constructed on first call to _get_bench_client(). Verified import beam_100k_bench has no side effects:

_bench_client at import time: None
_bench_client after _get_bench_client(): Client
cached on second call: True

Thanks for the correction. This is a better pattern, the only reason I argued against it was a wrong factual premise.


web3guru888 left a comment


📊 Review of #168 - bench: add BEAM 100K benchmark (end-to-end answer quality)

Scope: +1020/−0 · 4 file(s)

  • .gitignore (modified: +1/−0)
  • benchmarks/_beam_utils.py (added: +190/−0)
  • benchmarks/beam_100k_bench.py (added: +791/−0)
  • benchmarks/convert_beam.py (added: +38/−0)

Technical Analysis

  • 🔤 Embedding model configuration — verify dimensionality compatibility with existing ChromaDB collections
  • 🪟 Windows compatibility — verify path handling works cross-platform

Issues

  • 🔒 eval() — arbitrary code execution; use ast.literal_eval() or json.loads()

🔴 Changes requested — security concern(s) must be addressed before merge.


🏛️ Reviewed by MemPalace-AGI · Autonomous research system with perfect memory · Showcase: Truth Palace of Atlantis

bensig changed the base branch from main to develop on April 11, 2026, 22:23
igorls added the area/ci (CI/CD and workflows) and performance (Performance improvements) labels on Apr 14, 2026