bench: add BEAM 100K benchmark (end-to-end answer quality) #168
rohithzr wants to merge 7 commits into MemPalace:develop from
Conversation
End-to-end answer quality evaluation using the BEAM benchmark (Tavakoli et al., 2024). Tests 10 memory abilities across 20 conversations and 400 questions with the official 3-tier rubric judge (1.0 / 0.5 / 0.0).

Pipeline: ChromaDB retrieval -> LLM synthesis -> BEAM rubric judge.

Follows the same patterns as the existing LongMemEval and LoCoMo benchmarks (EphemeralClient, _fresh_collection, _make_embed_fn, argparse conventions).

Auto-downloads the BEAM parquet from HuggingFace on first run, converts it to JSON, and caches it locally. Also includes convert_beam.py for standalone manual conversion.

Requires: pip install openai pandas pyarrow
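A rough sketch of what the download-and-cache step looks like; the URL placeholder, file names, and serialization below are assumptions, not the real constants in beam_100k_bench.py:

```python
import urllib.request
from pathlib import Path

import pandas as pd  # reading parquet requires the pyarrow engine

# Hypothetical placeholder URL -- the real constant lives in beam_100k_bench.py.
HF_BEAM_URL = "https://huggingface.co/datasets/<org>/<beam-100k>/resolve/main/data.parquet"

def download_and_convert(cache_dir: str = "benchmarks/.beam_cache") -> Path:
    """Download the BEAM parquet once, convert it to JSON, and cache both files."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    parquet_path = cache / "beam-100k.parquet"
    json_path = cache / "beam-100k.json"
    if not parquet_path.exists():
        urllib.request.urlretrieve(HF_BEAM_URL, parquet_path)
    if not json_path.exists():
        df = pd.read_parquet(parquet_path)
        json_path.write_text(df.to_json(orient="records", force_ascii=False))
    return json_path
```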
Adds --mode hybrid (keyword overlap re-ranking, same logic as longmemeval_bench.py build_palace_and_retrieve_hybrid) and --llm-rerank (Claude Haiku reranking, same logic as longmemeval_bench.py llm_rerank).

Results across all three modes on the full 20-conversation run:

- raw: 49.0% (515/1051)
- hybrid: 43.0% (452/1051)
- hybrid + haiku rerank: 43.6% (458/1051)
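For illustration, a sketch of the kind of keyword-overlap re-ranking the hybrid mode describes; the stopword list, tokenization, and scoring here are assumptions, not the exact logic in longmemeval_bench.py:

```python
import re

# Assumed stopword list -- the real benchmark may use a different one (or none).
_STOPWORDS = {"the", "a", "an", "and", "or", "is", "are", "was", "were", "to", "of", "in"}

def _keywords(text: str) -> set:
    """Lowercased word tokens minus stopwords."""
    return {w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in _STOPWORDS}

def hybrid_rerank(question: str, candidates: list, top_k: int = 10) -> list:
    """Re-rank vector-retrieval candidates by keyword overlap with the question."""
    q = _keywords(question)
    return sorted(candidates, key=lambda doc: len(q & _keywords(doc)), reverse=True)[:top_k]
```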
Adds --mode aaak (compress messages through mempalace.dialect.Dialect before indexing) and --aaak-spec (include the AAAK dialect specification from mcp_server.py in the synthesis prompt). AAAK is a novel approach to context compression. This implementation is based on the code in dialect.py and the AAAK_SPEC in mcp_server.py. If there are additional prompting strategies for helping the LLM read AAAK format, those could improve these numbers.

Results across all five modes on the full 20-conversation run:

- raw (k=10): 49.0% (515/1051)
- hybrid (k=10): 43.0% (452/1051)
- hybrid + haiku rerank (k=10): 43.6% (458/1051)
- aaak (k=50): 26.2% (275/1051)
- aaak + spec (k=50): 27.9% (293/1051)
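A rough sketch of how the aaak mode wires compression into indexing; compress_to_aaak below is a stand-in for whatever mempalace.dialect.Dialect actually exposes, since its real method names are not shown in this PR:

```python
def compress_to_aaak(text: str) -> str:
    """Placeholder for mempalace.dialect.Dialect compression; the real method
    name and signature live in dialect.py and are not reproduced here."""
    raise NotImplementedError

def index_conversation_aaak(collection, messages: list) -> None:
    """Compress each message before it is embedded and stored, so retrieval
    operates over the compact AAAK form rather than raw dialogue."""
    compressed = [compress_to_aaak(m) for m in messages]
    collection.add(
        documents=compressed,
        ids=[f"msg-{i}" for i in range(len(compressed))],
    )
```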
PR Review: bench: add BEAM 100K benchmark (end-to-end answer quality)

Executive Summary
Affected Areas:
Business Impact: Adds the first end-to-end answer quality benchmark (vs. retrieval-only). Enables measuring MemPalace RAG quality against BEAM's official rubric judge.
Flow Changes: None — purely additive. No changes to core MemPalace code.

Ratings
PR Health
High Priority Issues

🔗 #1: Massive code duplication between bench and converter
Location: The entire
Fix: Have
- # convert_beam.py: 100+ lines of duplicated parsing logic
+ from beam_100k_bench import _convert_parquet_to_json, _download_beam_parquet
+ import sys
+
+ def main():
+     input_file = sys.argv[1] if len(sys.argv) > 1 else "data/beam-100k.parquet"
+     output_file = sys.argv[2] if len(sys.argv) > 2 else "data/beam-100k.json"
+     _convert_parquet_to_json(input_file, output_file)

Medium Priority Issues

🐛 #2: Judge score double-quantization may mask LLM output issues
Location: The BEAM judge prompt instructs the LLM to return exactly
try:
    parsed = json.loads(response)
    score = float(parsed.get("score", 0))
-   if score >= 0.75:
-       return 1.0
-   elif score >= 0.25:
-       return 0.5
-   else:
-       return 0.0
+   # BEAM rubric judge should return exactly 1.0, 0.5, or 0.0
+   if score in (1.0, 0.5, 0.0):
+       return score
+   # Fallback for unexpected values: snap to nearest tier
+   if score >= 0.75:
+       return 1.0
+   elif score >= 0.25:
+       return 0.5
+   return 0.0

🚨 #3: SSL verification globally disabled at import time
Location:
Consider scoping the bypass to just the download function:

- ssl._create_default_https_context = ssl._create_unverified_context
...
def _download_beam_parquet(cache_dir):
+     ctx = ssl.create_default_context()
+     ctx.check_hostname = False
+     ctx.verify_mode = ssl.CERT_NONE
  ...
-     urllib.request.urlretrieve(HF_BEAM_URL, parquet_path)
+     with urllib.request.urlopen(HF_BEAM_URL, context=ctx) as resp:
+         with open(parquet_path, "wb") as f:
+             f.write(resp.read())

⚡ #4: Module-level
Three issues from the PR review have been fixed. Three were skipped as false positives or matching existing repo patterns.

Fixed:
1. Code duplication between bench and converter (HIGH). Extracted shared parsing logic into benchmarks/_beam_utils.py. convert_beam.py is now a 38-line wrapper, and beam_100k_bench.py imports from _beam_utils. Net 54 lines removed; parsing logic exists in one place.
2. SSL verification scoped to BEAM download only (MED). Removed the module-level ssl._create_default_https_context bypass that disabled cert verification process-wide. SSL is now skipped only inside download_beam_parquet() via a per-request SSLContext, so LLM API calls (OpenAI, Anthropic) keep default cert verification.
3. Removed misleading no-rubric fallback (LOW). The previous fallback scored answers as PASS based on len(answer) > 50, which would have inflated scores on hallucinated answers if a question ever shipped without a rubric. BEAM 100K has 0/400 such questions, but the new behavior prints [SKIP] and excludes the question from totals.

Skipped (with rationale):
- Judge double-quantization: intentional, matches the canonical BEAM Rust implementation. LLMs occasionally return non-canonical scores like 0.7, and snapping to the nearest tier is correct BEAM methodology.
- Lazy ChromaDB init: matches the existing pattern in longmemeval_bench.py and convomem_bench.py. Changing one file breaks consistency.
- Hardcoded model strings: already overridable via the AZURE_OPENAI_CHAT_MODEL env var and the --llm-model CLI flag.
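As a rough illustration of the new no-rubric behaviour described in item 3 (the function and field names here are assumptions, not the actual benchmark code):

```python
def judge_with_rubric(question: dict, answer: str, judge_fn):
    """Score an answer against its rubric; skip questions that ship without one."""
    rubric = question.get("rubric")
    if not rubric:
        # Old behaviour: len(answer) > 50 counted as PASS, which rewards verbose
        # hallucinations. New behaviour: exclude the question from the totals.
        print(f"[SKIP] question {question.get('id', '?')} has no rubric")
        return None
    return judge_fn(question, answer, rubric)
```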
Thanks for the review @bgauryy. Pushed a follow-up commit (8dfae0d) addressing the issues that hold up under inspection. Going through each finding:

Fixed
- Code duplication (HIGH). Real issue. Extracted shared parsing logic into
- SSL verification scoped (MED). Real issue. Removed the module-level
- No-rubric fallback (LOW). Real issue. The previous fallback scored answers as PASS based on

Skipped (with rationale)
- Judge double-quantization. This one looks like a bug but isn't. The threshold snapping is intentional and matches the canonical BEAM Rust reference implementation in Karta's
- Lazy ChromaDB initialization. Matches the existing pattern in
- Hardcoded model strings. Already overridable via

The branch is also synced with
Thanks for the follow-up @rohithzr

ChromaDB lazy init
You mention this matches

Not a blocker -- these scripts are entry points, not importable libraries, so the practical risk is low. Just wanted to correct the record since the consistency argument only holds for one of the two benchmarks cited.

Everything else checks out. The Karta reference impl confirms the judge snapping, SSL scoping is clean, and the
Module-level chromadb.EphemeralClient() created the client at import time, which gave `import beam_100k_bench` an unwanted side effect. Wrapped it in _get_bench_client() so the client is constructed on first use instead. Earlier comment claimed both longmemeval_bench.py and convomem_bench.py created their client at module load. Only longmemeval_bench.py does that. convomem_bench.py creates a fresh PersistentClient inside its evaluation function (the lazy pattern), so the consistency argument was wrong.
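A minimal sketch of that lazy-init wrapper, assuming a module-level cache variable; the actual implementation in beam_100k_bench.py may differ in details:

```python
import chromadb

_bench_client = None

def _get_bench_client():
    """Construct the EphemeralClient on first use instead of at import time,
    so `import beam_100k_bench` has no side effects."""
    global _bench_client
    if _bench_client is None:
        _bench_client = chromadb.EphemeralClient()
    return _bench_client
```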
You're right and I was wrong. I checked the file:

Pushed

Thanks for the correction. This is a better pattern; the only reason I argued against it was a wrong factual premise.
web3guru888
left a comment
📊 Review of #168 — bench: add BEAM 100K benchmark (end-to-end answer quality)
Scope: +1020/−0 · 4 file(s)
- .gitignore (modified: +1/−0)
- benchmarks/_beam_utils.py (added: +190/−0)
- benchmarks/beam_100k_bench.py (added: +791/−0)
- benchmarks/convert_beam.py (added: +38/−0)
Technical Analysis
- 🔤 Embedding model configuration — verify dimensionality compatibility with existing ChromaDB collections
- 🪟 Windows compatibility — verify path handling works cross-platform
Issues
- 🔒 eval() — arbitrary code execution; use ast.literal_eval() or json.loads()
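For context, a hedged sketch of the safer parsing the reviewer suggests; the flagged call site isn't shown here, so the function name and fallback order are illustrative only:

```python
import ast
import json

def parse_structured_field(raw: str):
    """Parse a serialized dict/list without eval(): try JSON first, then fall
    back to ast.literal_eval, which only accepts literals and cannot run code."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return ast.literal_eval(raw)
```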
🔴 Changes requested — security concern(s) must be addressed before merge.
🏛️ Reviewed by MemPalace-AGI · Autonomous research system with perfect memory · Showcase: Truth Palace of Atlantis
What this adds
BEAM 100K benchmark runner (benchmarks/beam_100k_bench.py) and a standalone data conversion script (benchmarks/convert_beam.py). This is the first benchmark in this repo that measures end-to-end answer quality, not retrieval recall.
How it works
Standard RAG evaluation using MemPalace's own patterns:
chromadb.EphemeralClient() (same as longmemeval_bench.py)

Follows the same code patterns as existing benchmarks: _fresh_collection(), _make_embed_fn(), _bench_client, argparse conventions, JSONL debug output.

What BEAM tests that existing benchmarks don't
BEAM evaluates 10 distinct memory abilities across 20 conversations and 400 questions:
Contradiction resolution, summarization, event ordering, and abstention are not covered by any existing benchmark in this repo.
Running it
Results (full run, 20 conversations)
Config: ChromaDB default embeddings (all-MiniLM-L6-v2), top-10 retrieval, GPT-5.4-mini synthesis + judge, temperature 0.0.
Overall: 515/1051 rubric checks passed (49.0%)
Abilities that reduce to "find the right chunk" score well. Abilities requiring cross-chunk reasoning (contradiction, summarization, event ordering) score lower, which is expected for a retrieval-only architecture.
Files
- benchmarks/beam_100k_bench.py (728 lines) - benchmark runner with auto-download
- benchmarks/convert_beam.py (129 lines) - standalone parquet-to-JSON converter
- .gitignore - added benchmarks/.beam_cache/

Context: #125