bench: add scale benchmark suite (106 tests) #223
Conversation
Benchmark mempalace at configurable scale (1K–100K drawers) to find real-world performance limits. Tests cover MCP tool OOM thresholds, ChromaDB query degradation, search recall@k, mining throughput, knowledge graph concurrency, memory leak detection, palace boost quantification, and Layer1 unbounded fetch behavior.

- tests/benchmarks/ with 8 test modules + data generator + report system
- Deterministic data factory with planted needles for recall measurement
- JSON report output with regression detection (--bench-report flag)
- CI benchmark job on PRs at small scale
- psutil added as dev dependency for RSS tracking
Concentrates all drawers into a single wing+room to isolate the embedding model's retrieval limit independent of palace filtering. Confirms recall degrades to ~0.4-0.5 at 5K drawers per room even with wing+room filters applied — the spatial structure helps by keeping buckets small, but can't fix the underlying embedding ceiling.
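For context, the recall measurement this describes can be sketched as a loop over planted needles (a hypothetical illustration, not this suite's code; `collection` is assumed to be a populated ChromaDB collection and `needles` a mapping from probe query to the planted drawer's ID):

```python
# Hypothetical recall@k measurement against planted needles.
def recall_at_k(collection, needles: dict[str, str], k: int = 10) -> float:
    hits = 0
    for query, needle_id in needles.items():
        # ChromaDB returns results nested per query; we sent one query.
        res = collection.query(query_texts=[query], n_results=k)
        if needle_id in res["ids"][0]:
            hits += 1
    return hits / len(needles)
```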
Remove unused imports (shutil, string, datetime, os, yaml, time, SCALE_CONFIGS) and unused variable assignments in timing-only calls.
Pull request overview
This PR introduces a dedicated scale/performance benchmark suite to measure MemPalace behavior as data volumes grow, and wires it into CI (PR-only) with JSON reporting support.
Changes:
- Added a new `tests/benchmarks/` suite (multiple modules) covering search latency/recall, MCP tool behavior, ChromaDB stress, ingestion throughput, memory/RSS profiling, layers/wake-up cost, and KG concurrency/perf.
- Added a deterministic benchmark data generator with planted "needles" plus JSON metric recording and report/regression utilities.
- Updated packaging/CI to support benchmarks (adds `psutil` to dev deps, registers pytest markers, and adds a PR-only benchmark GitHub Actions job).
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 22 comments.
| File | Description |
|---|---|
| tests/benchmarks/test_search_bench.py | Search latency/recall/concurrency and n_results scaling benchmarks |
| tests/benchmarks/test_recall_threshold.py | Single-room “bucket size” recall ceiling benchmark |
| tests/benchmarks/test_palace_boost.py | Quantifies recall/latency impact of wing/room filtering (“palace boost”) |
| tests/benchmarks/test_memory_profile.py | RSS and heap snapshot benchmarks for high-risk paths |
| tests/benchmarks/test_mcp_bench.py | MCP tool latency/RSS benchmarks and client re-instantiation cost |
| tests/benchmarks/test_layers_bench.py | Wake-up/layer performance and token budget checks |
| tests/benchmarks/test_knowledge_graph_bench.py | KG throughput/latency/temporal correctness and SQLite concurrency benchmarks |
| tests/benchmarks/test_ingest_bench.py | Mining/chunking throughput, peak RSS, and re-ingest skip overhead benchmarks |
| tests/benchmarks/test_chromadb_stress.py | ChromaDB stress/degradation + bulk insert comparisons |
| tests/benchmarks/data_generator.py | Seeded data factory + planted needles and KG triple generation |
| tests/benchmarks/report.py | Metric recording and regression comparison utilities |
| tests/benchmarks/conftest.py | Benchmark pytest options/markers/report writing hook |
| tests/benchmarks/README.md | Benchmark suite documentation and run instructions |
| tests/benchmarks/`__init__.py` | Benchmark package marker |
| pyproject.toml | Adds psutil to dev deps + pytest markers/pythonpath |
| .github/workflows/ci.yml | Runs unit tests excluding benchmarks; adds PR-only benchmark job + artifact upload |
Comments suppressed due to low confidence (1)
tests/benchmarks/test_knowledge_graph_bench.py:281
`result` is assigned but never used; CI runs `ruff check .` so this will fail with F841. Remove the assignment or use the result (e.g., assert it's a dict with expected keys).
```python
elapsed_ms = (time.perf_counter() - start) * 1000
latencies.append(elapsed_ms)
avg_ms = sum(latencies) / len(latencies)
record_metric("kg_stats", f"avg_ms_at_{n_triples}", round(avg_ms, 2))
```
```python
import json
import os
import tempfile
```
`datetime` is imported but never used; `ruff check .` in CI will fail with F401. Remove the unused import (or use it if intended).
```python
higher_is_worse = {
    "latency", "rss", "memory", "oom", "lock_failures", "elapsed",
    "p50_ms", "p95_ms", "p99_ms", "rss_delta_mb", "peak_rss_mb",
}
# Metrics where LOWER is worse (throughput, recall, etc.)
lower_is_worse = {
    "recall", "throughput", "per_sec", "files_per_sec", "drawers_per_sec",
    "triples_per_sec", "improvement",
}

for category in baseline.get("results", {}):
    if category not in current.get("results", {}):
        continue
    for metric, base_val in baseline["results"][category].items():
        if metric not in current["results"][category]:
            continue
        curr_val = current["results"][category][metric]
        if not isinstance(base_val, (int, float)) or not isinstance(curr_val, (int, float)):
            continue
        if base_val == 0:
            continue

        # Determine direction
        is_latency_like = any(kw in metric.lower() for kw in higher_is_worse)
        is_throughput_like = any(kw in metric.lower() for kw in lower_is_worse)

        if is_latency_like:
            # Higher is worse — check if current exceeds baseline by threshold
            if curr_val > base_val * (1 + threshold):
                pct = ((curr_val - base_val) / base_val) * 100
                regressions.append(
                    f"{category}/{metric}: {base_val:.2f} -> {curr_val:.2f} ({pct:+.1f}%, threshold {threshold*100:.0f}%)"
                )
        elif is_throughput_like:
            # Lower is worse — check if current is below baseline by threshold
```
`check_regression()` classifies metrics by substring matching. A metric like `latency_improvement_pct` matches both "latency" (treated as higher-is-worse) and "improvement" (higher-is-better), but the current logic prioritizes the latency branch, so improved latency can be flagged as a regression. Consider an explicit per-metric direction map (or precedence rules), or rename/namespace improvement metrics so they don't include "latency" in the key.
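One shape such a fix could take (a sketch, not the PR's code; the map entry and keyword lists are illustrative):

```python
# Explicit per-metric direction map consulted first; substring matching is
# only a fallback, and higher-is-better keywords are checked before
# higher-is-worse ones so "latency_improvement_pct" is not captured by the
# "latency" substring.
EXPLICIT_DIRECTION = {
    "latency_improvement_pct": "higher_is_better",  # illustrative entry
}

def metric_direction(metric: str) -> str:
    if metric in EXPLICIT_DIRECTION:
        return EXPLICIT_DIRECTION[metric]
    name = metric.lower()
    if any(kw in name for kw in ("improvement", "recall", "throughput", "per_sec")):
        return "higher_is_better"
    if any(kw in name for kw in ("latency", "rss", "memory", "elapsed", "_ms")):
        return "higher_is_worse"
    return "unknown"
```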
```python
import json
import os
import tempfile

import pytest
```
`shutil` is imported but never used; this will fail `ruff check .` (F401). Remove the unused import.
```python
# The results are written by individual tests via bench_results fixture
import platform
import subprocess

try:
    git_sha = subprocess.check_output(
```
The comments in `pytest_terminal_summary` say results are written via the `bench_results` fixture, but the implementation actually reads the temp JSON file written by `record_metric()`. This mismatch is confusing for future maintenance; either wire tests to use `bench_results` or update/remove the stale comments/fixture.
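If the fixture route were chosen, the wiring could look roughly like this (a hypothetical sketch of a conftest.py, not the PR's implementation, which reads `record_metric()`'s temp JSON instead):

```python
# Session-scoped store that benchmark tests write into, read back by the
# terminal-summary hook — replacing the temp-JSON indirection.
import pytest

_BENCH_RESULTS: dict[str, dict] = {}

@pytest.fixture(scope="session")
def bench_results():
    return _BENCH_RESULTS

def pytest_terminal_summary(terminalreporter, exitstatus, config):
    if _BENCH_RESULTS:
        terminalreporter.write_line(f"bench categories: {sorted(_BENCH_RESULTS)}")
```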
```python
import hashlib
import os
import random
from datetime import datetime, timedelta
from pathlib import Path
```
`string` is imported but never used; this will fail `ruff check .` (F401). Remove the unused import.
```python
start = time.perf_counter()
kg.timeline()
elapsed_ms = (time.perf_counter() - start) * 1000
latencies.append(elapsed_ms)

avg_ms = sum(latencies) / len(latencies)
```
`result` is assigned but never used; CI runs `ruff check .` so this will fail with F841. Remove the assignment or use the result (e.g., assert it returns <= 100 items).
```python
result_march = kg.query_entity("Alice", as_of="2024-03-15")
# Query Alice as of September 2024 — should find ProjectB
result_sept = kg.query_entity("Alice", as_of="2024-09-15")
```
`project_names` is computed but never used; CI runs `ruff check .` so this will fail with F841. Remove it or use it for a basic correctness assertion (e.g., that ProjectA appears for the March query).
```python
march_project_names = {
    item.get("object")
    for item in result_march
    if isinstance(item, dict) and item.get("predicate") == "works_on"
} if isinstance(result_march, list) else set()
sept_project_names = {
    item.get("object")
    for item in result_sept
    if isinstance(item, dict) and item.get("predicate") == "works_on"
} if isinstance(result_sept, list) else set()
assert "ProjectA" in march_project_names
assert "ProjectB" in sept_project_names
```
```toml
[tool.pytest.ini_options]
testpaths = ["tests"]
pythonpath = ["."]
markers = [
    "benchmark: scale/performance benchmark tests",
    "slow: tests that take more than 30 seconds",
    "stress: destructive scale tests (100K+ drawers)",
]
```
Adding 106 benchmark tests under tests/ means a plain `pytest` run will now execute long-running benchmarks unless users remember to filter/ignore them. Consider setting a default `addopts` to skip benchmark/slow/stress markers (and require `-m benchmark` to run them), or moving benchmarks outside the default test discovery path.
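Concretely, something like the following in pyproject.toml would make plain `pytest` skip the benchmarks while `pytest -m benchmark` still opts in, since a later `-m` on the command line overrides the one from `addopts` (a sketch of the suggestion, not the PR's final config):

```toml
[tool.pytest.ini_options]
addopts = '-m "not benchmark and not slow and not stress"'
```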
```python
kg = KnowledgeGraph(db_path=str(tmp_path / "kg.sqlite3"))

# Create a hub entity connected to many others
```
`gen` is assigned but never used; CI runs `ruff check .` so this will fail with F841. Remove the unused variable assignment (or use it if you intended to generate additional data).
```python
for i in range(100):
    kg.add_entity(f"Entity_{i}", "concept")
```
`gen` is assigned but never used; CI runs `ruff check .` so this will fail with F841. Remove the unused variable assignment.
- Run `ruff format` on all benchmark files (fixes CI lint job)
- Fix `check_regression()` substring ambiguity: ordered keyword matching so `latency_improvement_pct` is correctly classified as higher-is-better
- Update stale comments in conftest.py referencing the wrong fixture
- Add pytest `addopts` to skip benchmark/slow/stress markers by default
Too heavy for CI (~2h per run). Benchmarks can be run locally with: `pytest -m benchmark --bench-scale=small --bench-report=results.json`
PR Review: bench: add scale benchmark suite (106 tests)

Executive Summary
Affected Areas:

Business Impact: Enables data-driven performance decisions. Key findings already surfaced (recall ceiling at ~1K drawers/bucket, client re-instantiation overhead, unbounded metadata fetch).

Flow Changes: No production flow changes. Adds pytest markers, CLI options, and a PR-only CI benchmark job.

Ratings
PR Health
High Priority Issues (Must fix before merge)

🐛 #1:
Merged both the PR's benchmark suite additions (`psutil` dep, pytest markers, `--ignore=tests/benchmarks`) and upstream's coverage changes (`pytest-cov`, `--cov-fail-under=30`, coverage config) so both coexist. Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Summary
- `tests/benchmarks/` (9 test modules, 106 tests)
- JSON report output with regression detection (`--bench-report` flag)
- `psutil` added as dev dependency for RSS tracking

Key findings from initial runs
- `search_memories()` RSS growth dropped from 43 MB/100 calls to 0.09 MB/100 calls after PR "fix: add limit=10000 safety cap to all unbounded ChromaDB .get() calls" #137 (see the sketch after the test plan).
- MCP tools instantiate a new `PersistentClient` per call.
- `tool_status()` and `Layer1.generate()` load all metadata linearly (57ms + 2.7MB at 5K drawers).

Test plan
- `uv run pytest tests/benchmarks/ -v --bench-scale=small` — 106 passed, 0 failed
- `uv run pytest tests/ -v --ignore=tests/benchmarks` — existing 20 tests unaffected
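For reference, the unbounded-fetch fix cited in the first finding amounts to passing an explicit ceiling to ChromaDB's `Collection.get()` (a minimal sketch of the pattern, with the cap value taken from the #137 PR title; the actual patch may differ):

```python
# Cap unbounded ChromaDB fetches so metadata scans stay bounded in RAM.
MAX_FETCH = 10_000  # cap from the #137 PR title

def get_all_metadata(collection):
    # include= restricts the payload to metadatas (no embeddings/documents)
    return collection.get(limit=MAX_FETCH, include=["metadatas"])
```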