
bench: add scale benchmark suite (106 tests) #223

Merged
bensig merged 6 commits into MemPalace:main from igorls:bench/scale-test-suite
Apr 8, 2026

Conversation


igorls commented Apr 8, 2026

Summary

  • Adds a full scale benchmark suite under tests/benchmarks/ (9 test modules, 106 tests)
  • Benchmarks cover MCP tool OOM thresholds, ChromaDB query degradation, search recall@k, mining throughput, knowledge graph concurrency, memory leak detection, palace boost quantification, and per-room recall ceiling
  • Deterministic data generator with planted needles for recall measurement without an LLM judge (see the sketch after this list)
  • JSON report output with regression detection (--bench-report flag)
  • CI benchmark job on PRs at small scale
  • psutil added as dev dependency for RSS tracking
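
The planted-needle approach makes recall measurable without an LLM judge: each needle is a unique fact seeded into the corpus together with a query known to target it, and recall@k is the fraction of needles whose drawer shows up in the top-k results. A minimal sketch (hypothetical names; the actual data_generator.py API may differ):

def recall_at_k(search_fn, needles, k=10):
    # Fraction of planted needles whose drawer appears in the top-k.
    hits = 0
    for needle in needles:
        results = search_fn(needle["query"], n_results=k)
        retrieved = {r["id"] for r in results}
        if needle["drawer_id"] in retrieved:
            hits += 1
    return hits / len(needles)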

Key findings from initial runs

  • Recall degrades past ~1K drawers per bucket — drops to 0.4-0.5 at 5K regardless of wing/room filtering. The palace structure helps by keeping buckets small, but can't fix the underlying embedding model ceiling.
  • Memory leak fixed upstream — search_memories() RSS growth dropped from 43 MB/100 calls to 0.09 MB/100 calls after PR #137 (fix: add limit=10000 safety cap to all unbounded ChromaDB .get() calls).
  • Client re-instantiation — 3.7x overhead from creating a new PersistentClient per call (see the caching sketch after this list).
  • Unbounded metadata fetch — tool_status() and Layer1.generate() load all metadata linearly (57ms + 2.7MB at 5K drawers).
  • KG insertion — ~105 triples/sec after hardening PR; concurrent access works (0 lock failures).
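
The re-instantiation finding points at an obvious mitigation: construct one PersistentClient per palace path and reuse it across calls. A hedged sketch, not the PR's code (chromadb.PersistentClient is the real API; the caching wrapper is illustrative):

import functools

import chromadb

@functools.lru_cache(maxsize=None)
def get_client(path: str) -> chromadb.PersistentClient:
    # One client per path for the process lifetime avoids the ~3.7x
    # per-call construction overhead measured by the benchmark.
    return chromadb.PersistentClient(path=path)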

Test plan

igorls added 2 commits April 8, 2026 05:06
Benchmark mempalace at configurable scale (1K–100K drawers) to find
real-world performance limits. Tests cover MCP tool OOM thresholds,
ChromaDB query degradation, search recall@k, mining throughput,
knowledge graph concurrency, memory leak detection, palace boost
quantification, and Layer1 unbounded fetch behavior.

- tests/benchmarks/ with 8 test modules + data generator + report system
- Deterministic data factory with planted needles for recall measurement
- JSON report output with regression detection (--bench-report flag)
- CI benchmark job on PRs at small scale
- psutil added as dev dependency for RSS tracking

Concentrates all drawers into a single wing+room to isolate the
embedding model's retrieval limit independent of palace filtering.
Confirms recall degrades to ~0.4-0.5 at 5K drawers per room even
with wing+room filters applied — the spatial structure helps by
keeping buckets small, but can't fix the underlying embedding ceiling.

Copilot AI review requested due to automatic review settings April 8, 2026 08:06

Remove unused imports (shutil, string, datetime, os, yaml, time,
SCALE_CONFIGS) and unused variable assignments in timing-only calls.

Copilot AI left a comment


Pull request overview

This PR introduces a dedicated scale/performance benchmark suite to measure MemPalace behavior as data volumes grow, and wires it into CI (PR-only) with JSON reporting support.

Changes:

  • Added a new tests/benchmarks/ suite (multiple modules) covering search latency/recall, MCP tool behavior, ChromaDB stress, ingestion throughput, memory/RSS profiling, layers/wake-up cost, and KG concurrency/perf.
  • Added a deterministic benchmark data generator with planted “needles” plus JSON metric recording and report/regression utilities.
  • Updated packaging/CI to support benchmarks (adds psutil to dev deps, registers pytest markers, and adds a PR-only benchmark GitHub Actions job).

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 22 comments.

| File | Description |
| --- | --- |
| tests/benchmarks/test_search_bench.py | Search latency/recall/concurrency and n_results scaling benchmarks |
| tests/benchmarks/test_recall_threshold.py | Single-room “bucket size” recall ceiling benchmark |
| tests/benchmarks/test_palace_boost.py | Quantifies recall/latency impact of wing/room filtering (“palace boost”) |
| tests/benchmarks/test_memory_profile.py | RSS and heap snapshot benchmarks for high-risk paths |
| tests/benchmarks/test_mcp_bench.py | MCP tool latency/RSS benchmarks and client re-instantiation cost |
| tests/benchmarks/test_layers_bench.py | Wake-up/layer performance and token budget checks |
| tests/benchmarks/test_knowledge_graph_bench.py | KG throughput/latency/temporal correctness and SQLite concurrency benchmarks |
| tests/benchmarks/test_ingest_bench.py | Mining/chunking throughput, peak RSS, and re-ingest skip overhead benchmarks |
| tests/benchmarks/test_chromadb_stress.py | ChromaDB stress/degradation + bulk insert comparisons |
| tests/benchmarks/data_generator.py | Seeded data factory + planted needles and KG triple generation |
| tests/benchmarks/report.py | Metric recording and regression comparison utilities |
| tests/benchmarks/conftest.py | Benchmark pytest options/markers/report writing hook |
| tests/benchmarks/README.md | Benchmark suite documentation and run instructions |
| tests/benchmarks/__init__.py | Benchmark package marker |
| pyproject.toml | Adds psutil to dev deps + pytest markers/pythonpath |
| .github/workflows/ci.yml | Runs unit tests excluding benchmarks; adds PR-only benchmark job + artifact upload |
Comments suppressed due to low confidence (1)

tests/benchmarks/test_knowledge_graph_bench.py:281

  • result is assigned but never used; CI runs ruff check . so this will fail with F841. Remove the assignment or use the result (e.g., assert it’s a dict with expected keys).
            elapsed_ms = (time.perf_counter() - start) * 1000
            latencies.append(elapsed_ms)

        avg_ms = sum(latencies) / len(latencies)
        record_metric("kg_stats", f"avg_ms_at_{n_triples}", round(avg_ms, 2))



Comment on lines +8 to +12
import json
import os
import tempfile



Copilot AI Apr 8, 2026


datetime is imported but never used; ruff check . in CI will fail with F401. Remove the unused import (or use it if intended).

Comment thread tests/benchmarks/report.py Outdated
Comment on lines +49 to +83
higher_is_worse = {
    "latency", "rss", "memory", "oom", "lock_failures", "elapsed",
    "p50_ms", "p95_ms", "p99_ms", "rss_delta_mb", "peak_rss_mb",
}
# Metrics where LOWER is worse (throughput, recall, etc.)
lower_is_worse = {
    "recall", "throughput", "per_sec", "files_per_sec", "drawers_per_sec",
    "triples_per_sec", "improvement",
}

for category in baseline.get("results", {}):
    if category not in current.get("results", {}):
        continue
    for metric, base_val in baseline["results"][category].items():
        if metric not in current["results"][category]:
            continue
        curr_val = current["results"][category][metric]
        if not isinstance(base_val, (int, float)) or not isinstance(curr_val, (int, float)):
            continue
        if base_val == 0:
            continue

        # Determine direction
        is_latency_like = any(kw in metric.lower() for kw in higher_is_worse)
        is_throughput_like = any(kw in metric.lower() for kw in lower_is_worse)

        if is_latency_like:
            # Higher is worse — check if current exceeds baseline by threshold
            if curr_val > base_val * (1 + threshold):
                pct = ((curr_val - base_val) / base_val) * 100
                regressions.append(
                    f"{category}/{metric}: {base_val:.2f} -> {curr_val:.2f} ({pct:+.1f}%, threshold {threshold*100:.0f}%)"
                )
        elif is_throughput_like:
            # Lower is worse — check if current is below baseline by threshold

Copilot AI Apr 8, 2026


check_regression() classifies metrics by substring matching. A metric like latency_improvement_pct matches both “latency” (treated as higher-is-worse) and “improvement” (higher-is-better), but the current logic prioritizes the latency branch, so improved latency can be flagged as a regression. Consider an explicit per-metric direction map (or precedence rules), or rename/namespace improvement metrics so they don’t include latency in the key.

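One way to implement the precedence fix (a sketch of the "ordered keyword matching" the follow-up commit describes; the function name and keyword lists are illustrative): check the higher-is-better keywords first, so a name like latency_improvement_pct is claimed by "improvement" before "latency" can match.

def metric_direction(metric: str) -> str:
    # Higher-is-better keywords are checked first, so "improvement"
    # wins over a coincidental "latency" substring in the same name.
    lower_is_worse = ("recall", "throughput", "per_sec", "improvement")
    higher_is_worse = ("latency", "rss", "memory", "oom", "elapsed")
    name = metric.lower()
    if any(kw in name for kw in lower_is_worse):
        return "lower_is_worse"
    if any(kw in name for kw in higher_is_worse):
        return "higher_is_worse"
    return "unknown"
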
Comment on lines +3 to +7
import json
import os
import tempfile

import pytest

Copilot AI Apr 8, 2026


shutil is imported but never used; this will fail ruff check . (F401). Remove the unused import.

Comment thread tests/benchmarks/conftest.py Outdated
Comment on lines +100 to +105
# The results are written by individual tests via bench_results fixture
import platform
import subprocess

try:
    git_sha = subprocess.check_output(

Copilot AI Apr 8, 2026


The comments in pytest_terminal_summary say results are written via the bench_results fixture, but the implementation actually reads the temp JSON file written by record_metric(). This mismatch is confusing for future maintenance; either wire tests to use bench_results or update/remove the stale comments/fixture.

Comment on lines +10 to +15
import hashlib
import os
import random
from datetime import datetime, timedelta
from pathlib import Path


Copilot AI Apr 8, 2026


string is imported but never used; this will fail ruff check . (F401). Remove the unused import.

Comment on lines +107 to +112
    start = time.perf_counter()
    result = kg.timeline()
    elapsed_ms = (time.perf_counter() - start) * 1000
    latencies.append(elapsed_ms)

avg_ms = sum(latencies) / len(latencies)

Copilot AI Apr 8, 2026


result is assigned but never used; CI runs ruff check . so this will fail with F841. Remove the assignment or use the result (e.g., assert it returns <= 100 items).

result_march = kg.query_entity("Alice", as_of="2024-03-15")
# Query Alice as of September 2024 — should find ProjectB
result_sept = kg.query_entity("Alice", as_of="2024-09-15")


Copilot AI Apr 8, 2026


project_names is computed but never used; CI runs ruff check . so this will fail with F841. Remove it or use it for a basic correctness assertion (e.g., that ProjectA appears for the March query).

Suggested change
march_project_names = {
    item.get("object")
    for item in result_march
    if isinstance(item, dict) and item.get("predicate") == "works_on"
} if isinstance(result_march, list) else set()
sept_project_names = {
    item.get("object")
    for item in result_sept
    if isinstance(item, dict) and item.get("predicate") == "works_on"
} if isinstance(result_sept, list) else set()
assert "ProjectA" in march_project_names
assert "ProjectB" in sept_project_names

Comment thread pyproject.toml
Comment on lines 65 to +72
[tool.pytest.ini_options]
testpaths = ["tests"]
pythonpath = ["."]
markers = [
"benchmark: scale/performance benchmark tests",
"slow: tests that take more than 30 seconds",
"stress: destructive scale tests (100K+ drawers)",
]

Copilot AI Apr 8, 2026


Adding 106 benchmark tests under tests/ means a plain pytest run will now execute long-running benchmarks unless users remember to filter/ignore them. Consider setting a default addopts to skip benchmark/slow/stress markers (and require -m benchmark to run them), or moving benchmarks outside the default test discovery path.
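
A minimal sketch of the default-skip suggestion (the follow-up commit adds an addopts line of exactly this shape; the placement shown here is assumed):

[tool.pytest.ini_options]
addopts = "-m 'not benchmark and not slow and not stress'"

With this in place, a plain pytest run skips the suite, and pytest -m benchmark opts back in.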

Comment on lines +58 to +60
kg = KnowledgeGraph(db_path=str(tmp_path / "kg.sqlite3"))

# Create a hub entity connected to many others

Copilot AI Apr 8, 2026


gen is assigned but never used; CI runs ruff check . so this will fail with F841. Remove the unused variable assignment (or use it if you intended to generate additional data).

Comment on lines +163 to +165
for i in range(100):
    kg.add_entity(f"Entity_{i}", "concept")


Copilot AI Apr 8, 2026


gen is assigned but never used; CI runs ruff check . so this will fail with F841. Remove the unused variable assignment.

igorls added 2 commits April 8, 2026 10:56
- Run ruff format on all benchmark files (fixes CI lint job)
- Fix check_regression() substring ambiguity: ordered keyword matching
  so "latency_improvement_pct" is correctly classified as higher-is-better
- Update stale comments in conftest.py referencing wrong fixture
- Add pytest addopts to skip benchmark/slow/stress markers by default

Too heavy for CI (~2h per run). Benchmarks can be run locally with:
  pytest -m benchmark --bench-scale=small --bench-report=results.json

bgauryy commented Apr 8, 2026

PR Review: bench: add scale benchmark suite (106 tests)

Executive Summary

| Aspect | Value |
| --- | --- |
| PR Goal | Add a comprehensive scale benchmark suite to measure mempalace performance at 1K–100K drawers |
| Files Changed | 16 (2 modified, 14 added) |
| Risk Level | 🟢 LOW — test-only PR, no production code changed |
| Review Effort | 3 — large surface area but straightforward patterns |
| Recommendation | 🔄 REQUEST_CHANGES |

Affected Areas: tests/benchmarks/ (new), pyproject.toml, .github/workflows/ci.yml

Business Impact: Enables data-driven performance decisions. Key findings already surfaced (recall ceiling at ~1K drawers/bucket, client re-instantiation overhead, unbounded metadata fetch).

Flow Changes: No production flow changes. Adds pytest markers, CLI options, and a pythonpath directive that affect all test invocations.

Ratings

| Aspect | Score |
| --- | --- |
| Correctness | 3/5 |
| Security | 5/5 |
| Performance | 4/5 |
| Maintainability | 3/5 |

PR Health


High Priority Issues

(Must fix before merge)

🐛 #1: record_metric() has a read-modify-write race on a shared temp file

Location: tests/benchmarks/report.py:16-31 | Confidence: ✅ HIGH

record_metric() reads a JSON file, updates it in-memory, and writes it back — all without file locking. Under parallel test execution (pytest-xdist / -n auto), multiple processes will race on this file, silently dropping metrics or corrupting the JSON.

Even without xdist, pytest's fixture teardown ordering is not guaranteed across sessions, so the pytest_terminal_summary hook reading this file could hit a partially-written state.

- def record_metric(category: str, metric: str, value):
-     """Append a metric to the session results file (JSON on disk)."""
-     results = {}
-     if os.path.exists(RESULTS_FILE):
-         try:
-             with open(RESULTS_FILE) as f:
-                 results = json.load(f)
-         except (json.JSONDecodeError, OSError):
-             results = {}
-     if category not in results:
-         results[category] = {}
-     results[category][metric] = value
-     with open(RESULTS_FILE, "w") as f:
-         json.dump(results, f, indent=2)
+ import fcntl
+
+ def record_metric(category: str, metric: str, value):
+     """Append a metric to the session results file (JSON on disk)."""
+     with open(RESULTS_FILE, "a+") as f:
+         fcntl.flock(f, fcntl.LOCK_EX)
+         try:
+             f.seek(0)
+             content = f.read()
+             results = json.loads(content) if content.strip() else {}
+             if category not in results:
+                 results[category] = {}
+             results[category][metric] = value
+             f.seek(0)
+             f.truncate()
+             json.dump(results, f, indent=2)
+         finally:
+             fcntl.flock(f, fcntl.LOCK_UN)

Note: fcntl is Unix-only. For cross-platform support, use filelock (PyPI) or document that benchmarks are single-process only.
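
A cross-platform variant of the same fix using the filelock package the note mentions (sketch only, reusing RESULTS_FILE from report.py):

import json

from filelock import FileLock

def record_metric(category: str, metric: str, value):
    # Exclusive lock around the whole read-modify-write cycle, so
    # parallel xdist workers cannot drop or corrupt metrics.
    with FileLock(str(RESULTS_FILE) + ".lock"):
        try:
            with open(RESULTS_FILE) as f:
                results = json.load(f)
        except (OSError, json.JSONDecodeError):
            results = {}
        results.setdefault(category, {})[metric] = value
        with open(RESULTS_FILE, "w") as f:
            json.dump(results, f, indent=2)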


🏗️ #2: CI benchmark job is documented but not implemented

Location: .github/workflows/ci.yml:21 + tests/benchmarks/README.md:168-175 | Confidence: ✅ HIGH

The README documents a CI benchmark job:

benchmark:
  runs-on: ubuntu-latest
  if: github.event_name == 'pull_request'
  # Runs: pytest tests/benchmarks/ -m "benchmark and not stress and not slow" --bench-scale=small

But the actual CI change only adds --ignore=tests/benchmarks to the existing test step. No benchmark job is added. The regression detection infrastructure (check_regression(), --bench-report) is unreachable from CI.

  # In .github/workflows/ci.yml, after the existing test job:
+
+  benchmark:
+    runs-on: ubuntu-latest
+    if: github.event_name == 'pull_request'
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+      - run: pip install -e ".[dev]"
+      - run: python -m pytest tests/benchmarks/ -v -m "benchmark and not stress and not slow" --bench-scale=small

Either add the job or remove the claim from the README.


Medium Priority Issues

(Should fix, not blocking)

🔗 #3: _get_rss_mb() duplicated across 5 test files

Location: test_chromadb_stress.py:20, test_ingest_bench.py:20, test_layers_bench.py:16, test_mcp_bench.py:47, test_memory_profile.py:20 | Confidence: ✅ HIGH

The identical 10-line _get_rss_mb() function is copy-pasted in 5 files. Any future bug fix or platform handling change must be replicated 5 times.

  # In conftest.py, add:
+ def _get_rss_mb():
+     """Get current process RSS in MB. Cross-platform."""
+     try:
+         import psutil
+         return psutil.Process().memory_info().rss / (1024 * 1024)
+     except ImportError:
+         import resource, platform
+         usage = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
+         if platform.system() == "Darwin":
+             return usage / (1024 * 1024)
+         return usage / 1024
+
+ @pytest.fixture
+ def get_rss_mb():
+     return _get_rss_mb

Then import from conftest in each test file, or use it as a fixture.
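
Usage in a benchmark module would then look like this (hypothetical test; the 50 MB budget is illustrative):

def test_search_rss_budget(get_rss_mb):
    before = get_rss_mb()
    # ... exercise the code path under measurement ...
    assert get_rss_mb() - before < 50  # MB budget, illustrative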


#4: pythonpath = ["."] added globally affects all test runs

Location: pyproject.toml:67 | Confidence: ⚠️ MED

Adding pythonpath = ["."] to [tool.pytest.ini_options] modifies sys.path for every pytest invocation — not just benchmarks. This is needed because benchmarks use from tests.benchmarks.data_generator import ..., but it also means any top-level directory could shadow installed packages.

Since the project uses editable install (pip install -e ".[dev]"), this is currently harmless. But consider scoping it:

- pythonpath = ["."]

And instead adjust benchmark imports to be relative, or add a conftest.py at the tests/ root that handles path setup.
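
The tests/-root conftest alternative is only a few lines (sketch; assumes the repo root sits one level above tests/):

# tests/conftest.py: keep path setup scoped to test runs
import sys
from pathlib import Path

sys.path.insert(0, str(Path(__file__).resolve().parent.parent))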


🐛 #5: Percentile calculations on 5 samples are statistically meaningless

Location: test_chromadb_stress.py:102, test_search_bench.py:51 | Confidence: ✅ HIGH

With only 5 query samples, p95 = sorted(latencies)[int(5 * 0.95)] = sorted(latencies)[4] = the maximum value. The "p95" label is misleading — it's just the max. Similarly, p50 with 5 samples is just the median of 5.

  # In test_query_latency_at_size and test_search_latency_curve:
  # Either increase the sample count:
- queries = [
-     "authentication middleware optimization",
-     ...  # 5 queries
- ]
+ queries = [...] * 6  # 30 total for meaningful percentiles

  # Or rename the metrics honestly:
- record_metric("chromadb_query", f"p95_latency_ms_at_{n_drawers}", round(p95_ms, 1))
+ record_metric("chromadb_query", f"max_latency_ms_at_{n_drawers}", round(max(latencies), 1))
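
If the percentile names are kept, a small guard can refuse to over-report (sketch; statistics.quantiles is stdlib, the fallback rule is illustrative):

import statistics

def percentile_or_max(latencies, q=0.95):
    # p95 needs at least 1/(1-q) = 20 samples; below that, report the
    # max under an honest name instead of a misleading percentile label.
    if len(latencies) < 1 / (1 - q):
        return "max", max(latencies)
    cuts = statistics.quantiles(latencies, n=100)
    return f"p{int(q * 100)}", cuts[int(q * 100) - 1]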

🚨 #6: _patch_mcp_config silently ignores collection_name

Location: test_mcp_bench.py:32-45 (specifically line 39) | Confidence: ⚠️ MED

The monkeypatch replaces _file_config with {"palace_path": palace_path} but omits collection_name. This means _config.collection_name falls back to the default ("mempalace_drawers"), which happens to match the hardcoded collection name in populate_palace_directly(). If either default changes, these tests silently break.

- monkeypatch.setattr(cfg, "_file_config", {"palace_path": palace_path})
+ monkeypatch.setattr(cfg, "_file_config", {
+     "palace_path": palace_path,
+     "collection_name": "mempalace_drawers",
+ })

Low Priority Issues

(Nice to have)

🎨 #7: __import__("datetime") inline instead of normal import

Location: tests/benchmarks/conftest.py:118 | Confidence: ✅ HIGH

+ from datetime import datetime
  ...
  report = {
-     "timestamp": __import__("datetime").datetime.now().isoformat(),
+     "timestamp": datetime.now().isoformat(),

🏗️ #8: Double exclusion of benchmarks — marker filter AND --ignore

Location: pyproject.toml:68 + .github/workflows/ci.yml:21 | Confidence: ✅ HIGH

addopts = "-m 'not benchmark and not slow and not stress'" already excludes all benchmark tests from default pytest runs. The CI --ignore=tests/benchmarks is redundant. Having both is harmless but adds confusion about which mechanism is the source of truth.

The addopts marker approach is preferable — it also protects local pytest runs. The --ignore approach avoids even collecting the benchmark files (slightly faster). Pick one and document why.


🎨 #9: data_generator.py needle query extraction is fragile

Location: tests/benchmarks/data_generator.py:287-291 | Confidence: ⚠️ MED

The query extraction uses string splitting that depends on specific delimiters:

"query": topic.split(" uses ")[0]
    if " uses " in topic
    else topic.split(" set to ")[0]
    if " set to " in topic
    else topic[:60],

If a needle topic doesn't contain " uses " or " set to ", the query is just the first 60 chars, which may not be semantically distinct enough for recall measurement. Of 20 topics: 7 contain " uses ", 5 contain " set to ", and 8 fall through to the 60-char truncation path (40%). Consider adding explicit query strings per needle rather than relying on delimiter splitting.
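
The explicit-query alternative is simple to express (illustrative entries, not the generator's actual topics):

# Pair each planted fact with a hand-written query instead of relying
# on " uses " / " set to " delimiters being present in the topic text.
NEEDLES = [
    {
        "topic": "The auth service uses Argon2id for password hashing",
        "query": "which hashing algorithm does the auth service use",
    },
    {
        "topic": "The retry backoff cap is set to 30 seconds",
        "query": "what is the retry backoff cap",
    },
]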


Created by Octocode MCP https://octocode.ai

Merged both the PR's benchmark suite additions (psutil dep, pytest
markers, --ignore=tests/benchmarks) and upstream's coverage changes
(pytest-cov, --cov-fail-under=30, coverage config) so both coexist.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
bensig merged commit 9b705d6 into MemPalace:main Apr 8, 2026
4 checks passed
milla-jovovich mentioned this pull request Apr 9, 2026
