bench: add scale benchmark suite (106 tests) #223
Conversation
Benchmark mempalace at configurable scale (1K–100K drawers) to find real-world performance limits. Tests cover MCP tool OOM thresholds, ChromaDB query degradation, search recall@k, mining throughput, knowledge graph concurrency, memory leak detection, palace boost quantification, and Layer1 unbounded fetch behavior.

- tests/benchmarks/ with 8 test modules + data generator + report system
- Deterministic data factory with planted needles for recall measurement
- JSON report output with regression detection (--bench-report flag)
- CI benchmark job on PRs at small scale
- psutil added as dev dependency for RSS tracking
Concentrates all drawers into a single wing+room to isolate the embedding model's retrieval limit independent of palace filtering. Confirms recall degrades to ~0.4-0.5 at 5K drawers per room even with wing+room filters applied — the spatial structure helps by keeping buckets small, but can't fix the underlying embedding ceiling.
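For context, the recall measurement this describes can be sketched as a loop over planted needles (a hypothetical illustration, not this suite's code; `collection` is assumed to be a populated ChromaDB collection and `needles` a mapping from probe query to the planted drawer's ID):

```python
# Hypothetical recall@k measurement against planted needles.
def recall_at_k(collection, needles: dict[str, str], k: int = 10) -> float:
    hits = 0
    for query, needle_id in needles.items():
        # ChromaDB returns results nested per query; we sent one query.
        res = collection.query(query_texts=[query], n_results=k)
        if needle_id in res["ids"][0]:
            hits += 1
    return hits / len(needles)
```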
Remove unused imports (shutil, string, datetime, os, yaml, time, SCALE_CONFIGS) and unused variable assignments in timing-only calls.
Pull request overview
This PR introduces a dedicated scale/performance benchmark suite to measure MemPalace behavior as data volumes grow, and wires it into CI (PR-only) with JSON reporting support.
Changes:
- Added a new `tests/benchmarks/` suite (multiple modules) covering search latency/recall, MCP tool behavior, ChromaDB stress, ingestion throughput, memory/RSS profiling, layers/wake-up cost, and KG concurrency/perf.
- Added a deterministic benchmark data generator with planted "needles" plus JSON metric recording and report/regression utilities.
- Updated packaging/CI to support benchmarks (adds `psutil` to dev deps, registers pytest markers, and adds a PR-only benchmark GitHub Actions job).
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 22 comments.
| File | Description |
|---|---|
| tests/benchmarks/test_search_bench.py | Search latency/recall/concurrency and n_results scaling benchmarks |
| tests/benchmarks/test_recall_threshold.py | Single-room “bucket size” recall ceiling benchmark |
| tests/benchmarks/test_palace_boost.py | Quantifies recall/latency impact of wing/room filtering (“palace boost”) |
| tests/benchmarks/test_memory_profile.py | RSS and heap snapshot benchmarks for high-risk paths |
| tests/benchmarks/test_mcp_bench.py | MCP tool latency/RSS benchmarks and client re-instantiation cost |
| tests/benchmarks/test_layers_bench.py | Wake-up/layer performance and token budget checks |
| tests/benchmarks/test_knowledge_graph_bench.py | KG throughput/latency/temporal correctness and SQLite concurrency benchmarks |
| tests/benchmarks/test_ingest_bench.py | Mining/chunking throughput, peak RSS, and re-ingest skip overhead benchmarks |
| tests/benchmarks/test_chromadb_stress.py | ChromaDB stress/degradation + bulk insert comparisons |
| tests/benchmarks/data_generator.py | Seeded data factory + planted needles and KG triple generation |
| tests/benchmarks/report.py | Metric recording and regression comparison utilities |
| tests/benchmarks/conftest.py | Benchmark pytest options/markers/report writing hook |
| tests/benchmarks/README.md | Benchmark suite documentation and run instructions |
| tests/benchmarks/`__init__.py` | Benchmark package marker |
| pyproject.toml | Adds psutil to dev deps + pytest markers/pythonpath |
| .github/workflows/ci.yml | Runs unit tests excluding benchmarks; adds PR-only benchmark job + artifact upload |
Comments suppressed due to low confidence (1)
tests/benchmarks/test_knowledge_graph_bench.py:281
`result` is assigned but never used; CI runs `ruff check .` so this will fail with F841. Remove the assignment or use the result (e.g., assert it's a dict with expected keys).
```python
elapsed_ms = (time.perf_counter() - start) * 1000
latencies.append(elapsed_ms)
avg_ms = sum(latencies) / len(latencies)
record_metric("kg_stats", f"avg_ms_at_{n_triples}", round(avg_ms, 2))
```
```python
import json
import os
import tempfile
```
`datetime` is imported but never used; `ruff check .` in CI will fail with F401. Remove the unused import (or use it if intended).
```python
higher_is_worse = {
    "latency", "rss", "memory", "oom", "lock_failures", "elapsed",
    "p50_ms", "p95_ms", "p99_ms", "rss_delta_mb", "peak_rss_mb",
}
# Metrics where LOWER is worse (throughput, recall, etc.)
lower_is_worse = {
    "recall", "throughput", "per_sec", "files_per_sec", "drawers_per_sec",
    "triples_per_sec", "improvement",
}

for category in baseline.get("results", {}):
    if category not in current.get("results", {}):
        continue
    for metric, base_val in baseline["results"][category].items():
        if metric not in current["results"][category]:
            continue
        curr_val = current["results"][category][metric]
        if not isinstance(base_val, (int, float)) or not isinstance(curr_val, (int, float)):
            continue
        if base_val == 0:
            continue

        # Determine direction
        is_latency_like = any(kw in metric.lower() for kw in higher_is_worse)
        is_throughput_like = any(kw in metric.lower() for kw in lower_is_worse)

        if is_latency_like:
            # Higher is worse — check if current exceeds baseline by threshold
            if curr_val > base_val * (1 + threshold):
                pct = ((curr_val - base_val) / base_val) * 100
                regressions.append(
                    f"{category}/{metric}: {base_val:.2f} -> {curr_val:.2f} ({pct:+.1f}%, threshold {threshold*100:.0f}%)"
                )
        elif is_throughput_like:
            # Lower is worse — check if current is below baseline by threshold
```
`check_regression()` classifies metrics by substring matching. A metric like `latency_improvement_pct` matches both "latency" (treated as higher-is-worse) and "improvement" (higher-is-better), but the current logic prioritizes the latency branch, so improved latency can be flagged as a regression. Consider an explicit per-metric direction map (or precedence rules), or rename/namespace improvement metrics so they don't include "latency" in the key.
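One shape such a fix could take (a sketch, not the PR's code; the map entry and keyword lists are illustrative):

```python
# Explicit per-metric direction map consulted first; substring matching is
# only a fallback, and higher-is-better keywords are checked before
# higher-is-worse ones so "latency_improvement_pct" is not captured by the
# "latency" substring.
EXPLICIT_DIRECTION = {
    "latency_improvement_pct": "higher_is_better",  # illustrative entry
}

def metric_direction(metric: str) -> str:
    if metric in EXPLICIT_DIRECTION:
        return EXPLICIT_DIRECTION[metric]
    name = metric.lower()
    if any(kw in name for kw in ("improvement", "recall", "throughput", "per_sec")):
        return "higher_is_better"
    if any(kw in name for kw in ("latency", "rss", "memory", "elapsed", "_ms")):
        return "higher_is_worse"
    return "unknown"
```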
```python
import json
import os
import tempfile

import pytest
```
`shutil` is imported but never used; this will fail `ruff check .` (F401). Remove the unused import.
```python
# The results are written by individual tests via bench_results fixture
import platform
import subprocess

try:
    git_sha = subprocess.check_output(
```
The comments in `pytest_terminal_summary` say results are written via the `bench_results` fixture, but the implementation actually reads the temp JSON file written by `record_metric()`. This mismatch is confusing for future maintenance; either wire tests to use `bench_results` or update/remove the stale comments/fixture.
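If the fixture route were chosen, the wiring could look roughly like this (a hypothetical sketch of a conftest.py, not the PR's implementation, which reads `record_metric()`'s temp JSON instead):

```python
# Session-scoped store that benchmark tests write into, read back by the
# terminal-summary hook — replacing the temp-JSON indirection.
import pytest

_BENCH_RESULTS: dict[str, dict] = {}

@pytest.fixture(scope="session")
def bench_results():
    return _BENCH_RESULTS

def pytest_terminal_summary(terminalreporter, exitstatus, config):
    if _BENCH_RESULTS:
        terminalreporter.write_line(f"bench categories: {sorted(_BENCH_RESULTS)}")
```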
```python
import hashlib
import os
import random
from datetime import datetime, timedelta
from pathlib import Path
```
`string` is imported but never used; this will fail `ruff check .` (F401). Remove the unused import.
```python
start = time.perf_counter()
kg.timeline()
elapsed_ms = (time.perf_counter() - start) * 1000
latencies.append(elapsed_ms)

avg_ms = sum(latencies) / len(latencies)
```
`result` is assigned but never used; CI runs `ruff check .` so this will fail with F841. Remove the assignment or use the result (e.g., assert it returns <= 100 items).
```python
result_march = kg.query_entity("Alice", as_of="2024-03-15")
# Query Alice as of September 2024 — should find ProjectB
result_sept = kg.query_entity("Alice", as_of="2024-09-15")
```
`project_names` is computed but never used; CI runs `ruff check .` so this will fail with F841. Remove it or use it for a basic correctness assertion (e.g., that ProjectA appears for the March query).
```python
march_project_names = {
    item.get("object")
    for item in result_march
    if isinstance(item, dict) and item.get("predicate") == "works_on"
} if isinstance(result_march, list) else set()
sept_project_names = {
    item.get("object")
    for item in result_sept
    if isinstance(item, dict) and item.get("predicate") == "works_on"
} if isinstance(result_sept, list) else set()
assert "ProjectA" in march_project_names
assert "ProjectB" in sept_project_names
```
```toml
[tool.pytest.ini_options]
testpaths = ["tests"]
pythonpath = ["."]
markers = [
    "benchmark: scale/performance benchmark tests",
    "slow: tests that take more than 30 seconds",
    "stress: destructive scale tests (100K+ drawers)",
]
```
Adding 106 benchmark tests under tests/ means a plain `pytest` run will now execute long-running benchmarks unless users remember to filter/ignore them. Consider setting a default `addopts` to skip benchmark/slow/stress markers (and require `-m benchmark` to run them), or moving benchmarks outside the default test discovery path.
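Concretely, something like the following in pyproject.toml would make plain `pytest` skip the benchmarks while `pytest -m benchmark` still opts in, since a later `-m` on the command line overrides the one from `addopts` (a sketch of the suggestion, not the PR's final config):

```toml
[tool.pytest.ini_options]
addopts = '-m "not benchmark and not slow and not stress"'
```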
```python
kg = KnowledgeGraph(db_path=str(tmp_path / "kg.sqlite3"))

# Create a hub entity connected to many others
```
`gen` is assigned but never used; CI runs `ruff check .` so this will fail with F841. Remove the unused variable assignment (or use it if you intended to generate additional data).
```python
for i in range(100):
    kg.add_entity(f"Entity_{i}", "concept")
```
`gen` is assigned but never used; CI runs `ruff check .` so this will fail with F841. Remove the unused variable assignment.
- Run `ruff format` on all benchmark files (fixes CI lint job)
- Fix `check_regression()` substring ambiguity: ordered keyword matching so `latency_improvement_pct` is correctly classified as higher-is-better
- Update stale comments in conftest.py referencing the wrong fixture
- Add pytest `addopts` to skip benchmark/slow/stress markers by default
Too heavy for CI (~2h per run). Benchmarks can be run locally with: `pytest -m benchmark --bench-scale=small --bench-report=results.json`
PR Review: bench: add scale benchmark suite (106 tests)

Executive Summary
Affected Areas:

Business Impact: Enables data-driven performance decisions. Key findings already surfaced (recall ceiling at ~1K drawers/bucket, client re-instantiation overhead, unbounded metadata fetch).

Flow Changes: No production flow changes. Adds pytest markers, CLI options, and a PR-only CI benchmark job.

Ratings
PR Health
High Priority Issues (Must fix before merge)

🐛 #1:
Merged both the PR's benchmark suite additions (`psutil` dep, pytest markers, `--ignore=tests/benchmarks`) and upstream's coverage changes (`pytest-cov`, `--cov-fail-under=30`, coverage config) so both coexist. Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Summary
- `tests/benchmarks/` (9 test modules, 106 tests)
- JSON report output with regression detection (`--bench-report` flag)
- `psutil` added as dev dependency for RSS tracking

Key findings from initial runs
- `search_memories()` RSS growth dropped from 43 MB/100 calls to 0.09 MB/100 calls after PR "fix: add limit=10000 safety cap to all unbounded ChromaDB .get() calls" #137 (see the sketch after the test plan).
- MCP tools instantiate a new `PersistentClient` per call.
- `tool_status()` and `Layer1.generate()` load all metadata linearly (57ms + 2.7MB at 5K drawers).

Test plan
- `uv run pytest tests/benchmarks/ -v --bench-scale=small` — 106 passed, 0 failed
- `uv run pytest tests/ -v --ignore=tests/benchmarks` — existing 20 tests unaffected
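For reference, the unbounded-fetch fix cited in the first finding amounts to passing an explicit ceiling to ChromaDB's `Collection.get()` (a minimal sketch of the pattern, with the cap value taken from the #137 PR title; the actual patch may differ):

```python
# Cap unbounded ChromaDB fetches so metadata scans stay bounded in RAM.
MAX_FETCH = 10_000  # cap from the #137 PR title

def get_all_metadata(collection):
    # include= restricts the payload to metadatas (no embeddings/documents)
    return collection.get(limit=MAX_FETCH, include=["metadatas"])
```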