
feat: add min_similarity threshold to search tool#350

Open
hkf57 wants to merge 4 commits into MemPalace:develop from hkf57:feat/min-similarity-threshold

Conversation


@hkf57 hkf57 commented Apr 9, 2026

What does this PR do?

Adds a min_similarity parameter to search() and search_memories() that filters out results below a given similarity score. Defaults to 0.0, which discards negative-similarity noise without affecting positive matches. Supports the full cosine similarity range (-1.0 to 1.0) with input validation.

Motivation: I ran a 30-question retrieval benchmark against a 10,000-drawer palace (200 Confluence ADR pages). Overreach queries (where no relevant content exists) averaged -0.023 similarity, while valid hits were consistently 0.05+. The current behaviour of always returning N results - even with negative scores - means consumers have no "nothing found" signal. The default of 0.0 eliminates the worst false positives automatically; callers wanting stricter filtering can raise it (e.g. min_similarity=0.1), or lower it to include anti-correlated results (e.g. min_similarity=-0.5).

Rebased onto develop after the base branch change from main. This is now a clean, minimal diff (62 insertions) on top of develop's hybrid search, closet ranking, and BM25 re-ranking - none of which is touched by this PR.

Changes:

  • searcher.py: search() and search_memories() accept min_similarity (default 0.0), skip results below threshold. Works alongside existing max_distance filtering. Input validation rejects values outside -1.0 to 1.0.
  • mcp_server.py: Schema updated with min_similarity property including minimum/maximum constraints. The existing tool_search() handler already accepts min_similarity and converts to max_distance; this adds direct filtering in the searcher for programmatic callers.
  • test_searcher.py: 6 new tests covering filtering, negative thresholds, out-of-range validation, and total_before_filter. Fixed two flaky upstream test queries (test_result_fields, test_search_n_results) that used vague queries producing negative similarities against the small 4-doc test corpus.

Follows the same pattern as tool_check_duplicate, which already uses a similarity threshold.
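The filtering behaviour described above can be sketched as follows. This is an illustration of the contract stated in this PR description, not the actual diff; the helper name and result shape are hypothetical:

```python
def filter_by_min_similarity(docs, dists, min_similarity=0.0):
    """Keep only results whose cosine similarity (1 - distance) meets the threshold."""
    if not (-1.0 <= min_similarity <= 1.0):
        raise ValueError(
            f"min_similarity must be between -1.0 and 1.0, got {min_similarity}"
        )
    hits = []
    for doc, dist in zip(docs, dists):
        similarity = round(1 - dist, 3)  # ChromaDB cosine distance -> similarity
        if similarity < min_similarity:
            continue  # below threshold: skipped, per the behaviour described above
        hits.append({"content": doc, "similarity": similarity})
    return hits

# Distances 0.95 and 1.023 give similarities 0.05 and -0.023, the averages
# from the benchmark above. The default threshold keeps only the valid hit.
hits = filter_by_min_similarity(["valid hit", "overreach noise"], [0.95, 1.023])
print([h["content"] for h in hits])  # ['valid hit']
```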

How to test

python -m pytest tests/test_searcher.py -v

23 tests pass (811/811 full suite). The 6 new tests cover:

  • test_min_similarity_filters_low_scores - threshold of 0.99 returns empty results
  • test_min_similarity_keeps_high_scores - threshold just below top score preserves results
  • test_min_similarity_default_filters_negatives - default 0.0 guarantees no negative scores
  • test_min_similarity_negative_threshold - negative threshold includes anti-correlated content
  • test_min_similarity_out_of_range_returns_error - values outside -1.0 to 1.0 return error
  • test_total_before_filter - response shows how many results existed before filtering

Checklist

  • Tests pass (python -m pytest tests/ -v - 811 passed)
  • No hardcoded paths
  • Linter passes (ruff check .)
  • Clean rebase onto develop (no merge noise)


@web3guru888 web3guru888 left a comment


This directly addresses something we've been navigating in our integration. We run mempalace_search at different OODA phases with different quality requirements — Orient phase wants breadth (accepts lower similarity), Evaluate phase wants precision (rejects noise). Right now we handle this entirely in post-processing, discarding results below phase-specific thresholds after they come back from the search tool. A min_similarity parameter pushes that filtering to the right layer and avoids returning garbage to the MCP consumer in the first place.

The 0.0 default is the right call. Negative cosine similarity results are never useful — they're anti-correlated content — and silently returning them to callers contributes to exactly the class of silent wrong-answer bugs documented in #333.

A few thoughts on the implementation:

The searcher.py change looks clean. One edge case worth testing: when min_similarity filters all results, does search() return an empty list or raise? Empty list is the right behavior (caller can distinguish "nothing found" from an error), and from the PR description it sounds like that's what happens — but worth an explicit test case if not already covered.

On the MCP schema update: would it be worth adding a brief description string to the min_similarity property so LLM tool-callers understand what it controls? Something like "description": "Filter out results below this cosine similarity score. Default 0.0 removes negative-similarity noise." — models are sensitive to property descriptions when deciding whether to pass a parameter.
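A sketch of what that property might look like with the suggested description, written as a Python dict for illustration. Only the parameter name, bounds, and the quoted description come from this thread; the bounds shown are the [-1.0, 1.0] range the PR settled on, and the surrounding key set is assumed:

```python
# Hypothetical schema property; the description string is the reviewer's wording.
min_similarity_schema = {
    "type": "number",
    "minimum": -1.0,
    "maximum": 1.0,
    "description": (
        "Filter out results below this cosine similarity score. "
        "Default 0.0 removes negative-similarity noise."
    ),
}

def bounds_ok(value, prop):
    # Minimal check mirroring JSON Schema's minimum/maximum keywords.
    return prop["minimum"] <= value <= prop["maximum"]

print(bounds_ok(0.1, min_similarity_schema))  # True
print(bounds_ok(50, min_similarity_schema))   # False
```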

The benchmark numbers (overreach queries averaging -0.023, valid hits 0.05+) are a useful data point. We've seen similar distributions in our ChromaDB collections — the gap between noise and signal tends to be small in absolute terms but consistent enough that a fixed threshold works well in practice.

Overall this is a clean, well-motivated change.

Perseusxrltd added a commit to Perseusxrltd/mnemion that referenced this pull request Apr 9, 2026
- pyproject.toml: widen chromadb to <2.0 for Python 3.14 compat (MemPalace#302)
- config.py + miner/convo_miner/mcp_server: add hnsw:space=cosine so
  similarity = 1 - distance stays in [0,1] instead of negative L2 (MemPalace#304)
- searcher.py + layers.py: guard against ChromaDB 1.x empty-outer query
  results (IndexError on fresh collections) (MemPalace#305)
- mcp_server.py: redirect stdout→stderr at import to protect JSON-RPC
  wire from chromadb/posthog chatter (MemPalace#306)
- mcp_server.py: replace 10k-limited col.get with paginating
  _iter_all_metadatas helper; stop swallowing errors silently (MemPalace#307)
- mcp_server.py: drop undocumented wait_for_previous arg injected by
  Gemini MCP clients (MemPalace#322)
- searcher.py + hybrid_searcher.py + mcp_server.py: add min_similarity
  threshold filter so callers get a clean "no results" signal (MemPalace#350)
- entity_detector.py: add CODE_KEYWORDS blocklist (~80 terms) to stop
  Rust types / React / framework names being detected as entities (MemPalace#349)

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

bgauryy commented Apr 9, 2026

PR Review: feat: add min_similarity threshold to search tool

Executive Summary

Aspect          Value
PR Goal         Add min_similarity parameter to filter low-confidence search results
Files Changed   3
Risk Level      MEDIUM — changes default search behavior for all callers
Review Mode     Quick
Review Effort   2/5
Recommendation  REQUEST_CHANGES

Affected Areas: mempalace/searcher.py (core search), mempalace/mcp_server.py (MCP interface), tests/test_searcher.py

Business Impact: Improves search quality by filtering noise, but silently changes return behavior for existing integrations (MCP consumers, benchmarks, CLI)

Flow Changes: search_memories() and search() now post-filter ChromaDB results against min_similarity before returning. Default 0.0 silently strips negative-similarity results that were previously returned.

Ratings

Aspect           Score
Correctness      3/5
Security         5/5
Performance      5/5
Maintainability  4/5

PR Health

  • Has clear description (excellent motivation with benchmark data)
  • References ticket/issue — N/A (standalone feature)
  • Appropriate size (47 lines, well-scoped)
  • Has relevant tests (3 new tests + 1 updated)

High Priority Issues

Bug #1: No input validation on min_similarity — invalid values silently return zero results

Location: mempalace/searcher.py:21 and mempalace/searcher.py:100 | Confidence: HIGH

The MCP schema describes min_similarity as "0-1" but neither the schema nor the Python code enforces any range. An MCP consumer passing min_similarity=50 (or any value > 1.0) silently receives zero results with no error or warning. This is especially dangerous in an MCP context where the caller is an AI agent that may not understand why results disappeared.

 def search(query: str, palace_path: str, wing: str = None, room: str = None, n_results: int = 5, min_similarity: float = 0.0):
     """
     Search the palace. Returns verbatim drawer content.
     Optionally filter by wing (project) or room (aspect).
     """
+    if not (0.0 <= min_similarity <= 1.0):
+        raise ValueError(f"min_similarity must be between 0.0 and 1.0, got {min_similarity}")

Same fix needed in search_memories (returns error dict to match existing error pattern):

 def search_memories(
     query: str, palace_path: str, wing: str = None, room: str = None, n_results: int = 5,
     min_similarity: float = 0.0,
 ) -> dict:
+    if not (0.0 <= min_similarity <= 1.0):
+        return {"error": f"min_similarity must be between 0.0 and 1.0, got {min_similarity}"}

Bug #2: Default 0.0 is a silent breaking change — existing callers lose negative-similarity results

Location: mempalace/searcher.py:21 and mempalace/searcher.py:100 | Confidence: HIGH

Before this PR, search_memories("query", palace_path) returned ALL results including those with negative similarity. After this PR, the same call silently drops negative-similarity results. This is a semantic breaking change hidden behind a default parameter.

Evidence within the PR itself: The existing test_result_fields test had to be modified — its query was changed from "authentication" to "JWT authentication tokens session management" and a new assert len(result["results"]) > 0 guard was added, because the original vague query no longer guaranteed positive-similarity results under the new default.

Blast radius — 7 callers reference search_memories:

  • mempalace/mcp_server.py — updated (OK)
  • tests/test_searcher.py — updated (OK)
  • tests/benchmarks/test_palace_boost.py — NOT updated, measures recall
  • tests/benchmarks/test_search_bench.py — NOT updated, measures timing
  • tests/benchmarks/test_memory_profile.py — NOT updated, measures RSS
  • tests/benchmarks/test_recall_threshold.py — NOT updated, measures recall@5 and recall@10
  • README.md — example code (cosmetic)

The benchmark tests that measure recall are most at risk: if any needle queries previously matched with negative similarity (unlikely but possible with adversarial data), recall numbers silently change.

Recommended fix: Either (a) keep default as None and only filter when explicitly set, preserving backward compatibility:

 def search_memories(
     query: str, palace_path: str, wing: str = None, room: str = None, n_results: int = 5,
-    min_similarity: float = 0.0,
+    min_similarity: float = None,
 ) -> dict:
     ...
     for doc, meta, dist in zip(docs, metas, dists):
         similarity = round(1 - dist, 3)
-        if similarity < min_similarity:
+        if min_similarity is not None and similarity < min_similarity:
             continue

Or (b) keep 0.0 as default but explicitly document the behavior change and update the benchmark tests to pass min_similarity=-1.0 to preserve old behavior.


Medium Priority Issues

Architecture #3: Post-filter can return fewer results than limit with no indication to caller

Location: mempalace/searcher.py:145-158 | Confidence: HIGH

ChromaDB is asked for n_results items, then min_similarity post-filters. If a user requests limit=5 with min_similarity=0.3, they may receive 0-4 results with no indication that filtering occurred. The return dict has no field to communicate "5 were fetched, 3 were filtered".

This matters because MCP consumers (AI agents) may interpret an empty result list as "no relevant content exists" when in fact relevant content exists just below the threshold.

     return {
         "query": query,
         "filters": {"wing": wing, "room": room},
         "results": hits,
+        "total_before_filter": len(docs),
+        "min_similarity": min_similarity,
     }

Code Quality #4: Filtering logic duplicated across search() and search_memories()

Location: mempalace/searcher.py:74-77 and mempalace/searcher.py:145-147 | Confidence: MED

Both search() (CLI) and search_memories() (programmatic) independently implement the same if similarity < min_similarity: continue filter with identical round(1 - dist, 3) conversion. If the filtering formula or rounding changes, both must be updated in sync. Consider extracting the similarity computation and filter into a shared helper.

Not blocking — the duplication is small and the two functions have different output formats (print vs dict).


Code Quality #5: MCP schema lacks minimum/maximum constraints

Location: mempalace/mcp_server.py:626-629 | Confidence: MED

The JSON Schema definition for min_similarity says "0-1" in the description text but doesn't use JSON Schema's built-in minimum/maximum keywords. MCP clients that validate inputs against the schema won't catch out-of-range values.

 "min_similarity": {
     "type": "number",
     "description": "Minimum similarity threshold 0-1 (default 0.0, discarding negative scores). Results below this are omitted.",
+    "minimum": 0.0,
+    "maximum": 1.0,
 },

Flow Impact Analysis

MCP Consumer ──► tool_search(min_similarity=X)
                    │
                    ▼
              search_memories(min_similarity=X)
                    │
                    ├──► ChromaDB.query(n_results=N)  ← fetches N results
                    │
                    ▼
              POST-FILTER: similarity < min_similarity → skip
                    │
                    ▼
              Return 0..N hits  ← caller may receive < N results

CLI User ────► cmd_search(args)
                    │
                    ▼
              search(min_similarity=0.0)  ← CLI has no --min-similarity flag
                    │
                    ▼
              Same post-filter logic, prints to stdout

Callers NOT updated in this PR that use the default:

Caller                  File                                        Risk
cmd_search              mempalace/cli.py                            LOW — CLI gets default 0.0, no user control
Recall benchmark        tests/benchmarks/test_recall_threshold.py   MED — recall measurement may shift
Palace boost benchmark  tests/benchmarks/test_palace_boost.py       MED — recall measurement may shift
Search benchmark        tests/benchmarks/test_search_bench.py       LOW — timing unaffected
Memory profiler         tests/benchmarks/test_memory_profile.py     LOW — RSS unaffected

Created by Octocode MCP https://octocode.ai


hkf57 commented Apr 9, 2026

Addressed the review feedback in 1dd1e2c:

Bug #1 (input validation): Added range validation for min_similarity. Accepts -1.0 to 1.0 (the full cosine similarity range). search() raises ValueError; search_memories() returns an error dict to match existing error patterns. Negative thresholds are supported for callers who want to include anti-correlated content.

Architecture #3 (filtered count): Response dict now includes total_before_filter so callers can distinguish "nothing matched the query" from "matches existed but fell below the threshold".

Code Quality #5 (schema constraints): Added minimum: -1.0 and maximum: 1.0 to the JSON Schema definition.

On Bug #2 (default 0.0 as breaking change): Keeping 0.0 as the default. Negative cosine similarity results are anti-correlated content; returning them is the current bug, not a feature. The benchmark data is clear: overreach queries average -0.023, valid hits are 0.05+. The total_before_filter field gives callers full visibility into what was filtered. Benchmark tests are excluded from default pytest runs via markers and don't call search_memories directly.
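The distinction that total_before_filter enables can be shown with a small sketch; the response shapes below are inferred from the fields named in this thread, not copied from the code:

```python
# Hypothetical consumer-side interpretation of a search response.
def interpret(resp):
    if resp["results"]:
        return "hits"
    if resp["total_before_filter"] > 0:
        return "matches existed but fell below the threshold"
    return "nothing matched the query"

print(interpret({"results": [], "total_before_filter": 5}))
# matches existed but fell below the threshold
print(interpret({"results": [], "total_before_filter": 0}))
# nothing matched the query
```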

41/41 tests pass, ruff clean. 3 new tests added for negative thresholds, out-of-range validation, and total_before_filter.

@bensig bensig changed the base branch from main to develop April 11, 2026 22:22
Rebased cleanly onto develop after base branch change from main.

- Add min_similarity param (default 0.0) to search() and search_memories()
- Input validation: must be between -1.0 and 1.0
- Default 0.0 automatically filters negative similarities
- Works alongside existing max_distance filtering
- Add min_similarity to MCP schema with bounds
- Fix flaky test queries for small corpus compatibility
- 6 new tests covering threshold filtering, edge cases, validation
@hkf57 hkf57 force-pushed the feat/min-similarity-threshold branch from df48bfb to 8002e7e Compare April 13, 2026 22:29
@igorls igorls added area/mcp MCP server and tools area/search Search and retrieval enhancement New feature or request labels Apr 14, 2026
…rity-threshold

# Conflicts:
#	mempalace/searcher.py
@Qodo-Free-For-OSS

Hi, search() and search_memories() clamp similarity to >=0.0, so anti-correlated (negative) cosine similarities can never be returned or filtered-in, contradicting the new API contract and tool schema that claim support for [-1.0, 1.0].

Severity: action required | Category: correctness

How to fix: Remove clamp or change contract

Agent prompt to fix - you can give this to your LLM of choice:

Issue description

Similarity is clamped to [0,1], but the PR claims similarity supports [-1,1] and that negative thresholds can include anti-correlated results.

Fix Focus Areas

  • mempalace/searcher.py[286-290]
  • mempalace/searcher.py[425-435]
  • mempalace/mcp_server.py[1371-1376]

What to change

  • Pick ONE consistent contract:
    • If you want true cosine similarity range [-1,1]: compute raw_similarity = 1 - dist (or 1 - effective_dist) without max(0.0, ...), then filter/return that value.
    • If you want similarity in [0,1] only: keep the clamp, but tighten validation/docs/schema to min_similarity in [0,1] and remove claims about negative similarity support.
  • Update tests accordingly (especially any tests implying negative similarity support).
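The two contracts can be contrasted in a short sketch, assuming ChromaDB's cosine distance lies in [0, 2]:

```python
def similarity(dist, clamp=False):
    """Convert cosine distance to similarity, optionally floored at 0.0."""
    sim = 1 - dist
    return max(0.0, sim) if clamp else sim

# An anti-correlated result: distance 1.5, true cosine similarity -0.5.
print(similarity(1.5, clamp=True))   # 0.0, floored: passes any negative threshold
print(similarity(1.5, clamp=False))  # -0.5: correctly removed by min_similarity=0.0
```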

We noticed a couple of other issues in this PR as well - happy to share if helpful.


Spotted by Qodo code review - free for open-source projects.

Clamped similarity to [0, 1] contradicted the documented [-1.0, 1.0]
contract — anti-correlated results were floored to 0.0, passing any
negative threshold but indistinguishable from a neutral match.

hkf57 commented Apr 21, 2026

Thanks — fixed in the latest push. Dropped the max(0.0, ...) clamp at both sites in searcher.py, so similarity is now the true cosine value in [-1.0, 1.0]. Validation, schema, and docs were already consistent with the negative-threshold contract; the clamp was the odd one out. All 27 searcher tests pass, including the negative-threshold case which now actually exercises negatives end-to-end.

Happy to hear the other issues you spotted.

Author

hkf57 commented May 6, 2026

Friendly bump - the bot review feedback from 2026-04-21 was addressed in 1dd1e2c (dropped the max(0.0, ...) clamp at both searcher.py sites so similarity is the true cosine value in [-1.0, 1.0]). Branch is current against develop. Let me know if there is anything blocking review or further changes you would like.

Resolves four conflicts in mempalace/searcher.py around the BM25
hybrid rerank, closet-boost effective-distance clamp, and the
min_similarity threshold added by this PR:

* keep develop's _warn_if_legacy_metric helper alongside the new
  min_similarity parameter on search()
* in the CLI display loop, use develop's hits/bm25 iteration shape
  but apply this PR's threshold filter and unclamped vec_sim so
  negative thresholds remain meaningful
* keep develop's effective_dist clamp [0, 2] (closet-boost-overshoot
  fix) but compute and report the unclamped similarity so the filter
  contract still holds
* extract _validate_min_similarity (raises) and _drop_below_threshold
  helpers, and split search_memories into a wrapper + impl so the
  validation and post-rerank filter live outside the inner pipeline
  without breaching the cyclomatic-complexity ceiling
* apply the threshold to the post-rerank result list rather than the
  pre-rerank candidate pool, so legitimate BM25 lexical lifts are not
  dropped before they get a chance to be reranked
* update the develop BM25 regression test to use distance=1.0 (not
  1.5), matching its stated intent of cosine=0 in the now-unclamped
  display

Full pytest suite (1558 tests) passes.
