fix: mitigate system prompt contamination in search queries (#333) #385
Conversation
…e#333) Addresses Issue MemPalace#333: AI agents prepending system prompts to search queries causes embedding retrieval to collapse (89.8% → 1.0% R@10).

Mitigation approach (減災):
- New query_sanitizer.py with a 4-stage pipeline:
  - Step 1: passthrough for short queries (≤200 chars)
  - Step 2: question extraction (finds "?" sentences) → ~85-89% recovery
  - Step 3: tail sentence extraction → ~80-89% recovery
  - Step 4: tail truncation fallback → ~70-80% recovery
- Worst case without sanitizer: 1.0% (catastrophic); with sanitizer: ~70-80% (survivable)
- mcp_server.py: tool_search applies the sanitizer before the ChromaDB query
- MCP schema: the query description warns agents not to include prompts
- New 'context' parameter separates background info from search intent
- Sanitizer metadata is included in the response when triggered

22 new tests covering all pipeline stages and real-world scenarios.

Made-with: Cursor
@matrix9neonebuchadnezzar2199-sketch pls fix lint
bensig
left a comment
fix lint, ready to merge
web3guru888
left a comment
Real-world validation — we hit this exact cliff
We independently built a _isolate_query() function in our integration after #333 nearly silently killed our cross-domain discovery pipeline. 208 discoveries across 5 domains, and one morning our Orient phase (broad cross-domain sweeps) was returning garbage — similarity scores looked normal (~0.7-0.8) but every result was semantically wrong. It took hours to trace back to system prompt contamination because the scores don't collapse, they just become meaningless (the embedding is coherent, just representing the wrong text).
So: strong +1 on the 減災 philosophy. This is the right framing.
How your approach compares to ours
Our _isolate_query() does roughly the same thing but with different heuristics:
- We detect known system prompt signatures (e.g., "You are", "MEMORY.md", "## ", markdown headers) and strip everything before the last occurrence
- We have a character-ratio heuristic: if the query is >3× longer than the median query length for that session, assume contamination
- We use a newline-based split similar to your Step 3, but we look for the last segment that doesn't match known prompt patterns rather than just taking the tail
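For concreteness, a stripped-down sketch of those heuristics (the function name matches ours, but the signature list and thresholds here are illustrative, not our production code):

```python
import re
from statistics import median

# Illustrative prompt signatures; our production list is longer.
PROMPT_SIGNATURES = re.compile(r"^(You are\b|## |# |MEMORY\.md)")

def isolate_query(query: str, session_lengths: list[int]) -> str:
    """Strip likely system-prompt contamination from a search query."""
    # Ratio heuristic: only intervene when the query is more than 3x
    # the median query length for the session.
    if session_lengths and len(query) <= 3 * median(session_lengths):
        return query
    # Newline split: keep the last segment that does not look like a
    # known prompt pattern (prompts are prepended, so scan from the end).
    segments = [s.strip() for s in query.splitlines() if s.strip()]
    for segment in reversed(segments):
        if not PROMPT_SIGNATURES.match(segment):
            return segment
    return query  # every segment matched a signature; leave it alone
```

It works, but it is pattern-matching all the way down, which is why the staged pipeline in this PR appeals to us.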
Your 4-stage pipeline is more principled than our pattern-matching approach. A few observations:
What works well
- The passthrough gate (≤200 chars) — this is critical. Our telemetry shows ~85% of queries are under 100 chars. Zero overhead for the common case.
- Question mark extraction as Step 2 — simple and effective. In our experience, the actual user query almost always ends with `?` when it's a retrieval question. Your approach of scanning backwards (`reversed(all_segments)`) is the right direction since system prompts are prepended.
- The `context` parameter in the MCP schema — this is the real long-term fix. By giving agents a structured place to put background info, well-behaved agents (Claude 3.5+, GPT-4) will use it, and the `query` field stays clean. Defense in depth: schema guidance for cooperative agents + sanitizer for uncooperative ones.
- Sanitizer metadata in the response — `query_sanitized: true` plus the method is invaluable for debugging. We logged our `_isolate_query()` interventions and discovered that one of our OODA phases was contaminating 40% of queries. Without the metadata, we'd never have found it.
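The intervention logging mentioned above needs nothing fancy; a minimal sketch (class and phase names are made up):

```python
from collections import Counter

class SanitizerStats:
    """Track how often the sanitizer intervenes, per caller phase."""

    def __init__(self):
        self.total = Counter()
        self.sanitized = Counter()

    def record(self, phase: str, was_sanitized: bool) -> None:
        self.total[phase] += 1
        if was_sanitized:
            self.sanitized[phase] += 1

    def contamination_rate(self, phase: str) -> float:
        # Fraction of this phase's queries that triggered the sanitizer.
        if not self.total[phase]:
            return 0.0
        return self.sanitized[phase] / self.total[phase]
```

Feeding `query_sanitized` into something like this is how we caught the 40%-contamination phase.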
Edge cases from our experience
- Multi-line system prompts with `?` in them: We've seen system prompts that contain questions like "Are you sure you want to proceed?" or "What tools are available?". Your Step 2 scans backward and takes the last question, which helps, but if the system prompt ends with a rhetorical question after the real query (some agent frameworks append "Is there anything else?"), the sanitizer would extract the wrong sentence. We handle this with a blocklist of common rhetorical questions. Might be worth considering for a follow-up.
- Queries that are legitimately long: Research queries can be 300+ chars without contamination — e.g., "What are the thermodynamic constraints on photosynthetic efficiency in C4 plants under elevated CO2 concentrations and how do they compare to C3 pathways?" (168 chars, but domain-specific queries in our astrophysics wing can be longer). The 200-char `SAFE_QUERY_LENGTH` threshold means these hit the sanitizer unnecessarily. Not harmful (Step 2/3 will extract the right thing) but it generates false-positive `was_sanitized=true` metadata. A minor concern — just noting it.
- The `maxLength: 500` on the schema — some MCP clients silently truncate at `maxLength`. If a 600-char query gets truncated to 500 by the client before the sanitizer sees it, the tail (the actual query) may be cut off. Our approach is to not set `maxLength` on the schema and let the sanitizer handle length internally. Worth checking how Claude Desktop and other MCP clients handle `maxLength`.
Minor code notes
- `query_sanitizer.py` line ~90: the `_SENTENCE_SPLIT` regex splits on `.`, which will incorrectly split on decimal numbers (e.g., "R@10 dropped to 1.0% after contamination"). Unlikely to cause real issues since you then check `MIN_QUERY_LENGTH`, but worth noting.
- The `context` parameter is accepted but not used yet (`result["context_received"] = True` and nothing else). The PR description says "for future re-ranking" — it might be worth adding a TODO comment in the code so it's not forgotten.
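On the decimal-split note: a split pattern that only breaks on sentence punctuation followed by whitespace sidesteps the problem entirely. A sketch, assuming `_SENTENCE_SPLIT` is an ordinary `re` pattern:

```python
import re

# Lookbehind: split only where ., !, ?, or fullwidth ？ is followed by
# whitespace, so decimals like "1.0%" are never split apart.
SENTENCE_SPLIT = re.compile(r"(?<=[.!?？])\s+")

def split_sentences(text: str) -> list[str]:
    """Split text into sentences without breaking on decimal points."""
    return [s for s in SENTENCE_SPLIT.split(text) if s]
```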
Summary
This is a clean, well-tested mitigation for a genuinely nasty silent failure. The 4-stage pipeline with graceful fallback is better architecture than our pattern-matching approach — we'll likely refactor our `_isolate_query()` to match this structure. The `context` parameter is the right long-term direction.
Main suggestion: consider the `maxLength` client-side truncation risk. Everything else is solid.
… PRs

Resolves merge conflict in `mcp_server.py` by keeping our improvements (cached metadata, inode detection, WAL rotation, `max_distance`) and integrating upstream's `query_sanitizer` (MemPalace#385) and `context` parameter.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Closes #333
Summary
Mitigate the silent retrieval collapse caused by system prompt contamination in `mempalace_search` queries. This is a 減災 ("disaster mitigation") approach — not perfect prevention, but it prevents the catastrophic cliff.

Problem
When AI agents prepend system prompts (2000+ chars) to search queries, the embedding vector represents the system prompt instead of the actual question. Retrieval precision collapses from 89.8% to 1.0% R@10 — with no errors thrown and normal-looking scores. Every question type is affected; architecture and cross-reference queries drop to 0.0%.
This affects any MCP integration where the full conversation context reaches `mempalace_search`, which is the default behavior of most AI agents.

Solution: 4-Stage Sanitizer Pipeline
New `query_sanitizer.py` processes queries before they reach ChromaDB:

1. Passthrough: queries ≤200 chars are used as-is.
2. Question extraction: if the query contains `?`, extract the question sentence → near-full recovery (~85-89%).
3. Tail sentence extraction → ~80-89% recovery.
4. Tail truncation fallback → ~70-80% recovery.

Worst case drops from 1.0% to ~70-80% — the cliff is eliminated.
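Sketched in code, the pipeline is roughly as follows (constants and the return shape here are simplified; `query_sanitizer.py` is the source of truth):

```python
import re

SAFE_QUERY_LENGTH = 200   # Step 1 passthrough threshold
MIN_QUERY_LENGTH = 10     # reject tiny fragments (assumed value)
TAIL_CHARS = 200          # Step 4 fallback window (assumed value)

_SENTENCE_SPLIT = re.compile(r"(?<=[.!?？])\s+")

def sanitize_query(query: str) -> tuple[str, str]:
    """Return (sanitized_query, method) following the 4-stage pipeline."""
    query = query.strip()
    # Step 1: short queries pass through untouched.
    if len(query) <= SAFE_QUERY_LENGTH:
        return query, "passthrough"
    segments = [s.strip() for s in _SENTENCE_SPLIT.split(query) if s.strip()]
    # Step 2: prompts are prepended, so scan backwards for a question.
    for segment in reversed(segments):
        if segment.endswith(("?", "？")) and len(segment) >= MIN_QUERY_LENGTH:
            return segment, "question_extraction"
    # Step 3: fall back to the last sentence of usable length.
    for segment in reversed(segments):
        if MIN_QUERY_LENGTH <= len(segment) <= SAFE_QUERY_LENGTH:
            return segment, "tail_sentence"
    # Step 4: last resort, keep only the tail of the raw text.
    return query[-TAIL_CHARS:], "tail_truncation"
```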
MCP layer changes (`mcp_server.py`)

- `tool_search` applies `sanitize_query()` before passing to `search_memories()`
- New `context` parameter: agents can separate background info from search intent
- The `query` field schema includes `maxLength: 500`
- Responses include `query_sanitized: true` and a `sanitizer` metadata block for debugging

What is NOT changed
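For illustration, the relevant slice of the tool's input schema, with descriptions paraphrased rather than quoted from `mcp_server.py`:

```python
# Hypothetical rendering of the search tool's JSON Schema as a Python dict.
SEARCH_TOOL_SCHEMA = {
    "type": "object",
    "properties": {
        "query": {
            "type": "string",
            "maxLength": 500,
            "description": (
                "The search question ONLY. Do not include system prompts, "
                "conversation history, or background context here."
            ),
        },
        "context": {
            "type": "string",
            "description": (
                "Optional background information. Keep it out of 'query' "
                "so the embedding represents the actual question."
            ),
        },
    },
    "required": ["query"],
}
```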
- `searcher.py` / `search_memories()` — untouched. Sanitization is the MCP layer's responsibility; the search engine stays pure.
- `search()` — no contamination risk from direct user input.

Defense in depth
Testing
22 new tests in `tests/test_query_sanitizer.py` covering:

- Question extraction: `?`, Japanese `？`, multiple questions, question in system prompt

All existing tests pass (121 passed, 2 pre-existing Windows-only failures unrelated to this change).
Design note: 減災 (Mitigation)
This PR adopts a "disaster mitigation" philosophy rather than attempting full prevention. Complete prevention is impossible because MemPalace cannot control what AI agents put in the `query` parameter — the MCP protocol passes it as a plain string with no structural boundary between "system prompt" and "question."

Instead, we minimize damage: the sanitizer ensures that even contaminated queries produce usable results rather than silently returning wrong answers. The 4-stage pipeline degrades gracefully — each fallback stage recovers less precision but always stays far above the 1.0% cliff.
Related: #335 (MemPalace-AGI confirmed this issue affects their OODA pipeline)