
System prompt context prepended to queries drops retrieval from 89.8% to 1.0% #333

@cogpros

Description

What happens

Prepending system prompt content (wake-up context, MEMORY.md, or any persistent context block) to embedding queries before ChromaDB search causes near-total retrieval failure. No errors thrown. Results come back. They're just wrong.

Numbers

Tested on a 401-question benchmark (6 question types, 678 corpus documents) using all-MiniLM-L6-v2 embeddings in ChromaDB.

| Config | R@1 | R@5 | R@10 |
| --- | --- | --- | --- |
| Baseline (no system prompt) | 63.3% | 84.3% | 89.8% |
| 2000 chars prepended to query | 0.5% | 0.5% | 1.0% |

Per-type breakdown at R@10 with system prompt prepended:

| Type | Baseline | With system prompt |
| --- | --- | --- |
| architecture (n=56) | 98.2% | 0.0% |
| cross-reference (n=80) | 97.5% | 0.0% |
| rule-recall (n=91) | 91.2% | 1.1% |
| preference-feedback (n=70) | 85.7% | 1.4% |
| project-state (n=100) | 83.0% | 2.0% |
| temporal (n=4) | 25.0% | 0.0% |

Every question type collapses. Architecture and cross-reference go to zero.

Why it happens

The embedding model represents the concatenated string as a single vector. 2000 chars of system prompt overwhelms the actual question (typically 10-50 chars). The resulting vector represents the system prompt, not the query. ChromaDB returns the nearest neighbors to the system prompt, not to the question.
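
For concreteness, here is a minimal sketch of the failing query path against ChromaDB. The collection name, file path, and question are hypothetical placeholders, and it assumes a collection already populated with the corpus:

```python
# Sketch of the failing query path (hypothetical names; assumes an
# existing ChromaDB collection already populated with the corpus).
import chromadb

client = chromadb.PersistentClient(path="./chroma")
collection = client.get_collection("memories")

wake_up_context = open("system-prompt.md").read()   # ~2000 chars of persistent context
question = "Which module owns the retry logic?"     # 10-50 chars

# The whole context block is embedded together with the question, so the
# query vector ends up representing the system prompt, not the question.
results = collection.query(
    query_texts=[wake_up_context + "\n" + question],
    n_results=10,
)
```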

Direct proof (cosine similarity)

Embedding a single question, the system prompt alone, and the system prompt + question concatenated:

cosine(clean_question, dirty_question)  = 0.4059
cosine(system_prompt,  dirty_question)  = 1.0000
cosine(clean_question, system_prompt)   = 0.4059

The prepended query vector is identical to the system prompt vector. The question contributes nothing.
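
A minimal sketch of that measurement with sentence-transformers (placeholder question and file path; embeddings are normalized so the dot product is the cosine):

```python
# Reproduces the three cosines above (placeholder strings; the dot product
# of normalized embeddings equals cosine similarity).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
question = "Which module owns the retry logic?"
system_prompt = open("system-prompt.md").read()        # ~2000 chars
dirty = system_prompt + "\n" + question

clean_q, sys_p, dirty_q = model.encode(
    [question, system_prompt, dirty], normalize_embeddings=True
)
print(f"cosine(clean_question, dirty_question) = {clean_q @ dirty_q:.4f}")
print(f"cosine(system_prompt,  dirty_question) = {sys_p @ dirty_q:.4f}")
print(f"cosine(clean_question, system_prompt)  = {clean_q @ sys_p:.4f}")
```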

Degradation curve (averaged over 5 questions, MiniLM L6 v2)

| Prepend chars | cos(question, dirty) | cos(sysprompt, dirty) |
| --- | --- | --- |
| 0 (control) | 1.0000 | n/a |
| 100 | 0.6250 | 0.7684 |
| 250 | 0.5705 | 0.8088 |
| 500 | 0.5210 | 0.8802 |
| 1000 | 0.1404 | 1.0000 |
| 2000 | 0.1404 | 1.0000 |
| 5000 | 0.1404 | 1.0000 |

At 1000 chars the question is completely erased from the vector. At 500 chars the system prompt already dominates. MiniLM's 256-token context window fills up and the question gets truncated.
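
The truncation is easy to confirm by counting tokens. This sketch assumes the Hugging Face tokenizer published for the same model; the file path and question are placeholders:

```python
# Confirms that a ~2000-char system prompt alone fills the 256-token window,
# so the appended question never enters the embedding at all.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
system_prompt = open("system-prompt.md").read()
question = "Which module owns the retry logic?"

n_sys = len(tok(system_prompt)["input_ids"])
n_q = len(tok(question)["input_ids"])
print(f"system prompt: {n_sys} tokens, question: {n_q} tokens, window: 256")
print(f"question truncated away entirely: {n_sys >= 256}")
```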

Confirmed on a second model (nomic-embed-text, 8192-token window)

| Prepend chars | cos(question, dirty) | cos(sysprompt, dirty) |
| --- | --- | --- |
| 100 | 0.7552 | 0.8964 |
| 500 | 0.6827 | 0.9518 |
| 1000 | 0.5898 | 0.9825 |
| 2000 | 0.5488 | 0.9925 |
| 5000 | 0.5048 | 0.9987 |

The larger context window delays the cliff but doesn't prevent it. At 5000 chars, the question is 0.13% of the signal. The degradation is smoother (no hard truncation) but the destination is the same: the system prompt becomes the embedding.

Why it matters for MemPalace

mempalace wake-up is designed to inject context into the system prompt. The README instructs users to paste wake-up output into "your local model's system prompt." If that context reaches the embedding query (which it will in any MCP integration where the full conversation context is passed to the tool), retrieval fails silently.

This also affects any architecture where an MCP server receives the full system prompt alongside the user's question and concatenates them before searching. The failure mode is invisible: the search returns results, the scores look normal, but the results are wrong.

Reproduction

# Baseline
python benchmarks/cogpros_hybrid_bench.py data/cogpros_benchmark_v3.json \
  --no-expand --no-type-filter

# With system prompt prepended (any text file, 2000+ chars)
python benchmarks/cogpros_hybrid_bench.py data/cogpros_benchmark_v3.json \
  --no-expand --no-type-filter \
  --wake-up /path/to/your/system-prompt.md

Reproduced twice (2026-04-07 and 2026-04-08). Run configs embedded in output JSONL.

Fix direction

Query text for embedding retrieval must be isolated from system prompt content. If an MCP tool receives system_prompt + question, it needs to strip the system prompt before searching. This is a design constraint, not a tuning problem.
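
As a sketch of what that constraint looks like at the tool boundary (hypothetical function names; assumes an existing ChromaDB collection, not MemPalace's actual code), only the bare question ever reaches the embedding search:

```python
# Sketch of the design constraint (hypothetical names; assumes an existing
# ChromaDB collection). The system prompt never reaches the query text.
import chromadb

collection = chromadb.PersistentClient(path="./chroma").get_collection("memories")

def search_memories(question: str, n_results: int = 10):
    """Embedding search over the store. Callers pass the bare question only."""
    return collection.query(query_texts=[question], n_results=n_results)

def handle_tool_call(system_prompt: str, user_message: str):
    # Even if the MCP integration hands the tool the full conversation
    # context, the system prompt is dropped here, before any embedding.
    return search_memories(user_message)
```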

Related issues

Additional findings from the same benchmark suite

| Config | R@10 | Notes |
| --- | --- | --- |
| ChromaDB + type filter | 91.0% | Best simple config |
| ChromaDB + query expansion (MiniLM) | 89.8% | Baseline used above |
| GemmaEmbed + Qwen reranker | 92.8% | Best overall, 14.8s/query |
| Chunking (700/2100) | 85.3% | Fragments context, hurts |
| AAAK compression | 71.8% | Confirms #125 AAAK findings |
| Keyword overlap fusion | 89.8% | Adds nothing over embedding alone |
| BM25 | ~2% | Dead on natural language queries |

Document length is the #1 predictor of retrieval failure. Under 500 chars: 98.8%. Over 3000 chars: 76.7%.

Open research questions (help wanted)

Tested on two models so far:

  • all-MiniLM-L6-v2 (256-token window): hard cliff at ~1000 chars. Question completely erased.
  • nomic-embed-text (8192-token window): gradual degradation. At 5000 chars, question is 0.13% of signal.

The mechanism is confirmed across both: longer context windows delay the failure but don't prevent it. Still untested:

  • OpenAI text-embedding-3-small / text-embedding-3-large (8191-token window)
  • Cohere embed-v3 (512-token window)
  • Voyage AI embeddings
  • Any model with a longer context window

The reproduction is simple: embed a question alone, embed it with N chars prepended, measure cosine similarity. Five lines of code. The degradation curve will show you exactly where the cliff falls for your model.
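
Something like this sketch (the same measurement as the direct-proof section, looped over prepend lengths; the question and filler file are placeholders, and the `encode` call can be swapped for whichever model or provider you want to test):

```python
# Degradation-curve probe for an arbitrary embedding model (placeholder
# question; filler should be any text file with >= 5000 chars).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
question = "Which module owns the retry logic?"
filler = open("system-prompt.md").read()

for n in (0, 100, 250, 500, 1000, 2000, 5000):
    clean, dirty = model.encode(
        [question, filler[:n] + "\n" + question], normalize_embeddings=True
    )
    print(f"{n:5d} chars prepended -> cos(question, dirty) = {clean @ dirty:.4f}")
```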


Blog post with full methodology: https://cogpros.github.io/cogprosthetics-blog/2026/04/08/system-prompts-kill-retrieval.html
