What happens
Prepending system prompt content (wake-up context, MEMORY.md, or any persistent context block) to embedding queries before ChromaDB search causes near-total retrieval failure. No errors thrown. Results come back. They're just wrong.
Numbers
Tested on a 401-question benchmark (6 question types, 678 corpus documents) using `all-MiniLM-L6-v2` embeddings in ChromaDB.
| Config | R@1 | R@5 | R@10 |
| --- | --- | --- | --- |
| Baseline (no system prompt) | 63.3% | 84.3% | 89.8% |
| 2000 chars prepended to query | 0.5% | 0.5% | 1.0% |
Per-type breakdown at R@10 with system prompt prepended:
| Type | Baseline | With system prompt |
| --- | --- | --- |
| architecture (n=56) | 98.2% | 0.0% |
| cross-reference (n=80) | 97.5% | 0.0% |
| rule-recall (n=91) | 91.2% | 1.1% |
| preference-feedback (n=70) | 85.7% | 1.4% |
| project-state (n=100) | 83.0% | 2.0% |
| temporal (n=4) | 25.0% | 0.0% |
Every question type collapses. Architecture and cross-reference go to zero.
Why it happens
The embedding model represents the concatenated string as a single vector. 2000 chars of system prompt overwhelms the actual question (typically 10-50 chars). The resulting vector represents the system prompt, not the query. ChromaDB returns the nearest neighbors to the system prompt, not to the question.
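To see why length wins, a back-of-envelope sketch helps: `all-MiniLM-L6-v2` mean-pools token vectors into one sentence vector, so to a first approximation a 10-token question's contribution shrinks linearly with the prompt's token count. This toy model ignores contextualization; random stand-in vectors are enough to show the shape:

```python
# Toy dilution model of mean pooling. NOT the real model: token vectors
# here are random and un-contextualized; only the mixing arithmetic is real.
import numpy as np

rng = np.random.default_rng(0)
dim = 384                            # all-MiniLM-L6-v2 output width
v_question = rng.normal(size=dim)    # stands in for the question tokens
v_sysprompt = rng.normal(size=dim)   # stands in for the prompt tokens

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

n_q = 10                             # question tokens
for n_sp in (0, 25, 60, 125, 250):   # system-prompt tokens
    # mean pooling over the concatenation = token-count-weighted average
    v_dirty = (n_sp * v_sysprompt + n_q * v_question) / (n_sp + n_q)
    print(f"{n_sp:3d} prompt tokens -> cos(question, dirty) = {cos(v_question, v_dirty):.4f}")
```

Even this crude model reproduces the qualitative curve: roughly orthogonal content at a 25:1 token ratio leaves the question at a few percent of the vector.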
Direct proof (cosine similarity)
Embedding a single question, the system prompt alone, and the system prompt + question concatenated:
```
cosine(clean_question, dirty_question) = 0.4059
cosine(system_prompt, dirty_question)  = 1.0000
cosine(clean_question, system_prompt)  = 0.4059
```
The prepended query vector is identical to the system prompt vector. The question contributes nothing.
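A minimal sketch of that measurement with sentence-transformers (the question string and prompt path are placeholders, not the benchmark's):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("all-MiniLM-L6-v2")

question = "What port does the ingest service listen on?"  # placeholder
system_prompt = open("system-prompt.md").read()            # any 2000+ char prompt

clean = model.encode(question)
sysprompt = model.encode(system_prompt)
dirty = model.encode(system_prompt + "\n" + question)

print(f"cos(clean_question, dirty_question) = {float(cos_sim(clean, dirty)):.4f}")
print(f"cos(system_prompt, dirty_question)  = {float(cos_sim(sysprompt, dirty)):.4f}")
print(f"cos(clean_question, system_prompt)  = {float(cos_sim(clean, sysprompt)):.4f}")
```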
Degradation curve (averaged over 5 questions, `all-MiniLM-L6-v2`)

| Prepend chars | cos(question, dirty) | cos(sysprompt, dirty) |
| --- | --- | --- |
| 0 (control) | 1.0000 | n/a |
| 100 | 0.6250 | 0.7684 |
| 250 | 0.5705 | 0.8088 |
| 500 | 0.5210 | 0.8802 |
| 1000 | 0.1404 | 1.0000 |
| 2000 | 0.1404 | 1.0000 |
| 5000 | 0.1404 | 1.0000 |
At 500 chars the system prompt already dominates; at 1000 chars the question is completely erased from the vector. The mechanism is blunt: MiniLM's 256-token context window fills with system prompt and the question is truncated away.
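The sweep that produces a curve like this is short; swapping the model name reproduces it for any sentence-transformers model. Question and path are placeholders, and the sysprompt column here embeds the truncated prepend itself, which is one plausible reading of the table above:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("all-MiniLM-L6-v2")
question = "Where is the retry budget configured?"   # placeholder
system_prompt = open("system-prompt.md").read()      # needs 5000+ chars

clean = model.encode(question)
for n in (0, 100, 250, 500, 1000, 2000, 5000):
    dirty = model.encode(system_prompt[:n] + question)
    line = f"{n:5d} chars  cos(question, dirty) = {float(cos_sim(clean, dirty)):.4f}"
    if n:  # the control row has no prepend to compare against
        prepend = model.encode(system_prompt[:n])
        line += f"  cos(prepend, dirty) = {float(cos_sim(prepend, dirty)):.4f}"
    print(line)
```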
Confirmed on a second model (`nomic-embed-text`, 8192-token window)

| Prepend chars | cos(question, dirty) | cos(sysprompt, dirty) |
| --- | --- | --- |
| 100 | 0.7552 | 0.8964 |
| 500 | 0.6827 | 0.9518 |
| 1000 | 0.5898 | 0.9825 |
| 2000 | 0.5488 | 0.9925 |
| 5000 | 0.5048 | 0.9987 |
The larger context window delays the cliff but doesn't prevent it. At 5000 chars, the question is 0.13% of the signal. The degradation is smoother (no hard truncation) but the destination is the same: the system prompt becomes the embedding.
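The same sweep through a local Ollama server would look roughly like this (a sketch assuming the `ollama` Python package and a pulled `nomic-embed-text` model; not necessarily the exact harness used for the table above):

```python
import numpy as np
import ollama  # assumes a running Ollama server with nomic-embed-text pulled

def embed(text: str) -> np.ndarray:
    return np.array(ollama.embeddings(model="nomic-embed-text", prompt=text)["embedding"])

def cos(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

question = "Where is the retry budget configured?"   # placeholder
system_prompt = open("system-prompt.md").read()

clean = embed(question)
for n in (100, 500, 1000, 2000, 5000):
    dirty = embed(system_prompt[:n] + question)
    print(f"{n:5d} chars  cos(question, dirty) = {cos(clean, dirty):.4f}")
```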
Why it matters for MemPalace
`mempalace wake-up` is designed to inject context into the system prompt. The README instructs users to paste wake-up output into "your local model's system prompt." If that context reaches the embedding query (which it will in any MCP integration where the full conversation context is passed to the tool), retrieval fails silently.
This also affects any architecture where an MCP server receives the full system prompt alongside the user's question and concatenates them before searching. The failure mode is invisible: the search returns results, the scores look normal, but the results are wrong.
Reproduction
```bash
# Baseline
python benchmarks/cogpros_hybrid_bench.py data/cogpros_benchmark_v3.json \
    --no-expand --no-type-filter

# With system prompt prepended (any text file, 2000+ chars)
python benchmarks/cogpros_hybrid_bench.py data/cogpros_benchmark_v3.json \
    --no-expand --no-type-filter \
    --wake-up /path/to/your/system-prompt.md
```
Reproduced twice (2026-04-07 and 2026-04-08). Run configs embedded in output JSONL.
Fix direction
Query text for embedding retrieval must be isolated from system prompt content. If an MCP tool receives `system_prompt + question`, it needs to strip the system prompt before searching. This is a design constraint, not a tuning problem.
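As a sketch of what that constraint looks like at the API boundary (hypothetical names, not MemPalace's actual interface): the search entry point accepts only the bare question, so there is no parameter the system prompt can leak through.

```python
import chromadb

client = chromadb.PersistentClient(path="./chroma")
collection = client.get_or_create_collection("memory")

def search(question: str, k: int = 10) -> dict:
    """Embed ONLY the user's question. Callers must never concatenate
    wake-up context or system prompt text into `question`."""
    return collection.query(query_texts=[question], n_results=k)

# The anti-pattern this design forbids:
#   search(system_prompt + "\n" + question)  # embeds the prompt, not the query
```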
Related issues
Additional findings from the same benchmark suite
| Config | R@10 | Notes |
| --- | --- | --- |
| ChromaDB + type filter | 91.0% | Best simple config |
| ChromaDB + query expansion (MiniLM) | 89.8% | Baseline used above |
| GemmaEmbed + Qwen reranker | 92.8% | Best overall, 14.8s/query |
| Chunking (700/2100) | 85.3% | Fragments context, hurts |
| AAAK compression | 71.8% | Confirms #125 AAAK findings |
| Keyword overlap fusion | 89.8% | Adds nothing over embedding alone |
| BM25 | ~2% | Dead on natural language queries |
Document length is the #1 predictor of retrieval failure: documents under 500 chars reach 98.8% R@10; documents over 3000 chars drop to 76.7%.
Open research questions (help wanted)
Tested on two models so far:

- `all-MiniLM-L6-v2` (256-token window): hard cliff at ~1000 chars; the question is completely erased.
- `nomic-embed-text` (8192-token window): gradual degradation; at 5000 chars the question is 0.13% of the signal.

The mechanism is confirmed across both: longer context windows delay the failure but don't prevent it. Still untested:

- OpenAI `text-embedding-3-small` / `text-embedding-3-large` (8191-token window)
- Cohere `embed-v3` (512-token window)
- Voyage AI embeddings
- Any model with a longer context window
The reproduction is simple: embed a question alone, embed it with N chars prepended, measure cosine similarity. Five lines of code. The degradation curve will show you exactly where the cliff falls for your model.
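For instance, a hedged sketch for the untested OpenAI models (assumes the official `openai` v1 Python client and an `OPENAI_API_KEY`; the numbers it prints are exactly the open question):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str, model: str = "text-embedding-3-small") -> np.ndarray:
    return np.array(client.embeddings.create(model=model, input=[text]).data[0].embedding)

def cos(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

question = "Where is the retry budget configured?"   # placeholder
system_prompt = open("system-prompt.md").read()

clean = embed(question)
for n in (0, 100, 500, 1000, 2000, 5000):
    print(n, round(cos(clean, embed(system_prompt[:n] + question)), 4))
```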
Blog post with full methodology: https://cogpros.github.io/cogprosthetics-blog/2026/04/08/system-prompts-kill-retrieval.html