
System prompt context prepended to queries drops retrieval from 89.8% to 1.0% #333

@cogpros

Description

What happens

Prepending system prompt content (wake-up context, MEMORY.md, or any persistent context block) to embedding queries before ChromaDB search causes near-total retrieval failure. No errors thrown. Results come back. They're just wrong.

Numbers

Tested on a 401-question benchmark (6 question types, 678 corpus documents) using all-MiniLM-L6-v2 embeddings in ChromaDB.

| Config | R@1 | R@5 | R@10 |
| --- | --- | --- | --- |
| Baseline (no system prompt) | 63.3% | 84.3% | 89.8% |
| 2000 chars prepended to query | 0.5% | 0.5% | 1.0% |

Per-type breakdown at R@10 with system prompt prepended:

| Type | Baseline | With system prompt |
| --- | --- | --- |
| architecture (n=56) | 98.2% | 0.0% |
| cross-reference (n=80) | 97.5% | 0.0% |
| rule-recall (n=91) | 91.2% | 1.1% |
| preference-feedback (n=70) | 85.7% | 1.4% |
| project-state (n=100) | 83.0% | 2.0% |
| temporal (n=4) | 25.0% | 0.0% |

Every question type collapses. Architecture and cross-reference go to zero.

Why it happens

The embedding model represents the concatenated string as a single vector. 2000 chars of system prompt overwhelms the actual question (typically 10-50 chars). The resulting vector represents the system prompt, not the query. ChromaDB returns the nearest neighbors to the system prompt, not to the question.
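
For concreteness, here is a minimal sketch of the failing query path against ChromaDB. The collection name, file path, and question are hypothetical placeholders, and it assumes a collection already populated with the corpus:

```python
# Sketch of the failing query path (hypothetical names; assumes an
# existing ChromaDB collection already populated with the corpus).
import chromadb

client = chromadb.PersistentClient(path="./chroma")
collection = client.get_collection("memories")

wake_up_context = open("system-prompt.md").read()   # ~2000 chars of persistent context
question = "Which module owns the retry logic?"     # 10-50 chars

# The whole context block is embedded together with the question, so the
# query vector ends up representing the system prompt, not the question.
results = collection.query(
    query_texts=[wake_up_context + "\n" + question],
    n_results=10,
)
```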

Direct proof (cosine similarity)

Embedding a single question, the system prompt alone, and the system prompt + question concatenated:

cosine(clean_question, dirty_question)  = 0.4059
cosine(system_prompt,  dirty_question)  = 1.0000
cosine(clean_question, system_prompt)   = 0.4059

The prepended query vector is identical to the system prompt vector. The question contributes nothing.
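
A minimal sketch of that measurement with sentence-transformers (placeholder question and file path; embeddings are normalized so the dot product is the cosine):

```python
# Reproduces the three cosines above (placeholder strings; the dot product
# of normalized embeddings equals cosine similarity).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
question = "Which module owns the retry logic?"
system_prompt = open("system-prompt.md").read()        # ~2000 chars
dirty = system_prompt + "\n" + question

clean_q, sys_p, dirty_q = model.encode(
    [question, system_prompt, dirty], normalize_embeddings=True
)
print(f"cosine(clean_question, dirty_question) = {clean_q @ dirty_q:.4f}")
print(f"cosine(system_prompt,  dirty_question) = {sys_p @ dirty_q:.4f}")
print(f"cosine(clean_question, system_prompt)  = {clean_q @ sys_p:.4f}")
```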

Degradation curve (averaged over 5 questions, MiniLM L6 v2)

| Prepend chars | cos(question, dirty) | cos(sysprompt, dirty) |
| --- | --- | --- |
| 0 (control) | 1.0000 | n/a |
| 100 | 0.6250 | 0.7684 |
| 250 | 0.5705 | 0.8088 |
| 500 | 0.5210 | 0.8802 |
| 1000 | 0.1404 | 1.0000 |
| 2000 | 0.1404 | 1.0000 |
| 5000 | 0.1404 | 1.0000 |

At 1000 chars the question is completely erased from the vector. At 500 chars the system prompt already dominates. MiniLM's 256-token context window fills up and the question gets truncated.
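
The truncation is easy to confirm by counting tokens. This sketch assumes the Hugging Face tokenizer published for the same model; the file path and question are placeholders:

```python
# Confirms that a ~2000-char system prompt alone fills the 256-token window,
# so the appended question never enters the embedding at all.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
system_prompt = open("system-prompt.md").read()
question = "Which module owns the retry logic?"

n_sys = len(tok(system_prompt)["input_ids"])
n_q = len(tok(question)["input_ids"])
print(f"system prompt: {n_sys} tokens, question: {n_q} tokens, window: 256")
print(f"question truncated away entirely: {n_sys >= 256}")
```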

Confirmed on a second model (nomic-embed-text, 8192-token window)

| Prepend chars | cos(question, dirty) | cos(sysprompt, dirty) |
| --- | --- | --- |
| 100 | 0.7552 | 0.8964 |
| 500 | 0.6827 | 0.9518 |
| 1000 | 0.5898 | 0.9825 |
| 2000 | 0.5488 | 0.9925 |
| 5000 | 0.5048 | 0.9987 |

The larger context window delays the cliff but doesn't prevent it. At 5000 chars, the question is 0.13% of the signal. The degradation is smoother (no hard truncation) but the destination is the same: the system prompt becomes the embedding.

Why it matters for MemPalace

mempalace wake-up is designed to inject context into the system prompt. The README instructs users to paste wake-up output into "your local model's system prompt." If that context reaches the embedding query (which it will in any MCP integration where the full conversation context is passed to the tool), retrieval fails silently.

This also affects any architecture where an MCP server receives the full system prompt alongside the user's question and concatenates them before searching. The failure mode is invisible: the search returns results, the scores look normal, but the results are wrong.

Reproduction

# Baseline
python benchmarks/cogpros_hybrid_bench.py data/cogpros_benchmark_v3.json \
  --no-expand --no-type-filter

# With system prompt prepended (any text file, 2000+ chars)
python benchmarks/cogpros_hybrid_bench.py data/cogpros_benchmark_v3.json \
  --no-expand --no-type-filter \
  --wake-up /path/to/your/system-prompt.md

Reproduced twice (2026-04-07 and 2026-04-08). Run configs embedded in output JSONL.

Fix direction

Query text for embedding retrieval must be isolated from system prompt content. If an MCP tool receives system_prompt + question, it needs to strip the system prompt before searching. This is a design constraint, not a tuning problem.
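
As a sketch of what that constraint looks like at the tool boundary (hypothetical function names; assumes an existing ChromaDB collection, not MemPalace's actual code), only the bare question ever reaches the embedding search:

```python
# Sketch of the design constraint (hypothetical names; assumes an existing
# ChromaDB collection). The system prompt never reaches the query text.
import chromadb

collection = chromadb.PersistentClient(path="./chroma").get_collection("memories")

def search_memories(question: str, n_results: int = 10):
    """Embedding search over the store. Callers pass the bare question only."""
    return collection.query(query_texts=[question], n_results=n_results)

def handle_tool_call(system_prompt: str, user_message: str):
    # Even if the MCP integration hands the tool the full conversation
    # context, the system prompt is dropped here, before any embedding.
    return search_memories(user_message)
```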

Related issues

Additional findings from the same benchmark suite

| Config | R@10 | Notes |
| --- | --- | --- |
| ChromaDB + type filter | 91.0% | Best simple config |
| ChromaDB + query expansion (MiniLM) | 89.8% | Baseline used above |
| GemmaEmbed + Qwen reranker | 92.8% | Best overall, 14.8s/query |
| Chunking (700/2100) | 85.3% | Fragments context, hurts |
| AAAK compression | 71.8% | Confirms #125 AAAK findings |
| Keyword overlap fusion | 89.8% | Adds nothing over embedding alone |
| BM25 | ~2% | Dead on natural language queries |

Document length is the #1 predictor of retrieval failure. Under 500 chars: 98.8%. Over 3000 chars: 76.7%.

Open research questions (help wanted)

Tested on two models so far:

  • all-MiniLM-L6-v2 (256-token window): hard cliff at ~1000 chars. Question completely erased.
  • nomic-embed-text (8192-token window): gradual degradation. At 5000 chars, question is 0.13% of signal.

The mechanism is confirmed across both: longer context windows delay the failure but don't prevent it. Still untested:

  • OpenAI text-embedding-3-small / text-embedding-3-large (8191-token window)
  • Cohere embed-v3 (512-token window)
  • Voyage AI embeddings
  • Any model with a longer context window

The reproduction is simple: embed a question alone, embed it with N chars prepended, measure cosine similarity. Five lines of code. The degradation curve will show you exactly where the cliff falls for your model.
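
Something like this sketch (the same measurement as the direct-proof section, looped over prepend lengths; the question and filler file are placeholders, and the `encode` call can be swapped for whichever model or provider you want to test):

```python
# Degradation-curve probe for an arbitrary embedding model (placeholder
# question; filler should be any text file with >= 5000 chars).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
question = "Which module owns the retry logic?"
filler = open("system-prompt.md").read()

for n in (0, 100, 250, 500, 1000, 2000, 5000):
    clean, dirty = model.encode(
        [question, filler[:n] + "\n" + question], normalize_embeddings=True
    )
    print(f"{n:5d} chars prepended -> cos(question, dirty) = {clean @ dirty:.4f}")
```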


Blog post with full methodology: https://cogpros.github.io/cogprosthetics-blog/2026/04/08/system-prompts-kill-retrieval.html
