fix(mcp): embedding guard centroid drifts toward attacker content over time (boiling frog) #2311
Closed
Labels
P2 (High value, medium complexity), security (Security-related issue)
Description
Problem
The per-server centroid in EmbeddingAnomalyGuard is updated with every output that is not flagged as anomalous. A patient attacker can gradually shift the centroid toward malicious content by sending subtly crafted outputs that stay just below the anomaly threshold, until injections eventually appear normal.
Root cause
EmbeddingAnomalyGuard::update_centroid() accepts all non-anomalous outputs as clean training data. There is no verification that the content is actually clean — only that it was not flagged anomalous by the current (potentially already drifted) threshold.
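The drift can be reproduced with a toy model. The sketch below is not the real EmbeddingAnomalyGuard: `NaiveGuard`, its fields, the exponential-moving-average update, the cosine-distance check, and the 2-D embeddings are all assumptions made for illustration. It only shows that when acceptance is judged against the current centroid, a sequence of individually acceptable samples can walk the centroid to a point where an initially anomalous embedding passes.

```rust
/// Hypothetical stand-in for the guard; field names and the EMA update are assumptions.
struct NaiveGuard {
    centroid: Vec<f32>,
    alpha: f32,     // weight given to each accepted sample
    threshold: f32, // maximum cosine distance still treated as clean
}

fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 {
        return 1.0; // treat degenerate vectors as maximally distant
    }
    1.0 - dot / (na * nb)
}

impl NaiveGuard {
    fn is_anomalous(&self, e: &[f32]) -> bool {
        cosine_distance(&self.centroid, e) > self.threshold
    }

    /// Folds every non-anomalous sample into the centroid, with no other check.
    /// Returns true if the sample was accepted.
    fn observe(&mut self, e: &[f32]) -> bool {
        if self.is_anomalous(e) {
            return false;
        }
        for (c, x) in self.centroid.iter_mut().zip(e) {
            *c = (1.0 - self.alpha) * *c + self.alpha * *x;
        }
        true
    }
}

/// Attacker loop: each sample sits 0.3 rad past the current centroid, which is
/// inside a 0.1 cosine-distance threshold. Returns the number of steps after
/// which `malicious` is no longer flagged, or None if that never happens.
/// Assumes `malicious` lies counter-clockwise of the starting centroid.
fn simulate_drift(guard: &mut NaiveGuard, malicious: [f32; 2], max_steps: usize) -> Option<usize> {
    let target_angle = malicious[1].atan2(malicious[0]);
    for step in 1..=max_steps {
        let current = guard.centroid[1].atan2(guard.centroid[0]);
        let next = (current + 0.3).min(target_angle);
        let sample = [next.cos(), next.sin()];
        if !guard.observe(&sample) {
            return None; // a sample was rejected, so the attack stalled
        }
        if !guard.is_anomalous(&malicious) {
            return Some(step);
        }
    }
    None
}
```

With a starting centroid of `[1.0, 0.0]`, `alpha = 0.2`, and `threshold = 0.1`, the orthogonal embedding `[0.0, 1.0]` is flagged at first, and `simulate_drift` returns `Some(_)` well within 100 steps without any single sample being rejected.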
Proposed fix
- Cap the centroid update rate: apply Bayesian weighting so that new samples have diminishing influence once the centroid is stable
- Periodic centroid re-anchoring from a trusted baseline (e.g., system prompt embeddings)
- Optionally: route centroid updates through the response verifier before accepting as clean
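A minimal sketch of the first two items, under stated assumptions: `AnchoredGuard` and every field and parameter on it are hypothetical, the "Bayesian weighting" is approximated by a pseudo-count running mean (each accepted sample gets weight 1/n), and re-anchoring is modelled as blending the centroid back toward a fixed trusted baseline every `reanchor_every` accepted samples. The response-verifier hook from the third item is left out.

```rust
/// Hypothetical hardened guard; all names and parameters are illustrative.
struct AnchoredGuard {
    baseline: Vec<f32>,  // trusted embedding, never updated by observed outputs
    centroid: Vec<f32>,
    threshold: f32,      // maximum cosine distance still treated as clean
    pseudo_count: f32,   // prior weight plus number of accepted samples
    reanchor_every: u32, // accepted samples between re-anchoring passes
    anchor_pull: f32,    // 0.0 = never pull back, 1.0 = reset to the baseline
    since_reanchor: u32,
}

fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 {
        return 1.0; // treat degenerate vectors as maximally distant
    }
    1.0 - dot / (na * nb)
}

impl AnchoredGuard {
    fn new(baseline: Vec<f32>, threshold: f32, prior_weight: f32,
           reanchor_every: u32, anchor_pull: f32) -> Self {
        Self {
            centroid: baseline.clone(),
            baseline,
            threshold,
            pseudo_count: prior_weight,
            reanchor_every,
            anchor_pull,
            since_reanchor: 0,
        }
    }

    fn centroid(&self) -> &[f32] {
        &self.centroid
    }

    fn is_anomalous(&self, e: &[f32]) -> bool {
        cosine_distance(&self.centroid, e) > self.threshold
    }

    /// Accepts a non-anomalous sample with weight 1 / pseudo_count, so each
    /// new sample moves the centroid less than the one before it.
    fn observe(&mut self, e: &[f32]) -> bool {
        if self.is_anomalous(e) {
            return false;
        }
        self.pseudo_count += 1.0;
        let w = 1.0 / self.pseudo_count;
        for (c, x) in self.centroid.iter_mut().zip(e) {
            *c = (1.0 - w) * *c + w * *x;
        }
        self.since_reanchor += 1;
        if self.since_reanchor >= self.reanchor_every {
            self.reanchor();
        }
        true
    }

    /// Blends the centroid back toward the trusted baseline.
    fn reanchor(&mut self) {
        let p = self.anchor_pull;
        for (c, b) in self.centroid.iter_mut().zip(&self.baseline) {
            *c = (1.0 - p) * *c + p * *b;
        }
        self.since_reanchor = 0;
    }
}
```

Diminishing weights alone only slow the drift, since the weights 1/n still sum without bound; it is the periodic pull toward the baseline that keeps the centroid within a bounded distance of trusted content. The values of `prior_weight`, `reanchor_every`, and `anchor_pull` would need tuning against real embeddings, and the baseline itself has to come from a source the attacker cannot influence.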
Priority
P2 — affects the long-term reliability of the embedding anomaly guard but requires sustained attacker access.
Related: PR #2310