fix(mcp): embedding guard centroid drifts toward attacker content over time (boiling frog) #2311

@bug-ops

Problem

The per-server centroid in EmbeddingAnomalyGuard is updated with every output flagged as clean. A patient attacker can gradually shift the centroid toward malicious content by sending subtly crafted outputs that stay just below the anomaly threshold, eventually making injections appear normal.

Root cause

EmbeddingAnomalyGuard::update_centroid() accepts all non-anomalous outputs as clean training data. There is no verification that the content is actually clean. The only check is that it was not flagged as anomalous relative to the current (potentially already drifted) centroid.
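To make the drift concrete, here is a minimal simulation sketch. It is not the actual guard: `NaiveGuard`, the Euclidean distance metric, the fixed-rate moving-average update, and all numeric parameters are assumptions chosen for illustration. It only shows how probes kept just inside the threshold can walk a centroid toward a target that was initially flagged.

```rust
// Hypothetical stand-in for the guard; the update rule and metric are assumptions.
struct NaiveGuard {
    centroid: Vec<f32>,
    threshold: f32, // maximum distance from the centroid still treated as clean
    alpha: f32,     // assumed fixed weight given to every accepted sample
}

fn distance(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f32>().sqrt()
}

impl NaiveGuard {
    fn new(centroid: Vec<f32>, threshold: f32, alpha: f32) -> Self {
        Self { centroid, threshold, alpha }
    }

    fn is_anomalous(&self, embedding: &[f32]) -> bool {
        distance(&self.centroid, embedding) > self.threshold
    }

    // Every non-anomalous sample moves the centroid; nothing checks that it is clean.
    fn update_centroid(&mut self, embedding: &[f32]) {
        for (c, x) in self.centroid.iter_mut().zip(embedding) {
            *c += self.alpha * (x - *c);
        }
    }

    fn observe(&mut self, embedding: &[f32]) -> bool {
        if self.is_anomalous(embedding) {
            return false;
        }
        self.update_centroid(embedding);
        true
    }
}

// Sends probes placed at 90% of the threshold in the direction of `target`
// and returns how many were sent before `target` itself stops being flagged.
fn probes_until_accepted(
    guard: &mut NaiveGuard,
    target: &[f32],
    max_probes: usize,
) -> Option<usize> {
    for sent in 0..max_probes {
        if !guard.is_anomalous(target) {
            return Some(sent);
        }
        let gap = distance(&guard.centroid, target);
        let step = 0.9 * guard.threshold / gap;
        let probe: Vec<f32> = guard
            .centroid
            .iter()
            .zip(target)
            .map(|(c, t)| c + step * (t - c))
            .collect();
        guard.observe(&probe);
    }
    None
}
```

With a centroid at the origin, a threshold of 1.0, an update weight of 0.1, and a target 5.0 units away, each accepted probe moves the centroid by about 0.09, so the target stops being flagged after a few dozen probes, none of which was itself anomalous.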

Proposed fix

  • Cap the centroid update rate: apply Bayesian weighting so that new samples have diminishing influence once the centroid is stable
  • Periodically re-anchor the centroid to a trusted baseline (e.g., system prompt embeddings)
  • Optionally: route centroid updates through the response verifier before accepting an output as clean
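One way the three ideas above could fit together is sketched below. Everything here is hypothetical: `AnchoredCentroid`, its fields, the pseudo-count prior, the blend-style re-anchoring, and the `verified_clean` flag (standing in for the response verifier's decision) are illustrative assumptions, not the existing API. The sketch covers only the centroid update, not the anomaly check itself.

```rust
// Hypothetical hardened centroid; names and parameters are assumptions.
struct AnchoredCentroid {
    baseline: Vec<f32>,  // trusted embedding, e.g. derived from the system prompt
    centroid: Vec<f32>,
    accepted: u32,       // number of samples folded in so far
    prior_count: f32,    // pseudo-count: how many samples the baseline is "worth"
    reanchor_every: u32, // re-anchor after this many accepted samples (0 disables)
    anchor_pull: f32,    // fraction of the drift removed at each re-anchor
}

impl AnchoredCentroid {
    fn new(baseline: Vec<f32>, prior_count: f32, reanchor_every: u32, anchor_pull: f32) -> Self {
        Self {
            centroid: baseline.clone(),
            baseline,
            accepted: 0,
            prior_count,
            reanchor_every,
            anchor_pull,
        }
    }

    // Returns true if the sample was folded into the centroid.
    fn update(&mut self, embedding: &[f32], verified_clean: bool) -> bool {
        // Gate: only samples the verifier vouched for may move the centroid.
        if !verified_clean || embedding.len() != self.centroid.len() {
            return false;
        }
        self.accepted += 1;
        // Diminishing weight: each new sample counts for 1 / (prior + n).
        let w = 1.0 / (self.prior_count + self.accepted as f32);
        for (c, x) in self.centroid.iter_mut().zip(embedding) {
            *c += w * (*x - *c);
        }
        // Periodic re-anchoring: pull the centroid back toward the baseline.
        if self.reanchor_every > 0 && self.accepted % self.reanchor_every == 0 {
            for (c, b) in self.centroid.iter_mut().zip(&self.baseline) {
                *c += self.anchor_pull * (*b - *c);
            }
        }
        true
    }

    // Euclidean distance between the current centroid and the trusted baseline.
    fn drift(&self) -> f32 {
        self.centroid
            .iter()
            .zip(&self.baseline)
            .map(|(c, b)| (c - b).powi(2))
            .sum::<f32>()
            .sqrt()
    }
}
```

The diminishing weight alone only slows drift (it still grows roughly logarithmically with the number of accepted samples), which is why the sketch pairs it with re-anchoring. Exposing `drift()` would also let the guard alert when the centroid has moved unusually far from the baseline.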

Priority

P2 — affects the long-term reliability of the embedding anomaly guard but requires sustained attacker access.

Related: PR #2310

Labels

P2 (High value, medium complexity), security (Security-related issue)
