Skip to content

feat(security): IPI defense — DeBERTa soft-signal, three-class AlignSentinel, TurnCausalAnalyzer#2369

Merged
bug-ops merged 2 commits intomainfrom
deberta-ipi-classifier-fpr
Mar 28, 2026
Merged

feat(security): IPI defense — DeBERTa soft-signal, three-class AlignSentinel, TurnCausalAnalyzer#2369
bug-ops merged 2 commits intomainfrom
deberta-ipi-classifier-fpr

Conversation

@bug-ops
Copy link
Copy Markdown
Owner

@bug-ops bug-ops commented Mar 28, 2026

Summary

Implements three complementary IPI defense enhancements from research issues #2193, #2208, #2335:

Bootstrap wiring added to runner.rs, daemon.rs, acp.rs. Pre-existing zeph-sanitizer/guardrail feature forwarding bug fixed as a side effect.

Test plan

  • cargo +nightly fmt --check passes
  • cargo clippy --workspace --features full -- -D warnings — 0 warnings
  • cargo nextest run --workspace --features full --lib --bins — 7073/7073 passed
  • Warn mode test: classify_injection_warn_mode_above_threshold_returns_suspicious
  • Two-stage pipeline tests: aligned→Clean, misaligned→Blocked, error→binary fallback
  • Threshold serde validation tests

Follow-up issues to file

  • acp.rs inlines wiring instead of shared helpers (DRY gap)
  • NoInstruction downgrade confidence threshold (IM-1)
  • SNIPPET_MAX_BYTES doc says "chars" not "bytes" (IM-2)
  • Probe error/timeout path tests and native.rs causal integration tests
  • Suspicious verdict not in SecurityEvent enum

…entinel, TurnCausalAnalyzer (#2193, #2208, #2335)

Add three complementary layers to Zeph's indirect prompt injection defense:

- InjectionEnforcementMode (Warn/Block) for DeBERTa classifier: default Warn
  mode returns Suspicious instead of Blocked, preventing high FPR from
  off-the-shelf models from disrupting legitimate tool operations. All
  fallback paths (regex, error, timeout) respect the enforcement mode via
  regex_verdict() helper.

- CandleThreeClassClassifier: two-stage pipeline that runs binary detection
  first, then refines positive hits with a three-class model
  (misaligned-instruction / aligned-instruction / no-instruction). Aligned
  instructions are downgraded to Clean, substantially reducing FPR.
  Dynamic id2label from model config.json; load failures allow retry
  (not permanently cached).

- TurnCausalAnalyzer: per-batch LLM probes at tool-return boundaries
  compute behavioral deviation via normalized Levenshtein + Jaccard.
  Probe responses bounded by probe_max_chars. Never blocks — emits
  metric ipi.causal_deviation and SecurityEvent on threshold crossing.
  Config: [security.causal_ipi] enabled=false, threshold=0.7, provider="fast".

Bootstrap wiring in runner.rs, daemon.rs, acp.rs via apply_enforcement_mode(),
apply_three_class_classifier(), apply_causal_analyzer(). Pre-existing
zeph-sanitizer/guardrail feature forwarding bug fixed as a side effect.
@bug-ops bug-ops force-pushed the deberta-ipi-classifier-fpr branch from a6d9992 to d432809 Compare March 28, 2026 18:48
@bug-ops bug-ops enabled auto-merge (squash) March 28, 2026 18:50
@bug-ops bug-ops merged commit b27cf4f into main Mar 28, 2026
25 checks passed
@bug-ops bug-ops deleted the deberta-ipi-classifier-fpr branch March 28, 2026 19:13
bug-ops added a commit that referenced this pull request Mar 30, 2026
…PI duplication

- Populate InitializeResponse.auth_methods with [{type: agent, id: zeph}] using
  the typed builder; previously returned authMethods: [] which blocked ACP Registry
  inclusion (#2422)
- Serve GET /agent.json with agent identity manifest (id, name, version, description,
  distribution) for ACP Registry discovery; gated on discovery_enabled (#2422)
- Extract apply_three_class_classifier_with_cfg and apply_causal_analyzer_with_cfg
  helpers in agent_setup.rs; acp.rs now delegates instead of inlining construction
  eliminating the DRY gap from #2369 (#2370)
- discovery.rs already reflects ProtocolVersion::LATEST since PR #2423 (#2412)

Closes #2422, closes #2370
bug-ops added a commit that referenced this pull request Mar 30, 2026
…PI duplication (#2431)

- Populate InitializeResponse.auth_methods with [{type: agent, id: zeph}] using
  the typed builder; previously returned authMethods: [] which blocked ACP Registry
  inclusion (#2422)
- Serve GET /agent.json with agent identity manifest (id, name, version, description,
  distribution) for ACP Registry discovery; gated on discovery_enabled (#2422)
- Extract apply_three_class_classifier_with_cfg and apply_causal_analyzer_with_cfg
  helpers in agent_setup.rs; acp.rs now delegates instead of inlining construction
  eliminating the DRY gap from #2369 (#2370)
- discovery.rs already reflects ProtocolVersion::LATEST since PR #2423 (#2412)

Closes #2422, closes #2370
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core zeph-core crate dependencies Dependency updates documentation Improvements or additions to documentation enhancement New feature or request llm zeph-llm crate (Ollama, Claude) rust Rust code changes size/XL Extra large PR (500+ lines)

Projects

None yet

1 participant