feat(security): IPI defense — DeBERTa soft-signal, three-class AlignSentinel, TurnCausalAnalyzer by bug-ops · Pull Request #2369 · bug-ops/zeph

bug-ops · 2026-03-28T18:46:56Z

Summary

Implements three complementary IPI defense enhancements from research issues #2193, #2208, #2335:

research(security): MELON paper — DeBERTa injection detectors have high FPR; use as soft signal only (arXiv:2502.05174) #2193 (MELON): InjectionEnforcementMode (Warn/Block) — default Warn returns Suspicious instead of Blocked, preventing high DeBERTa FPR from blocking legitimate tool operations. All fallback paths (regex, error, timeout) respect the mode via regex_verdict() helper.
research(security): AlignSentinel — alignment-aware DeBERTa-v3 classifier reduces FPR on benign tool outputs (arXiv:2602.13597) #2208 (AlignSentinel): CandleThreeClassClassifier two-stage pipeline — binary detection first, three-class refinement (misaligned/aligned/no-instruction) on positive hits. Aligned instructions downgraded to Clean. Dynamic id2label from model config.json; load failures allow retry.
research(security): AgentSentry temporal causal diagnostics — turn-level IPI attribution + context purification, +20-33pp utility under attack (arXiv:2602.22724) #2335 (AgentSentry): TurnCausalAnalyzer — per-batch LLM probes at tool-return boundaries, local Levenshtein+Jaccard deviation scoring. Never blocks; emits ipi.causal_deviation metric + SecurityEvent on threshold crossing. Config: [security.causal_ipi].

Bootstrap wiring added to runner.rs, daemon.rs, acp.rs. Pre-existing zeph-sanitizer/guardrail feature forwarding bug fixed as a side effect.

Test plan

cargo +nightly fmt --check passes
cargo clippy --workspace --features full -- -D warnings — 0 warnings
cargo nextest run --workspace --features full --lib --bins — 7073/7073 passed
Warn mode test: classify_injection_warn_mode_above_threshold_returns_suspicious
Two-stage pipeline tests: aligned→Clean, misaligned→Blocked, error→binary fallback
Threshold serde validation tests

Follow-up issues to file

acp.rs inlines wiring instead of shared helpers (DRY gap)
NoInstruction downgrade confidence threshold (IM-1)
SNIPPET_MAX_BYTES doc says "chars" not "bytes" (IM-2)
Probe error/timeout path tests and native.rs causal integration tests
Suspicious verdict not in SecurityEvent enum

…entinel, TurnCausalAnalyzer (#2193, #2208, #2335) Add three complementary layers to Zeph's indirect prompt injection defense: - InjectionEnforcementMode (Warn/Block) for DeBERTa classifier: default Warn mode returns Suspicious instead of Blocked, preventing high FPR from off-the-shelf models from disrupting legitimate tool operations. All fallback paths (regex, error, timeout) respect the enforcement mode via regex_verdict() helper. - CandleThreeClassClassifier: two-stage pipeline that runs binary detection first, then refines positive hits with a three-class model (misaligned-instruction / aligned-instruction / no-instruction). Aligned instructions are downgraded to Clean, substantially reducing FPR. Dynamic id2label from model config.json; load failures allow retry (not permanently cached). - TurnCausalAnalyzer: per-batch LLM probes at tool-return boundaries compute behavioral deviation via normalized Levenshtein + Jaccard. Probe responses bounded by probe_max_chars. Never blocks — emits metric ipi.causal_deviation and SecurityEvent on threshold crossing. Config: [security.causal_ipi] enabled=false, threshold=0.7, provider="fast". Bootstrap wiring in runner.rs, daemon.rs, acp.rs via apply_enforcement_mode(), apply_three_class_classifier(), apply_causal_analyzer(). Pre-existing zeph-sanitizer/guardrail feature forwarding bug fixed as a side effect.

… unused-variable warning in bundle checks

…PI duplication - Populate InitializeResponse.auth_methods with [{type: agent, id: zeph}] using the typed builder; previously returned authMethods: [] which blocked ACP Registry inclusion (#2422) - Serve GET /agent.json with agent identity manifest (id, name, version, description, distribution) for ACP Registry discovery; gated on discovery_enabled (#2422) - Extract apply_three_class_classifier_with_cfg and apply_causal_analyzer_with_cfg helpers in agent_setup.rs; acp.rs now delegates instead of inlining construction eliminating the DRY gap from #2369 (#2370) - discovery.rs already reflects ProtocolVersion::LATEST since PR #2423 (#2412) Closes #2422, closes #2370

…PI duplication (#2431) - Populate InitializeResponse.auth_methods with [{type: agent, id: zeph}] using the typed builder; previously returned authMethods: [] which blocked ACP Registry inclusion (#2422) - Serve GET /agent.json with agent identity manifest (id, name, version, description, distribution) for ACP Registry discovery; gated on discovery_enabled (#2422) - Extract apply_three_class_classifier_with_cfg and apply_causal_analyzer_with_cfg helpers in agent_setup.rs; acp.rs now delegates instead of inlining construction eliminating the DRY gap from #2369 (#2370) - discovery.rs already reflects ProtocolVersion::LATEST since PR #2423 (#2412) Closes #2422, closes #2370

github-actions bot added documentation Improvements or additions to documentation llm zeph-llm crate (Ollama, Claude) rust Rust code changes core zeph-core crate dependencies Dependency updates enhancement New feature or request size/XL Extra large PR (500+ lines) labels Mar 28, 2026

bug-ops mentioned this pull request Mar 28, 2026

fix(security): acp.rs inlines IPI wiring instead of shared helpers (DRY gap from #2369) #2370

Closed

bug-ops force-pushed the deberta-ipi-classifier-fpr branch from a6d9992 to d432809 Compare March 28, 2026 18:48

bug-ops enabled auto-merge (squash) March 28, 2026 18:50

fix(security): gate memory_hint binding on classifiers feature to fix…

2cc9b6b

… unused-variable warning in bundle checks

bug-ops merged commit b27cf4f into main Mar 28, 2026
25 checks passed

bug-ops deleted the deberta-ipi-classifier-fpr branch March 28, 2026 19:13

bug-ops mentioned this pull request Mar 30, 2026

fix(acp): populate authMethods, add /agent.json endpoint, eliminate IPI duplication #2431

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(security): IPI defense — DeBERTa soft-signal, three-class AlignSentinel, TurnCausalAnalyzer#2369

feat(security): IPI defense — DeBERTa soft-signal, three-class AlignSentinel, TurnCausalAnalyzer#2369
bug-ops merged 2 commits intomainfrom
deberta-ipi-classifier-fpr

bug-ops commented Mar 28, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

bug-ops commented Mar 28, 2026

Summary

Test plan

Follow-up issues to file

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant