-
Notifications
You must be signed in to change notification settings - Fork 2
research: automated adversarial red-teaming with DeepTeam #1610
Description
Research Finding
Applicability: Medium | Complexity: Simple
Problem
Zeph has ContentSanitizer, ExfiltrationGuard, and shell sandbox defenses, but testing is manual (current regression suite REG-006 etc). New attack vectors in skill bodies, MCP tool descriptions, or memory writes may bypass existing defenses undetected between testing cycles.
Proposed Approach
Run an external black-box test harness against the live agent via ACP HTTP+SSE or daemon A2A endpoint.
Tool options (evaluate both, pick winner for CI integration):
Option A: DeepTeam
- Start
cargo run --features full -- --daemon(A2A endpoint at/a2a) - Run
deepteam test --target http://localhost:8080/a2awith vulnerability classes:- Prompt injection (via tool output, skill body, memory recall)
- Jailbreak via role-play
- Data exfiltration via markdown images / tool URLs
- Goal hijacking via adversarial memory saves
- Score agent responses; file issues for any bypasses found
Option B: Promptfoo
Works as a black-box tester — 50+ vulnerability types: prompt injection, jailbreaks, tool misuse, authorization bypass. YAML config, CI/CD integration. Can target Zeph's daemon HTTP endpoint (/a2a) and ACP HTTP+SSE transport without any Rust SDK.
- Create Promptfoo test config (YAML) targeting daemon
/a2aendpoint - Define red-team scenarios: prompt injection via tool outputs, tool misuse escalation, sandbox bypass attempts, memory poisoning
- Evaluate results vs DeepTeam; pick one (or both) for CI integration
Integration Points
- No code changes required initially — pure external test harness
.local/testing/playbooks/redteam.md: test procedure and vulnerability class selection- CI integration: optional periodic GitHub Actions job (separate from main CI)
- Issues filed for any bypasses with
securitylabel
References
- DeepTeam: https://github.com/confident-ai/deepteam
- Promptfoo: https://github.com/promptfoo/promptfoo, https://www.promptfoo.dev/docs/red-team/agents/
- X-Teaming (coordinated red-teaming): https://arxiv.org/abs/2503.16882
- AgentAssay behavioral fingerprinting: https://arxiv.org/html/2603.02601
- Anthropic Petri framework for autonomous red-teaming
- ATA adversarial harness: research(testing): ATA-style meta-agent harness for adversarial behavioral test generation #1823
Internal Meta-Agent Harness (absorbed from #1823)
Build an ATA-style harness on top of AgentTestHarness (ARCH-08):
- Catalog introspection: load skill registry + tool definitions to seed scenario generation
- Scenario generation: use summary_model to generate adversarial prompts targeting memory recall, tool edge cases, skill matching, security injection
- Adaptive difficulty: LLM judge scores responses; high-scoring scenarios escalated
- Output: structured test cases in
regressions.mdformat
Source: ATA (arXiv:2508.17393, August 2025)