research: automated adversarial red-teaming with DeepTeam

## Research Finding

**Applicability**: Medium | **Complexity**: Simple

## Problem

Zeph has ContentSanitizer, ExfiltrationGuard, and shell sandbox defenses, but testing is manual (current regression suite REG-006 etc). New attack vectors in skill bodies, MCP tool descriptions, or memory writes may bypass existing defenses undetected between testing cycles.

## Proposed Approach

Run an external black-box test harness against the live agent via ACP HTTP+SSE or daemon A2A endpoint.

**Tool options (evaluate both, pick winner for CI integration):**

### Option A: DeepTeam
Source: https://github.com/confident-ai/deepteam

1. Start `cargo run --features full -- --daemon` (A2A endpoint at `/a2a`)
2. Run `deepteam test --target http://localhost:8080/a2a` with vulnerability classes:
   - Prompt injection (via tool output, skill body, memory recall)
   - Jailbreak via role-play
   - Data exfiltration via markdown images / tool URLs
   - Goal hijacking via adversarial memory saves
3. Score agent responses; file issues for any bypasses found

### Option B: Promptfoo
Source: https://github.com/promptfoo/promptfoo

Works as a black-box tester — 50+ vulnerability types: prompt injection, jailbreaks, tool misuse, authorization bypass. YAML config, CI/CD integration. Can target Zeph's daemon HTTP endpoint (`/a2a`) and ACP HTTP+SSE transport without any Rust SDK.

1. Create Promptfoo test config (YAML) targeting daemon `/a2a` endpoint
2. Define red-team scenarios: prompt injection via tool outputs, tool misuse escalation, sandbox bypass attempts, memory poisoning
3. Evaluate results vs DeepTeam; pick one (or both) for CI integration

## Integration Points

- No code changes required initially — pure external test harness
- `.local/testing/playbooks/redteam.md`: test procedure and vulnerability class selection
- CI integration: optional periodic GitHub Actions job (separate from main CI)
- Issues filed for any bypasses with `security` label

## References

- DeepTeam: https://github.com/confident-ai/deepteam
- Promptfoo: https://github.com/promptfoo/promptfoo, https://www.promptfoo.dev/docs/red-team/agents/
- X-Teaming (coordinated red-teaming): https://arxiv.org/abs/2503.16882
- AgentAssay behavioral fingerprinting: https://arxiv.org/html/2603.02601
- Anthropic Petri framework for autonomous red-teaming
- ATA adversarial harness: #1823

## Internal Meta-Agent Harness (absorbed from #1823)

Build an ATA-style harness on top of `AgentTestHarness` (ARCH-08):
1. **Catalog introspection**: load skill registry + tool definitions to seed scenario generation
2. **Scenario generation**: use summary_model to generate adversarial prompts targeting memory recall, tool edge cases, skill matching, security injection
3. **Adaptive difficulty**: LLM judge scores responses; high-scoring scenarios escalated
4. **Output**: structured test cases in `regressions.md` format

**Source**: ATA (arXiv:2508.17393, August 2025)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

research: automated adversarial red-teaming with DeepTeam #1610

Research Finding

Problem

Proposed Approach

Option A: DeepTeam

Option B: Promptfoo

Integration Points

References

Internal Meta-Agent Harness (absorbed from #1823)

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

research: automated adversarial red-teaming with DeepTeam #1610

Description

Research Finding

Problem

Proposed Approach

Option A: DeepTeam

Option B: Promptfoo

Integration Points

References

Internal Meta-Agent Harness (absorbed from #1823)

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions