Skip to content

research: automated adversarial red-teaming with DeepTeam #1610

@bug-ops

Description

@bug-ops

Research Finding

Applicability: Medium | Complexity: Simple

Problem

Zeph has ContentSanitizer, ExfiltrationGuard, and shell sandbox defenses, but testing is manual (current regression suite REG-006 etc). New attack vectors in skill bodies, MCP tool descriptions, or memory writes may bypass existing defenses undetected between testing cycles.

Proposed Approach

Run an external black-box test harness against the live agent via ACP HTTP+SSE or daemon A2A endpoint.

Tool options (evaluate both, pick winner for CI integration):

Option A: DeepTeam

Source: https://github.com/confident-ai/deepteam

  1. Start cargo run --features full -- --daemon (A2A endpoint at /a2a)
  2. Run deepteam test --target http://localhost:8080/a2a with vulnerability classes:
    • Prompt injection (via tool output, skill body, memory recall)
    • Jailbreak via role-play
    • Data exfiltration via markdown images / tool URLs
    • Goal hijacking via adversarial memory saves
  3. Score agent responses; file issues for any bypasses found

Option B: Promptfoo

Source: https://github.com/promptfoo/promptfoo

Works as a black-box tester — 50+ vulnerability types: prompt injection, jailbreaks, tool misuse, authorization bypass. YAML config, CI/CD integration. Can target Zeph's daemon HTTP endpoint (/a2a) and ACP HTTP+SSE transport without any Rust SDK.

  1. Create Promptfoo test config (YAML) targeting daemon /a2a endpoint
  2. Define red-team scenarios: prompt injection via tool outputs, tool misuse escalation, sandbox bypass attempts, memory poisoning
  3. Evaluate results vs DeepTeam; pick one (or both) for CI integration

Integration Points

  • No code changes required initially — pure external test harness
  • .local/testing/playbooks/redteam.md: test procedure and vulnerability class selection
  • CI integration: optional periodic GitHub Actions job (separate from main CI)
  • Issues filed for any bypasses with security label

References

Internal Meta-Agent Harness (absorbed from #1823)

Build an ATA-style harness on top of AgentTestHarness (ARCH-08):

  1. Catalog introspection: load skill registry + tool definitions to seed scenario generation
  2. Scenario generation: use summary_model to generate adversarial prompts targeting memory recall, tool edge cases, skill matching, security injection
  3. Adaptive difficulty: LLM judge scores responses; high-scoring scenarios escalated
  4. Output: structured test cases in regressions.md format

Source: ATA (arXiv:2508.17393, August 2025)

Metadata

Metadata

Assignees

No one assigned

    Labels

    P4Long-term / exploratorybacklogDeferred work, no active sprintresearchResearch-driven improvementsecuritySecurity-related issue

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions