research(testing): ATA-style meta-agent harness for adversarial behavioral test generation #1823

@bug-ops

Source

Agent-Testing Agent (ATA): Meta-Agent for Adversarial Behavioral Testing
https://arxiv.org/abs/2508.17393 — August 2025

Summary

ATA is a meta-agent that combines static analysis, designer interrogation, and persona-driven adversarial test generation with adaptive difficulty controlled by an LLM-as-judge scoring rubric. It generates behavioral test cases for conversational agents rather than relying on hand-written scenarios.
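One way to picture the LLM-as-judge rubric is as a weighted sum over per-criterion ratings. The sketch below is illustrative only: the criterion names, weights, and the 1 to 5 rating scale are assumptions made for this example, not values taken from the ATA paper or from Zeph.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float


# Hypothetical rubric: names and weights are assumptions for illustration.
RUBRIC = (
    Criterion("task_completion", 0.5),
    Criterion("safety", 0.3),
    Criterion("tool_use", 0.2),
)


def weighted_score(ratings: dict[str, int], rubric=RUBRIC, scale_max: int = 5) -> float:
    """Combine per-criterion ratings (1..scale_max) into one score in [0, 1]."""
    total_weight = sum(c.weight for c in rubric)
    score = 0.0
    for c in rubric:
        rating = ratings[c.name]
        if not 1 <= rating <= scale_max:
            raise ValueError(f"rating for {c.name!r} out of range: {rating}")
        # Map the 1..scale_max rating onto 0..1 before applying the weight.
        score += c.weight * (rating - 1) / (scale_max - 1)
    return score / total_weight
```

In a real harness the ratings would come from the judge model's structured output; here they are plain integers so the aggregation step can be tested on its own.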

Applicability to Zeph

HIGH. Zeph's continuous improvement protocol (.claude/rules/continuous-improvement.md) explicitly requires live agent testing but currently relies entirely on manually crafted scenarios. The gap between unit tests and real behavioral testing is the main bottleneck in the continuous improvement cycle.

Proposed integration

Build an ATA-style harness on top of AgentTestHarness (already in the codebase from ARCH-08):

  1. Catalog introspection: load Zeph's skill registry + tool definitions to seed scenario generation
  2. Scenario generation: use a separate LLM (e.g., summary_model) to generate adversarial prompts targeting:
    • Memory recall boundary conditions (just-expired memories, conflicting facts)
    • Tool invocation edge cases (large output → overflow, permission denial, tool chaining)
    • Skill matching precision (ambiguous queries that should/shouldn't match)
    • Security injection attempts (prompt injection in tool results, web scrape content)
  3. Adaptive difficulty: an LLM judge scores agent responses; scenarios the agent handles well (high judge score) are escalated to harder variants
  4. Output: structured test cases in regressions.md format with expected behavior labels
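The escalation loop in step 3 can be sketched as follows. This is a minimal sketch under stated assumptions, not the AgentTestHarness API: `Scenario`, `run_adaptive`, the category names, the 0.8 pass threshold, and the difficulty cap are all hypothetical, and the agent, judge, and hardening step are passed in as plain callables so that real LLM calls can be swapped in later.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical labels for the four target areas listed in step 2.
CATEGORIES = ("memory_recall", "tool_invocation", "skill_matching", "security_injection")


@dataclass(frozen=True)
class Scenario:
    category: str
    prompt: str
    difficulty: int = 1


def run_adaptive(
    seed: Scenario,
    agent: Callable[[str], str],
    judge: Callable[[Scenario, str], float],
    harden: Callable[[Scenario], Scenario],
    pass_threshold: float = 0.8,  # assumed cutoff for "the agent handled this"
    max_difficulty: int = 3,      # assumed cap so the loop always terminates
) -> list[tuple[Scenario, float]]:
    """Escalate a scenario while the agent keeps scoring at or above the threshold."""
    history = []
    current = seed
    while True:
        score = judge(current, agent(current.prompt))
        history.append((current, score))
        # Stop at the first failure (a regression candidate) or at the cap.
        if score < pass_threshold or current.difficulty >= max_difficulty:
            return history
        harder = harden(current)
        if harder.difficulty <= current.difficulty:
            raise ValueError("harden() must increase difficulty")
        current = harder
```

The last entry in the returned history is the interesting one: either the first variant the agent failed, or the hardest variant it still passed.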

Location

  • New binary or subcommand: zeph test-gen (or --test-gen)
  • Stores generated scenarios in .local/testing/playbooks/generated/
  • Integrates with AgentTestHarness for execution and response capture
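As a rough illustration of the storage step, the sketch below writes one file per generated scenario under the directory named above. The regressions.md layout is not shown in this issue, so JSON is used here purely as a stand-in; `save_scenario` and the `category` field are hypothetical names.

```python
import hashlib
import json
from pathlib import Path

# Directory taken from the proposal above.
GENERATED_DIR = Path(".local/testing/playbooks/generated")


def save_scenario(scenario: dict, root: Path = GENERATED_DIR) -> Path:
    """Write a scenario to a content-addressed file and return its path."""
    root.mkdir(parents=True, exist_ok=True)
    payload = json.dumps(scenario, indent=2, sort_keys=True)
    # Hash the canonical JSON so regenerating the same scenario reuses the
    # same file name instead of piling up duplicates.
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
    path = root / f"{scenario['category']}-{digest}.json"
    path.write_text(payload + "\n", encoding="utf-8")
    return path
```

Content-addressed names keep repeated generator runs idempotent, which matters if the generated directory is ever committed or diffed.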

Labels

P4 (Long-term / exploratory), research (Research-driven improvement)
