Evaluation framework for goal-seeking AI agents. Tests memory recall, tool use, planning, and reasoning across progressive difficulty levels (L1-L12).
- Long-horizon memory stress tests: Generates 1000+ turn dialogues with embedded facts, then quizzes the agent on details from various points in the conversation
- Hybrid grading: Deterministic (rubric keywords) + LLM (semantic judgment) with multi-vote stability
- Progressive difficulty levels: L1 (simple recall) through L12 (far transfer reasoning)
- Agent-agnostic: Works with any agent through the `AgentAdapter` interface
- Self-improvement loop: Automated EVAL -> ANALYZE -> PROPOSE -> CHALLENGE -> VOTE -> APPLY -> RE-EVAL cycle
- Multi-seed holdout: Run across multiple random seeds to measure inter-seed variance
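The hybrid-grading and multi-vote bullets above can be sketched as follows. This is a hypothetical illustration of the idea, not the package's actual grader: a deterministic keyword-rubric score is averaged with the median of several semantic-judgment votes (simulated here as plain floats), so a single outlier vote cannot swing the grade.

```python
from statistics import median

def keyword_score(answer: str, rubric_keywords: list[str]) -> float:
    """Deterministic component: fraction of rubric keywords present in the answer."""
    hits = sum(1 for kw in rubric_keywords if kw.lower() in answer.lower())
    return hits / len(rubric_keywords)

def stable_grade(answer: str, rubric_keywords: list[str], llm_votes: list[float]) -> float:
    """Combine the deterministic score with the median of multiple LLM votes.

    `llm_votes` stands in for the scores an LLM judge would return; the median
    damps out a single outlier vote. Equal weighting is an arbitrary choice
    for illustration.
    """
    return (keyword_score(answer, rubric_keywords) + median(llm_votes)) / 2

score = stable_grade(
    "The deploy failed because the cache was stale.",
    rubric_keywords=["deploy", "cache", "stale"],
    llm_votes=[0.9, 0.85, 0.2],  # one outlier vote, median ignores it
)
print(round(score, 3))  # -> 0.925
```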
```bash
# Basic installation (datasets, reports, HTTP/subprocess adapters; no LLM grading)
pip install amplihack-agent-eval

# With Anthropic grading support
pip install amplihack-agent-eval[anthropic]

# With Azure/Event Hubs distributed-hive adapter support
pip install amplihack-agent-eval[azure]

# Development
pip install amplihack-agent-eval[dev]

# Everything
pip install amplihack-agent-eval[all,dev]
```

`learning-agent`, `continuous`, and `python -m amplihack_eval.azure.eval_distributed` all import the sibling `amplihack` package. This repo does not declare that dependency directly because the main repo already depends on amplihack-agent-eval. Install a sibling checkout of `amplihack` when you need those surfaces.
```python
from amplihack_eval import AgentAdapter, AgentResponse

class MyMemoryAgent(AgentAdapter):
    def __init__(self):
        self.memory = []

    def learn(self, content: str) -> None:
        self.memory.append(content)

    def answer(self, question: str) -> AgentResponse:
        # Your agent's retrieval + reasoning logic here
        relevant = [m for m in self.memory if any(w in m.lower() for w in question.lower().split())]
        return AgentResponse(answer=" ".join(relevant[:3]) if relevant else "I don't know")

    def reset(self) -> None:
        self.memory.clear()

    def close(self) -> None:
        pass
```

```python
from amplihack_eval import EvalRunner

agent = MyMemoryAgent()
runner = EvalRunner(num_turns=100, num_questions=20, grader_votes=3)
report = runner.run(agent)

print(f"Overall score: {report.overall_score:.2%}")
for cb in report.category_breakdown:
    print(f"  {cb.category}: {cb.avg_score:.2%}")
```

```bash
# Run eval against an HTTP agent
amplihack-eval run --turns 100 --questions 20 --adapter http --agent-url http://localhost:8000

# Run eval with amplihack's LearningAgent (requires sibling amplihack install)
amplihack-eval run --turns 100 --questions 20 --adapter learning-agent

# Multi-seed comparison
amplihack-eval compare --seeds 42,123,456,789 --turns 100

# Self-improvement loop
amplihack-eval self-improve --iterations 5 --turns 100
```

The `AgentAdapter` is the core abstraction. Implement these four methods to make any agent evaluable:
| Method | Purpose |
|---|---|
| `learn(content: str)` | Feed content to the agent for learning/memorization |
| `answer(question: str) -> AgentResponse` | Ask the agent a question |
| `reset()` | Reset agent state between eval runs |
| `close()` | Clean up resources |
Optional properties:

- `capabilities -> set[str]`: Declare what the agent can do (default: `{"memory"}`)
- `name -> str`: Human-readable name (default: class name)
| Adapter | Use case |
|---|---|
| `HttpAdapter` | Any agent with a REST API (`POST /learn`, `POST /answer`, `POST /reset`) |
| `SubprocessAdapter` | Any agent invokable via CLI subprocess |
| `LearningAgentAdapter` | amplihack's LearningAgent (requires the `amplihack` package) |
| `DistributedHiveAdapter` | Azure/Event Hubs distributed hive fleet (`amplihack-agent-eval[azure]`) |
See docs/adapters.md for the complete adapter writing guide.
| Level | Name | Tests |
|---|---|---|
| L1 | Single Source Direct Recall | Basic fact retrieval from a single source |
| L2 | Multi-Source Synthesis | Combining information across multiple sources |
| L3 | Temporal Reasoning | Understanding changes over time, computing differences |
| L4 | Procedural Learning | Learning and applying step-by-step procedures |
| L5 | Contradiction Handling | Detecting and reasoning about conflicting information |
| L6 | Incremental Learning | Updating knowledge when new information arrives |
| Level | Name | Tests |
|---|---|---|
| L7 | Teaching Session | Agent learns, then teaches; graded on teaching accuracy |
| Level | Name | Tests |
|---|---|---|
| L8 | Confidence Calibration | Knowing what you know vs. don't know |
| L9 | Causal Reasoning | Identifying causal chains and root causes |
| L10 | Counterfactual Reasoning | "What if X didn't happen?" reasoning |
| Level | Name | Tests |
|---|---|---|
| L11 | Novel Skill Acquisition | Learning genuinely new skills from documentation |
| L12 | Far Transfer | Applying learned reasoning patterns to new domains |
See docs/levels.md for detailed descriptions of each level.
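To make the level distinctions concrete, here is a hypothetical sketch of how an L3 (Temporal Reasoning) item differs from L1 recall: the generator plants related values at two points in the dialogue, and the question requires computing the change rather than retrieving a single value. The `Fact` structure and field names are illustrative, not the package's actual data schema.

```python
from dataclasses import dataclass

@dataclass
class Fact:
    turn: int   # where in the long dialogue the fact was embedded
    text: str

# Facts planted far apart in a long-horizon dialogue
facts = [
    Fact(turn=12, text="In January the server count was 40."),
    Fact(turn=850, text="By June the server count had grown to 65."),
]

# L1 (Single Source Direct Recall): retrieve one planted value
l1_question = "What was the server count in January?"
l1_expected = "40"

# L3 (Temporal Reasoning): relate two values across time
l3_question = "By how much did the server count change between January and June?"
l3_expected = str(65 - 40)  # agent must combine both facts and compute the delta

print(l3_expected)  # -> 25
```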
Main evaluation runner for long-horizon memory stress tests.
```python
runner = EvalRunner(
    num_turns=100,     # Number of dialogue turns to generate
    num_questions=20,  # Number of questions to ask
    grader_votes=3,    # Multi-vote grading (take median)
    seed=42,           # Random seed for reproducibility
)
report = runner.run(agent)  # Returns EvalReport
```

Aggregate evaluation results.

```python
report.overall_score       # float: 0.0 to 1.0
report.results             # list[EvalResult]: per-question results
report.category_breakdown  # list[CategoryBreakdown]: per-category averages
report.metadata            # dict: run configuration
```

Per-question evaluation result.
```python
result.question_id      # str: unique question identifier
result.question_text    # str: the question asked
result.expected_answer  # str: ground truth answer
result.actual_answer    # str: agent's answer
result.overall_score    # float: 0.0 to 1.0
result.dimensions       # list[DimensionScore]: per-dimension scores
result.category         # str: question category
```

Result from the hybrid grader.
```python
grade = grade_answer(question, expected, actual, votes=3)

grade.score        # float: 0.0 to 1.0
grade.reasoning    # str: explanation of the grade
grade.vote_scores  # list[float] | None: individual vote scores
```

```python
from amplihack_eval.data import generate_dialogue, generate_questions

# Generate a reproducible dialogue
ground_truth = generate_dialogue(num_turns=100, seed=42)

# Generate questions from the dialogue
questions = generate_questions(ground_truth, num_questions=20)
```

```python
from amplihack_eval.self_improve.runner import SelfImproveConfig, SelfImproveRunner

config = SelfImproveConfig(max_iterations=5, num_turns=100)
runner = SelfImproveRunner(config)
result = runner.run(agent_factory=lambda: MyAgent())
```

See docs/self-improvement.md for the complete self-improvement guide.
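Conceptually, the runner drives the EVAL -> ANALYZE -> PROPOSE -> CHALLENGE -> VOTE -> APPLY -> RE-EVAL cycle from the feature list, and `agent_factory` is called each iteration so every evaluation starts from a fresh agent. The skeleton below illustrates that control flow under those assumptions; the stage functions are placeholders, not the package's implementation.

```python
from typing import Callable, Optional

def self_improve_loop(
    agent_factory: Callable[[], object],
    evaluate: Callable[[object], float],
    propose_patch: Callable[[float], Optional[str]],
    apply_patch: Callable[[str], None],
    max_iterations: int = 5,
) -> float:
    """Illustrative EVAL -> PROPOSE -> APPLY -> RE-EVAL skeleton.

    A fresh agent per iteration keeps scores comparable across patches.
    """
    best = evaluate(agent_factory())         # EVAL: baseline score
    for _ in range(max_iterations):
        patch = propose_patch(best)          # ANALYZE + PROPOSE (e.g. an LLM)
        if patch is None:                    # nothing left to try
            break
        apply_patch(patch)                   # APPLY (after CHALLENGE/VOTE gates)
        score = evaluate(agent_factory())    # RE-EVAL on a fresh agent
        best = max(best, score)
    return best
```

Calling `agent_factory` inside the loop, rather than reusing one agent, is what makes before/after scores attributable to the patch instead of accumulated memory.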
```python
from amplihack_eval import AgentAdapter, AgentResponse, ToolCall

class MyCustomAgent(AgentAdapter):
    """Adapter for my custom agent."""

    def __init__(self, config):
        self.config = config
        self.client = MyAgentClient(config)

    def learn(self, content: str) -> None:
        self.client.ingest(content)

    def answer(self, question: str) -> AgentResponse:
        result = self.client.query(question)
        return AgentResponse(
            answer=result.text,
            tool_calls=[
                ToolCall(
                    tool_name=tc.name,
                    arguments=tc.args,
                    result=tc.output,
                )
                for tc in result.tool_calls
            ],
            reasoning_trace=result.chain_of_thought,
            confidence=result.confidence,
        )

    def reset(self) -> None:
        self.client.clear_memory()

    def close(self) -> None:
        self.client.shutdown()

    @property
    def capabilities(self) -> set[str]:
        return {"memory", "tool_use", "planning"}

    @property
    def name(self) -> str:
        return f"MyAgent(v{self.config.version})"
```

| Variable | Purpose | Default |
|---|---|---|
| `ANTHROPIC_API_KEY` | Required for LLM grading | - |
| `GRADER_MODEL` | Model for grading | `claude-sonnet-4-5-20250929` |
| `EVAL_MODEL` | Model for LearningAgent adapter | `claude-sonnet-4-5-20250929` |
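A configuration reader matching that table could resolve the environment like this; this is a sketch of the fallback behavior, and the package may resolve these variables differently internally.

```python
import os

def grading_config() -> dict[str, str]:
    """Resolve grading settings from the environment, falling back to table defaults."""
    key = os.environ.get("ANTHROPIC_API_KEY")
    if key is None:
        # Fail fast with a clear message instead of erroring mid-eval
        raise RuntimeError("ANTHROPIC_API_KEY is required for LLM grading")
    return {
        "api_key": key,
        "grader_model": os.environ.get("GRADER_MODEL", "claude-sonnet-4-5-20250929"),
        "eval_model": os.environ.get("EVAL_MODEL", "claude-sonnet-4-5-20250929"),
    }
```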
```
src/amplihack_eval/
    __init__.py                # Public API exports
    cli.py                     # CLI entry point (amplihack-eval)
    py.typed                   # PEP 561 type checking marker
    adapters/
        base.py                # AgentAdapter interface + ToolCall + AgentResponse
        http_adapter.py        # HTTP REST adapter
        subprocess_adapter.py  # CLI subprocess adapter
        learning_agent.py      # amplihack LearningAgent adapter
    core/
        runner.py              # EvalRunner (long-horizon memory eval)
        grader.py              # Hybrid deterministic + LLM grading
        multi_seed.py          # Multi-seed holdout evaluation
    data/
        long_horizon.py        # 5000-turn dialogue generator
        progressive_levels.py  # L1-L12 level definitions
    self_improve/
        runner.py              # Self-improvement loop orchestrator
        patch_proposer.py      # LLM-powered patch generation
        reviewer_voting.py     # 3-reviewer A/B voting
    levels/                    # Convenience re-export of level definitions
    multi_agent_eval/          # Multi-agent scenarios (future)
docs/
    index.md                   # Documentation landing page (GitHub Pages)
    architecture.md            # Package architecture overview
    adapters.md                # How to write custom AgentAdapters
    levels.md                  # Complete guide to all eval levels
    self-improvement.md        # How the self-improvement loop works
    multi-agent-eval.md        # Multi-agent eval architecture
tests/
    test_adapters.py           # Adapter interface tests
    test_data_generation.py    # Data generator tests
```
```bash
# Clone the repository
git clone https://github.com/rysweet/amplihack-agent-eval.git
cd amplihack-agent-eval

# Install in development mode
pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install

# Run tests
pytest tests/ -q

# Run linting
ruff check src/ tests/
ruff format --check src/ tests/
```

- All code must pass `ruff check` and `ruff format` checks
- New features require tests in `tests/`
- Follow existing code patterns (dataclasses, type hints, docstrings)
- The `AgentAdapter` interface is the public contract -- changes require careful consideration
- Use `from __future__ import annotations` in all modules (Python 3.10 compatibility)
- Create a feature branch from `main`
- Make your changes with tests
- Ensure CI passes (lint + format + tests across Python 3.10-3.12)
- Open a PR with a clear description of changes
MIT