Hawk is an infrastructure system for running Inspect AI evaluations and Scout scans in Kubernetes. It provides:
- A `hawk` CLI tool for submitting evaluation and scan configurations
- A FastAPI server that orchestrates Kubernetes jobs using Helm
- Multiple Lambda functions for log processing, access control, and sample editing
- A PostgreSQL data warehouse for evaluation results
Before using Hawk, ensure you have:
- Python 3.11 or later
- uv for dependency management
- Access to a Hawk deployment. You'll need:
  - Hawk API server URL
  - Authentication credentials (OAuth2)
- For deployment: Kubernetes cluster, AWS account, Terraform 1.10+
Install the Hawk CLI:

```shell
uv pip install "hawk[cli] @ git+https://github.com/METR/inspect-action"
```

Or install from source:

```shell
git clone https://github.com/METR/inspect-action.git
cd inspect-action
uv pip install -e .[cli]
```

First, log in to your Hawk server:

```shell
hawk login
```

This will open a browser for OAuth2 authentication.

Create a simple eval config file or use an example:

```shell
hawk eval-set examples/simple.eval-set.yaml
```

Open the evaluation in your browser:

```shell
hawk web
```

Or view logs and results in the configured log viewer.
Set these before using the Hawk CLI:
| Variable | Required | Description | Example |
|---|---|---|---|
| `HAWK_API_URL` | Yes | URL of your Hawk API server | `https://hawk.example.com` |
| `INSPECT_LOG_ROOT_DIR` | Yes | S3 bucket for eval logs | `s3://my-bucket/evals` |
| `LOG_VIEWER_BASE_URL` | No | URL for web log viewer | `https://viewer.example.com` |
You can set these in a .env file in your project directory or export them in your shell:
```shell
export HAWK_API_URL=https://hawk.example.com
export INSPECT_LOG_ROOT_DIR=s3://my-bucket/evals
```

For API server and CLI authentication:

- `INSPECT_ACTION_API_MODEL_ACCESS_TOKEN_AUDIENCE`
- `INSPECT_ACTION_API_MODEL_ACCESS_TOKEN_ISSUER`
- `INSPECT_ACTION_API_MODEL_ACCESS_TOKEN_JWKS_PATH`

For log viewer authentication (can be different):

- `VITE_API_BASE_URL` (should match `HAWK_API_URL`)
- `VITE_OIDC_ISSUER`
- `VITE_OIDC_CLIENT_ID`
- `VITE_OIDC_TOKEN_PATH`
```shell
hawk eval-set examples/simple.eval-set.yaml
```

The eval set config file is a YAML file that defines a grid of tasks, solvers/agents, and models to evaluate.
See examples/simple.eval-set.yaml for a minimal working example.
```yaml
tasks:
  - package: git+https://github.com/UKGovernmentBEIS/inspect_evals
    name: inspect_evals
    items:
      - name: mbpp
        sample_ids: [1, 2, 3] # Optional: test specific samples

models:
  - package: openai
    name: openai
    items:
      - name: gpt-4o-mini
```

Agents/Solvers (`agents` is the newer name for solvers):

```yaml
agents:
  - package: git+https://github.com/METR/inspect-agents
    name: metr_agents
    items:
      - name: react
        args:
          max_attempts: 3
```

Runner Configuration:
```yaml
runner:
  secrets:
    - name: DATASET_ACCESS_KEY
      description: API key for dataset access
  environment:
    FOO_BAR: custom_value
  packages:
    - git+https://github.com/some-package # Additional packages to install
```

Inspect AI Parameters (passed to `inspect.eval_set`):

- `eval_set_id`: Custom ID (generated if not specified)
- `limit`: Maximum samples to evaluate
- `time_limit`: Per-sample time limit in seconds
- `message_limit`: Maximum messages per sample
- `epochs`: Number of evaluation epochs
- `metadata`: Custom metadata dictionary
- `tags`: List of tags for organization
For the complete schema, see hawk/core/types/evals.py or the Inspect AI documentation.
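For illustration, a sketch that sets several of these parameters together (this assumes they sit at the top level of the eval set config, next to `tasks` and `models`; all values are placeholders):

```yaml
eval_set_id: mbpp-smoke-test # Custom ID; generated if omitted
limit: 20                    # Evaluate at most 20 samples
epochs: 2                    # Run each sample twice
time_limit: 600              # 10-minute per-sample limit
tags: [smoke-test]
metadata:
  owner: eval-team
```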
Use `--secret` or `--secrets-file` to pass environment variables to your evaluation:

```shell
# Single variable
hawk eval-set config.yaml --secret MY_API_KEY

# From file
hawk eval-set config.yaml --secrets-file .env

# Multiple files
hawk eval-set config.yaml --secrets-file .env --secrets-file .secrets.local
```

Secrets file format:

```shell
# .secrets
DATASET_API_KEY=your_key_here
CUSTOM_MODEL_KEY=another_key
```

API Keys: By default, Hawk uses a managed LLM proxy for OpenAI, Anthropic, and Google Vertex models. For other providers, pass API keys as secrets. You can override the proxy by setting `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, or `VERTEX_API_KEY` as secrets.
Required Secrets: Declare required secrets in your config using runner.secrets to prevent jobs from starting with missing credentials.
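For example, declaring a required secret in the config and supplying it at submit time (a sketch reusing the `DATASET_ACCESS_KEY` example above):

```yaml
# config.yaml: the job will not start unless DATASET_ACCESS_KEY is provided
runner:
  secrets:
    - name: DATASET_ACCESS_KEY
      description: API key for dataset access
```

Then submit with `hawk eval-set config.yaml --secret DATASET_ACCESS_KEY`, or load the value from a secrets file with `--secrets-file`.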
```shell
hawk scan examples/simple.scan.yaml
```

Like the eval set config file, the `SCAN_CONFIG_FILE` is a YAML file that defines a scan run.
```yaml
name: my-scan # An optional pretty name for the scan run
scanners:
  - package: git+https://github.com/METR/inspect-agents
    name: metr_agents
    items:
      - name: reward_hacking_scanner
models:
  - package: openai
    name: openai
    items:
      - name: gpt-5
packages:
  # Any other packages to install in the venv where the job will run
  - git+https://github.com/DanielPolatajko/inspect_wandb[weave]
transcripts:
  sources:
    - eval_set_id: inspect-eval-set-s6m74hwcd7jag1gl
  filter:
    where:
      - eval_status: success
    limit: 10
    shuffle: true
metadata: dict[str, Any] | null
tags: list[str] | null
```

You can specify `scanners[].items[].key` to assign unique keys to different instances of the same scanner, e.g. to run it with different arguments.
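For example, two instances of the same scanner distinguished by key (a sketch: `threshold` is a hypothetical scanner argument, and the `args` placement mirrors the agent config above):

```yaml
scanners:
  - package: git+https://github.com/METR/inspect-agents
    name: metr_agents
    items:
      - name: reward_hacking_scanner
        key: strict # Unique key for this instance
        args:
          threshold: 0.9 # Hypothetical argument
      - name: reward_hacking_scanner
        key: lenient
        args:
          threshold: 0.5
```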
Scans analyze transcripts (execution logs) from previous evaluations. Use filters to select specific samples.
Common filter examples:
```yaml
transcripts:
  sources:
    - eval_set_id: inspect-eval-set-abc123
  filter:
    where:
      - eval_status: success # Only successful runs
      - score: {gt: 0.5} # Score above 0.5
      - model: {like: "gpt-4%"} # GPT-4 models only
      - metadata.agent.name: react # Nested metadata access
    limit: 100
    shuffle: true
```

Available filter operators:

- Equality: `field: value` or `field: [val1, val2]` (IN list)
- Comparison: `{gt: 0.5}`, `{ge: 0.5}`, `{lt: 1.0}`, `{le: 1.0}`, `{between: [0.5, 1.0]}`
- Pattern matching: `{like: "pattern"}`, `{ilike: "PATTERN"}` (case-insensitive)
- Logical: `{not: condition}`, `{or: [cond1, cond2]}`
- Null checks: `field: null`
Per-scanner filters: Use scanners[].items[].filter to override the global filter for specific scanners.
For the complete filter syntax, see hawk/core/types/scans.py.
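As an illustration, the operators can be combined. The following sketch (assuming `or` takes a list of field conditions, as the syntax above suggests) selects successful GPT-4 runs with extreme scores:

```yaml
transcripts:
  sources:
    - eval_set_id: inspect-eval-set-abc123
  filter:
    where:
      - eval_status: success
      - model: {like: "gpt-4%"}
      # Either a very low or a very high score
      - or:
          - score: {lt: 0.3}
          - score: {gt: 0.9}
    limit: 50
```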
This repository provides a Terraform module for deploying Hawk to AWS. The infrastructure includes:
- ECS Fargate for the FastAPI server
- EKS for running evaluation jobs
- Aurora PostgreSQL for the data warehouse
- Lambda functions for log processing and access control
- S3 for log storage
To deploy Hawk, reference the terraform/ directory as a module in your infrastructure Terraform project and deploy through your infrastructure pipeline (e.g., Spacelift).
See CONTRIBUTING.md for instructions on updating Inspect AI/Scout versions and running smoke tests.
```shell
hawk login              # Log in via OAuth2 Device Authorization flow
hawk auth access-token  # Print valid access token to stdout
hawk auth refresh-token # Print current refresh token
```

`hawk eval-set CONFIG.yaml [OPTIONS]`

Run an Inspect eval set remotely. The config file contains a grid of tasks, solvers, and models.
Options:
| Option | Description |
|---|---|
| `--image-tag TEXT` | Specify runner image tag |
| `--secrets-file FILE` | Load environment variables from secrets file (can be repeated) |
| `--secret TEXT` | Pass environment variable as secret (can be repeated) |
| `--skip-confirm` | Skip confirmation prompt for unknown config warnings |
| `--log-dir-allow-dirty` | Allow unrelated eval logs in log directory |
Example:
```shell
hawk eval-set examples/simple.eval-set.yaml --secret OPENAI_API_KEY
```

`hawk scan CONFIG.yaml [OPTIONS]`

Run a Scout scan remotely. The config file contains a matrix of scanners and models.
Options:
| Option | Description |
|---|---|
| `--image-tag TEXT` | Specify runner image tag |
| `--secrets-file FILE` | Load environment variables from secrets file (can be repeated) |
| `--secret TEXT` | Pass environment variable as secret (can be repeated) |
| `--skip-confirm` | Skip confirmation prompt for unknown config warnings |
Example:
```shell
hawk scan examples/simple.scan.yaml
```

```shell
hawk delete [EVAL_SET_ID]    # Delete eval set and clean up resources (not logs)
hawk web [EVAL_SET_ID]       # Open eval set in web browser
hawk view-sample SAMPLE_UUID # Open specific sample in web browser
```

If `EVAL_SET_ID` is not provided, uses the last eval set ID from the current session.

`hawk edit-samples EDITS_FILE`

Submit sample edits to the Hawk API. Accepts JSON or JSONL files.
JSON format:
```json
[
  {"sample_uuid": "...", "details": {"type": "score_edit", ...}},
  {"sample_uuid": "...", "details": {"type": "invalidate_sample", ...}}
]
```

JSONL format:

```json
{"sample_uuid": "...", "details": {"type": "score_edit", ...}}
{"sample_uuid": "...", "details": {"type": "invalidate_sample", ...}}
```