Hawk - Inspect AI Infrastructure

Hawk is an infrastructure system for running Inspect AI evaluations and Scout scans in Kubernetes. It provides:

  • A hawk CLI tool for submitting evaluation and scan configurations
  • A FastAPI server that orchestrates Kubernetes jobs using Helm
  • Multiple Lambda functions for log processing, access control, and sample editing
  • A PostgreSQL data warehouse for evaluation results

Prerequisites

Before using Hawk, ensure you have:

  • Python 3.11 or later
  • uv for dependency management
  • Access to a Hawk deployment, which requires:
    • Hawk API server URL
    • Authentication credentials (OAuth2)
  • For deployment: Kubernetes cluster, AWS account, Terraform 1.10+
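
A quick sanity check of the local tooling:

python3 --version   # Must report 3.11 or later
uv --version        # Used for dependency management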

Installation

Install the Hawk CLI:

uv pip install "hawk[cli] @ git+https://github.com/METR/inspect-action"

Or install from source:

git clone https://github.com/METR/inspect-action.git
cd inspect-action
uv pip install -e .[cli]

Quick Start

1. Authenticate

First, log in to your Hawk server:

hawk login

This will open a browser for OAuth2 authentication.

2. Run Your First Evaluation

Create a simple eval config file or use an example:

hawk eval-set examples/simple.eval-set.yaml

3. View Results

Open the evaluation in your browser:

hawk web

Or view logs and results in the configured log viewer.

Configuration

Required Environment Variables

Set these before using the Hawk CLI:

Variable              Required  Description                  Example
HAWK_API_URL          Yes       URL of your Hawk API server  https://hawk.example.com
INSPECT_LOG_ROOT_DIR  Yes       S3 bucket for eval logs      s3://my-bucket/evals
LOG_VIEWER_BASE_URL   No        URL for web log viewer       https://viewer.example.com

You can set these in a .env file in your project directory or export them in your shell:

export HAWK_API_URL=https://hawk.example.com
export INSPECT_LOG_ROOT_DIR=s3://my-bucket/evals
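
The same values in .env form (example values, not real endpoints):

# .env
HAWK_API_URL=https://hawk.example.com
INSPECT_LOG_ROOT_DIR=s3://my-bucket/evals
LOG_VIEWER_BASE_URL=https://viewer.example.com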

Authentication Variables

For API server and CLI authentication:

  • INSPECT_ACTION_API_MODEL_ACCESS_TOKEN_AUDIENCE
  • INSPECT_ACTION_API_MODEL_ACCESS_TOKEN_ISSUER
  • INSPECT_ACTION_API_MODEL_ACCESS_TOKEN_JWKS_PATH

For log viewer authentication (can be different):

  • VITE_API_BASE_URL - Should match HAWK_API_URL
  • VITE_OIDC_ISSUER
  • VITE_OIDC_CLIENT_ID
  • VITE_OIDC_TOKEN_PATH

Running Eval Sets

hawk eval-set examples/simple.eval-set.yaml

The Eval Set Config File

The eval set config file is a YAML file that defines a grid of tasks, solvers/agents, and models to evaluate.

See examples/simple.eval-set.yaml for a minimal working example.

Required Fields

tasks:
  - package: git+https://github.com/UKGovernmentBEIS/inspect_evals
    name: inspect_evals
    items:
      - name: mbpp
        sample_ids: [1, 2, 3]  # Optional: test specific samples

models:
  - package: openai
    name: openai
    items:
      - name: gpt-4o-mini

Optional Fields

Agents/Solvers (agents is the newer name for solvers):

agents:
  - package: git+https://github.com/METR/inspect-agents
    name: metr_agents
    items:
      - name: react
        args:
          max_attempts: 3

Runner Configuration:

runner:
  secrets:
    - name: DATASET_ACCESS_KEY
      description: API key for dataset access
  environment:
    FOO_BAR: custom_value

packages:
  - git+https://github.com/some-package  # Additional packages to install

Inspect AI Parameters (passed to inspect.eval_set):

  • eval_set_id: Custom ID (generated if not specified)
  • limit: Maximum samples to evaluate
  • time_limit: Per-sample time limit in seconds
  • message_limit: Maximum messages per sample
  • epochs: Number of evaluation epochs
  • metadata: Custom metadata dictionary
  • tags: List of tags for organization
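
As a sketch, these parameters sit at the top level of the eval set config alongside tasks and models (values are illustrative):

limit: 50            # Evaluate at most 50 samples
time_limit: 600      # 10-minute per-sample limit
message_limit: 100   # Cap messages per sample
epochs: 2            # Run every sample twice
metadata:
  experiment: baseline
tags:
  - smoke-test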

For the complete schema, see hawk/core/types/evals.py or the Inspect AI documentation.

Passing Environment Variables and Secrets

Use --secret or --secrets-file to pass environment variables to your evaluation:

# Single variable
hawk eval-set config.yaml --secret MY_API_KEY

# From file
hawk eval-set config.yaml --secrets-file .env

# Multiple files
hawk eval-set config.yaml --secrets-file .env --secrets-file .secrets.local

Secrets file format:

# .secrets
DATASET_API_KEY=your_key_here
CUSTOM_MODEL_KEY=another_key

API Keys: By default, Hawk uses a managed LLM proxy for OpenAI, Anthropic, and Google Vertex models. For other providers, pass API keys as secrets. You can override the proxy by setting OPENAI_API_KEY, ANTHROPIC_API_KEY, or VERTEX_API_KEY as secrets.
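
For example, to bypass the managed proxy and use your own Anthropic key (the value is read from your environment, as with any --secret):

hawk eval-set config.yaml --secret ANTHROPIC_API_KEY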

Required Secrets: Declare required secrets in your config using runner.secrets to prevent jobs from starting with missing credentials.
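
Tying the two together, a sketch with an illustrative secret name: the config declares the requirement, and the CLI supplies the value.

runner:
  secrets:
    - name: DATASET_API_KEY
      description: API key for dataset access

hawk eval-set config.yaml --secret DATASET_API_KEY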

Running Scans

hawk scan examples/simple.scan.yaml

The Scan Config File

Like the eval set config, the scan config is a YAML file that defines a scan run.

name: my-scan # An optional pretty name for the scan run

scanners:
  - package: git+https://github.com/METR/inspect-agents
    name: metr_agents
    items:
      - name: reward_hacking_scanner

models:
  - package: openai
    name: openai
    items:
      - name: gpt-5

packages:
  # Any other packages to install in the venv where the job will run
  - git+https://github.com/DanielPolatajko/inspect_wandb[weave]

transcripts:
  sources:
    - eval_set_id: inspect-eval-set-s6m74hwcd7jag1gl
  filter:
    where:
      - eval_status: success
    limit: 10
    shuffle: true

# Optional fields:
# metadata: dict[str, Any] | null
# tags: list[str] | null

You can specify scanners[].items[].key to assign unique keys to different instances of the same scanner, e.g. to run it with different arguments.
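
For example, two instances of the same scanner under distinct keys (the args shown are illustrative):

scanners:
  - package: git+https://github.com/METR/inspect-agents
    name: metr_agents
    items:
      - name: reward_hacking_scanner
        key: strict
        args:
          threshold: 0.9
      - name: reward_hacking_scanner
        key: lenient
        args:
          threshold: 0.5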

Transcript Filtering

Scans analyze transcripts (execution logs) from previous evaluations. Use filters to select specific samples.

Common filter examples:

transcripts:
  sources:
    - eval_set_id: inspect-eval-set-abc123
  filter:
    where:
      - eval_status: success           # Only successful runs
      - score: {gt: 0.5}               # Score above 0.5
      - model: {like: "gpt-4%"}        # GPT-4 models only
      - metadata.agent.name: react     # Nested metadata access
    limit: 100
    shuffle: true

Available filter operators:

  • Equality: field: value or field: [val1, val2] (IN list)
  • Comparison: {gt: 0.5}, {ge: 0.5}, {lt: 1.0}, {le: 1.0}, {between: [0.5, 1.0]}
  • Pattern matching: {like: "pattern"}, {ilike: "PATTERN"} (case-insensitive)
  • Logical: {not: condition}, {or: [cond1, cond2]}
  • Null checks: field: null
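
Operators can also be combined; a sketch assuming conditions nest as listed above (see hawk/core/types/scans.py for the authoritative syntax):

filter:
  where:
    - or:
        - score: {ge: 0.8}
        - eval_status: error
    - not: {model: {like: "gpt-3%"}}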

Per-scanner filters: Use scanners[].items[].filter to override the global filter for specific scanners.
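
A per-scanner override might look like this (scanner and package names are illustrative):

scanners:
  - package: git+https://github.com/METR/inspect-agents
    name: metr_agents
    items:
      - name: reward_hacking_scanner
        filter:            # Overrides the global transcripts.filter for this scanner
          where:
            - eval_status: error
          limit: 20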

For the complete filter syntax, see hawk/core/types/scans.py.

Deployment

This repository provides a Terraform module for deploying Hawk to AWS. The infrastructure includes:

  • ECS Fargate for the FastAPI server
  • EKS for running evaluation jobs
  • Aurora PostgreSQL for the data warehouse
  • Lambda functions for log processing and access control
  • S3 for log storage

To deploy Hawk, reference the terraform/ directory as a module in your infrastructure Terraform project and deploy through your infrastructure pipeline (e.g., Spacelift).

See CONTRIBUTING.md for instructions on updating Inspect AI/Scout versions and running smoke tests.

CLI Reference

Authentication

hawk login                    # Log in via OAuth2 Device Authorization flow
hawk auth access-token        # Print valid access token to stdout
hawk auth refresh-token       # Print current refresh token
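
Both token commands print to stdout, so they compose with shell substitution:

HAWK_TOKEN="$(hawk auth access-token)"   # Capture a valid token for use with other tooling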

Running Eval Sets

hawk eval-set CONFIG.yaml [OPTIONS]

Run an Inspect eval set remotely. The config file contains a grid of tasks, solvers, and models.

Options:

Option                 Description
--image-tag TEXT       Specify runner image tag
--secrets-file FILE    Load environment variables from a secrets file (can be repeated)
--secret TEXT          Pass an environment variable as a secret (can be repeated)
--skip-confirm         Skip the confirmation prompt for unknown config warnings
--log-dir-allow-dirty  Allow unrelated eval logs in the log directory

Example:

hawk eval-set examples/simple.eval-set.yaml --secret OPENAI_API_KEY

Running Scout Scans

hawk scan CONFIG.yaml [OPTIONS]

Run a Scout scan remotely. The config file contains a matrix of scanners and models.

Options:

Option               Description
--image-tag TEXT     Specify runner image tag
--secrets-file FILE  Load environment variables from a secrets file (can be repeated)
--secret TEXT        Pass an environment variable as a secret (can be repeated)
--skip-confirm       Skip the confirmation prompt for unknown config warnings

Example:

hawk scan examples/simple.scan.yaml

Resource Management

hawk delete [EVAL_SET_ID]     # Delete eval set and clean up resources (not logs)
hawk web [EVAL_SET_ID]        # Open eval set in web browser
hawk view-sample SAMPLE_UUID  # Open specific sample in web browser

If EVAL_SET_ID is not provided, these commands use the eval set ID from the most recent run in the current session.
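
A typical session relying on the remembered ID:

hawk eval-set examples/simple.eval-set.yaml   # Submits and records the eval set ID
hawk web                                      # Opens that eval set in the browser
hawk delete                                   # Cleans up its resources (logs are kept)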

Sample Editing

hawk edit-samples EDITS_FILE

Submit sample edits to the Hawk API. Accepts JSON or JSONL files.

JSON format:

[
  {"sample_uuid": "...", "details": {"type": "score_edit", ...}},
  {"sample_uuid": "...", "details": {"type": "invalidate_sample", ...}}
]

JSONL format:

{"sample_uuid": "...", "details": {"type": "score_edit", ...}}
{"sample_uuid": "...", "details": {"type": "invalidate_sample", ...}}