Hawk - Inspect AI Infrastructure

Hawk is an infrastructure system for running Inspect AI evaluations and Scout scans in Kubernetes. It provides:

  • A hawk CLI tool for submitting evaluation and scan configurations
  • A FastAPI server that orchestrates Kubernetes jobs using Helm
  • Multiple Lambda functions for log processing, access control, and sample editing
  • A PostgreSQL data warehouse for evaluation results

Prerequisites

Before using Hawk, ensure you have:

  • Python 3.11 or later
  • uv for dependency management
  • Access to a Hawk deployment, which requires:
    • Hawk API server URL
    • Authentication credentials (OAuth2)
  • For deployment: Kubernetes cluster, AWS account, Terraform 1.10+
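
A quick sanity check of the local tooling:

python3 --version   # Must report 3.11 or later
uv --version        # Used for dependency management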

Installation

Install the Hawk CLI:

uv pip install "hawk[cli] @ git+https://github.com/METR/inspect-action"

Or install from source:

git clone https://github.com/METR/inspect-action.git
cd inspect-action
uv pip install -e .[cli]

Quick Start

1. Authenticate

First, log in to your Hawk server:

hawk login

This will open a browser for OAuth2 authentication.

2. Run Your First Evaluation

Create a simple eval config file or use an example:

hawk eval-set examples/simple.eval-set.yaml

3. View Results

Open the evaluation in your browser:

hawk web

Or view logs and results in the configured log viewer.

Configuration

Required Environment Variables

Set these before using the Hawk CLI:

Variable              Required  Description                  Example
HAWK_API_URL          Yes       URL of your Hawk API server  https://hawk.example.com
INSPECT_LOG_ROOT_DIR  Yes       S3 bucket for eval logs      s3://my-bucket/evals
LOG_VIEWER_BASE_URL   No        URL for web log viewer       https://viewer.example.com

You can set these in a .env file in your project directory or export them in your shell:

export HAWK_API_URL=https://hawk.example.com
export INSPECT_LOG_ROOT_DIR=s3://my-bucket/evals
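
The same values in .env form (example values, not real endpoints):

# .env
HAWK_API_URL=https://hawk.example.com
INSPECT_LOG_ROOT_DIR=s3://my-bucket/evals
LOG_VIEWER_BASE_URL=https://viewer.example.com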

Authentication Variables

For API server and CLI authentication:

  • INSPECT_ACTION_API_MODEL_ACCESS_TOKEN_AUDIENCE
  • INSPECT_ACTION_API_MODEL_ACCESS_TOKEN_ISSUER
  • INSPECT_ACTION_API_MODEL_ACCESS_TOKEN_JWKS_PATH

For log viewer authentication (can be different):

  • VITE_API_BASE_URL - Should match HAWK_API_URL
  • VITE_OIDC_ISSUER
  • VITE_OIDC_CLIENT_ID
  • VITE_OIDC_TOKEN_PATH

Running Eval Sets

hawk eval-set examples/simple.eval-set.yaml

The Eval Set Config File

The eval set config file is a YAML file that defines a grid of tasks, solvers/agents, and models to evaluate.

See examples/simple.eval-set.yaml for a minimal working example.

Required Fields

tasks:
  - package: git+https://github.com/UKGovernmentBEIS/inspect_evals
    name: inspect_evals
    items:
      - name: mbpp
        sample_ids: [1, 2, 3]  # Optional: test specific samples

models:
  - package: openai
    name: openai
    items:
      - name: gpt-4o-mini

Optional Fields

Agents/Solvers (agents is the newer name for solvers):

agents:
  - package: git+https://github.com/METR/inspect-agents
    name: metr_agents
    items:
      - name: react
        args:
          max_attempts: 3

Runner Configuration:

runner:
  secrets:
    - name: DATASET_ACCESS_KEY
      description: API key for dataset access
  environment:
    FOO_BAR: custom_value

packages:
  - git+https://github.com/some-package  # Additional packages to install

Inspect AI Parameters (passed to inspect.eval_set):

  • eval_set_id: Custom ID (generated if not specified)
  • limit: Maximum samples to evaluate
  • time_limit: Per-sample time limit in seconds
  • message_limit: Maximum messages per sample
  • epochs: Number of evaluation epochs
  • metadata: Custom metadata dictionary
  • tags: List of tags for organization
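
As a sketch, these parameters sit at the top level of the eval set config alongside tasks and models (values are illustrative):

limit: 50            # Evaluate at most 50 samples
time_limit: 600      # 10-minute per-sample limit
message_limit: 100   # Cap messages per sample
epochs: 2            # Run every sample twice
metadata:
  experiment: baseline
tags:
  - smoke-test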

For the complete schema, see hawk/core/types/evals.py or the Inspect AI documentation.

Passing Environment Variables and Secrets

Use --secret or --secrets-file to pass environment variables to your evaluation:

# Single variable
hawk eval-set config.yaml --secret MY_API_KEY

# From file
hawk eval-set config.yaml --secrets-file .env

# Multiple files
hawk eval-set config.yaml --secrets-file .env --secrets-file .secrets.local

Secrets file format:

# .secrets
DATASET_API_KEY=your_key_here
CUSTOM_MODEL_KEY=another_key

API Keys: By default, Hawk uses a managed LLM proxy for OpenAI, Anthropic, and Google Vertex models. For other providers, pass API keys as secrets. You can override the proxy by setting OPENAI_API_KEY, ANTHROPIC_API_KEY, or VERTEX_API_KEY as secrets.
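
For example, to bypass the managed proxy and use your own Anthropic key (the value is read from your environment, as with any --secret):

hawk eval-set config.yaml --secret ANTHROPIC_API_KEY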

Required Secrets: Declare required secrets in your config using runner.secrets to prevent jobs from starting with missing credentials.
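
Tying the two together, a sketch with an illustrative secret name: the config declares the requirement, and the CLI supplies the value.

runner:
  secrets:
    - name: DATASET_API_KEY
      description: API key for dataset access

hawk eval-set config.yaml --secret DATASET_API_KEY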

Running Scans

hawk scan examples/simple.scan.yaml

The Scan Config File

Like the eval set config, the scan config is a YAML file that defines a scan run.

name: my-scan # An optional pretty name for the scan run

scanners:
  - package: git+https://github.com/METR/inspect-agents
    name: metr_agents
    items:
      - name: reward_hacking_scanner

models:
  - package: openai
    name: openai
    items:
      - name: gpt-5

packages:
  # Any other packages to install in the venv where the job will run
  - git+https://github.com/DanielPolatajko/inspect_wandb[weave]

transcripts:
  sources:
    - eval_set_id: inspect-eval-set-s6m74hwcd7jag1gl
  filter:
    where:
      - eval_status: success
    limit: 10
    shuffle: true

# Optional fields:
# metadata: dict[str, Any] | null
# tags: list[str] | null

You can specify scanners[].items[].key to assign unique keys to different instances of the same scanner, e.g. to run it with different arguments.
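
For example, two instances of the same scanner under distinct keys (the args shown are illustrative):

scanners:
  - package: git+https://github.com/METR/inspect-agents
    name: metr_agents
    items:
      - name: reward_hacking_scanner
        key: strict
        args:
          threshold: 0.9
      - name: reward_hacking_scanner
        key: lenient
        args:
          threshold: 0.5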

Transcript Filtering

Scans analyze transcripts (execution logs) from previous evaluations. Use filters to select specific samples.

Common filter examples:

transcripts:
  sources:
    - eval_set_id: inspect-eval-set-abc123
  filter:
    where:
      - eval_status: success           # Only successful runs
      - score: {gt: 0.5}               # Score above 0.5
      - model: {like: "gpt-4%"}        # GPT-4 models only
      - metadata.agent.name: react     # Nested metadata access
    limit: 100
    shuffle: true

Available filter operators:

  • Equality: field: value or field: [val1, val2] (IN list)
  • Comparison: {gt: 0.5}, {ge: 0.5}, {lt: 1.0}, {le: 1.0}, {between: [0.5, 1.0]}
  • Pattern matching: {like: "pattern"}, {ilike: "PATTERN"} (case-insensitive)
  • Logical: {not: condition}, {or: [cond1, cond2]}
  • Null checks: field: null
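
Operators can also be combined; a sketch assuming conditions nest as listed above (see hawk/core/types/scans.py for the authoritative syntax):

filter:
  where:
    - or:
        - score: {ge: 0.8}
        - eval_status: error
    - not: {model: {like: "gpt-3%"}}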

Per-scanner filters: Use scanners[].items[].filter to override the global filter for specific scanners.
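
A per-scanner override might look like this (scanner and package names are illustrative):

scanners:
  - package: git+https://github.com/METR/inspect-agents
    name: metr_agents
    items:
      - name: reward_hacking_scanner
        filter:            # Overrides the global transcripts.filter for this scanner
          where:
            - eval_status: error
          limit: 20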

For the complete filter syntax, see hawk/core/types/scans.py.

Deployment

This repository provides a Terraform module for deploying Hawk to AWS. The infrastructure includes:

  • ECS Fargate for the FastAPI server
  • EKS for running evaluation jobs
  • Aurora PostgreSQL for the data warehouse
  • Lambda functions for log processing and access control
  • S3 for log storage

To deploy Hawk, reference the terraform/ directory as a module in your infrastructure Terraform project and deploy through your infrastructure pipeline (e.g., Spacelift).

See CONTRIBUTING.md for instructions on updating Inspect AI/Scout versions and running smoke tests.

CLI Reference

Authentication

hawk login                    # Log in via OAuth2 Device Authorization flow
hawk auth access-token        # Print valid access token to stdout
hawk auth refresh-token       # Print current refresh token
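
Both token commands print to stdout, so they compose with shell substitution:

HAWK_TOKEN="$(hawk auth access-token)"   # Capture a valid token for use with other tooling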

Running Eval Sets

hawk eval-set CONFIG.yaml [OPTIONS]

Run an Inspect eval set remotely. The config file contains a grid of tasks, solvers, and models.

Options:

Option                 Description
--image-tag TEXT       Specify runner image tag
--secrets-file FILE    Load environment variables from a secrets file (can be repeated)
--secret TEXT          Pass an environment variable as a secret (can be repeated)
--skip-confirm         Skip the confirmation prompt for unknown config warnings
--log-dir-allow-dirty  Allow unrelated eval logs in the log directory

Example:

hawk eval-set examples/simple.eval-set.yaml --secret OPENAI_API_KEY

Running Scout Scans

hawk scan CONFIG.yaml [OPTIONS]

Run a Scout scan remotely. The config file contains a matrix of scanners and models.

Options:

Option               Description
--image-tag TEXT     Specify runner image tag
--secrets-file FILE  Load environment variables from a secrets file (can be repeated)
--secret TEXT        Pass an environment variable as a secret (can be repeated)
--skip-confirm       Skip the confirmation prompt for unknown config warnings

Example:

hawk scan examples/simple.scan.yaml

Resource Management

hawk delete [EVAL_SET_ID]     # Delete eval set and clean up resources (not logs)
hawk web [EVAL_SET_ID]        # Open eval set in web browser
hawk view-sample SAMPLE_UUID  # Open specific sample in web browser

If EVAL_SET_ID is not provided, these commands use the eval set ID from the most recent run in the current session.
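
A typical session relying on the remembered ID:

hawk eval-set examples/simple.eval-set.yaml   # Submits and records the eval set ID
hawk web                                      # Opens that eval set in the browser
hawk delete                                   # Cleans up its resources (logs are kept)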

Sample Editing

hawk edit-samples EDITS_FILE

Submit sample edits to the Hawk API. Accepts JSON or JSONL files.

JSON format:

[
  {"sample_uuid": "...", "details": {"type": "score_edit", ...}},
  {"sample_uuid": "...", "details": {"type": "invalidate_sample", ...}}
]

JSONL format:

{"sample_uuid": "...", "details": {"type": "score_edit", ...}}
{"sample_uuid": "...", "details": {"type": "invalidate_sample", ...}}