Define tasks in code, test locally, deploy, sync to the platform, run at scale.

Defining Tasks

Tasks are instances of scenarios with specific arguments. Define them in a tasks.py file using scenario.task():
from env import count

strawberry_r = count.task(text="strawberry", letter="r")
strawberry_r.slug = "strawberry-r"

mississippi_s = count.task(text="mississippi", letter="s")
mississippi_s.slug = "mississippi-s"

abracadabra_a = count.task(text="abracadabra", letter="a")
abracadabra_a.slug = "abracadabra-a"
Each task binds a scenario to specific arguments. The slug is a stable identifier used for syncing and filtering.
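Conceptually, binding a scenario to arguments works like partial application of a function. A minimal stdlib-only sketch of the idea — the `count_letters` function here is illustrative, not part of the HUD SDK:

```python
from functools import partial

# An illustrative "scenario": how many times does a letter appear in text?
def count_letters(text: str, letter: str) -> int:
    return text.count(letter)

# A "task" binds the scenario to specific arguments, like count.task(...) above.
strawberry_r = partial(count_letters, text="strawberry", letter="r")

print(strawberry_r())  # "strawberry" contains 3 r's
```

The slug plays the role the variable name plays here: a stable handle for referring to one specific binding.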

Running Locally

Three modes depending on what you need:

Pure Python — hud eval

For environments that only need Python (no system deps, no databases, no browsers). Everything runs in-process:
# Run first task
hud eval tasks.py

# Run all tasks
hud eval tasks.py --full

# Run specific tasks by slug
hud eval tasks.py --task-ids strawberry-r,mississippi-s

# Run with a specific model
hud eval tasks.py claude --full
Or run a single task programmatically:
result = await env("count", text="strawberry", letter="r").run("claude-sonnet-4-5")
print(f"Reward: {result.reward}")
This is the fastest iteration loop — no Docker, no server, just Python.
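For a counting scenario like this, the reward typically comes from comparing the agent's answer against the true count. A hedged, stdlib-only sketch of such a grader — `grade` is a hypothetical helper for illustration, not HUD's actual grading API:

```python
def grade(answer: str, text: str, letter: str) -> float:
    """Return 1.0 if the agent's answer matches the true letter count, else 0.0."""
    expected = text.count(letter)
    try:
        return 1.0 if int(answer.strip()) == expected else 0.0
    except ValueError:
        return 0.0  # non-numeric answers score zero

print(grade("3", "strawberry", "r"))  # correct: 3 r's -> 1.0
print(grade("2", "strawberry", "r"))  # wrong count -> 0.0
```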

Interactive — hud dev

For fast iteration on a single scenario. Spawn your environment as an MCP server and connect from Cursor, Claude Code, or any MCP client:
hud dev env:env -w env.py -w tools/
Then in Cursor’s MCP settings:
{
  "my-dev-env": { "url": "http://localhost:8765/mcp" }
}
Your coding agent can call your tools directly. Edit, save, the server hot-reloads. Great for developing and debugging individual scenarios interactively.

Docker — hud build + connect_image

For environments that need system dependencies (PostgreSQL, browsers, VNC, GPU libraries). Build the image, then connect to it from a test script:
hud build .
from hud import Environment

env = Environment("my-env")
env.connect_image("my-env:latest")

result = await env("checkout", product="laptop").run("claude-sonnet-4-5")
connect_image spins up the container, connects via MCP, and tears it down when done. Your tools run inside the container where the system deps live; your test script runs outside.

Note: hud eval tasks.py imports your env.py directly (in-process). For Docker environments, write a separate test script that uses connect_image as shown above, or use hud dev for interactive Docker development:
hud dev env:env -w env.py    # builds + runs container, hot-reloads Python

When to Use What

| Mode | System deps? | Speed | Use case |
| --- | --- | --- | --- |
| hud eval / task.run() | No | Fastest | Pure Python envs, rapid iteration |
| hud dev | Optional | Fast (hot-reload) | Interactive development, single scenario |
| hud build + connect_image | Yes | Slower (container) | Databases, browsers, GPU, full integration |

Deploying

You must deploy your environment before syncing tasks or running remotely.
Deploy your environment to the platform:
hud deploy
This builds your Docker image, uploads it, and registers the environment. Your scenarios become available for task creation, and hud sync can link your local tasks to a platform taskset. See Deploy for full details on hud deploy and GitHub auto-deploy.

Syncing to Platform

Once deployed, sync your local tasks to a platform taskset:
# First sync — creates the taskset
hud sync tasks my-taskset

# Re-sync after changes — shows a diff
hud sync tasks my-taskset
#   create: add-negative (new)
#   update: add-simple (args changed)
#   remote-only: old-task (exists on platform but not locally)

# After first sync, bare `hud sync` reuses stored config
hud sync
hud sync diffs local tasks against the platform by slug. It creates new tasks, updates changed ones, and reports tasks that exist remotely but not locally (without deleting them).
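The diff semantics described above can be sketched in a few lines: compare local and remote tasks keyed by slug. This is an illustrative model of the behavior, not hud's implementation; the slugs and args are made up:

```python
def sync_diff(local: dict, remote: dict) -> dict:
    """Diff local tasks against remote ones by slug.

    local/remote map slug -> args. Remote-only tasks are reported, never deleted.
    """
    return {
        "create": sorted(local.keys() - remote.keys()),
        "update": sorted(s for s in local.keys() & remote.keys() if local[s] != remote[s]),
        "remote_only": sorted(remote.keys() - local.keys()),
    }

local = {"add-simple": {"a": 1, "b": 3}, "add-negative": {"a": -1, "b": 2}}
remote = {"add-simple": {"a": 1, "b": 2}, "old-task": {"x": 0}}
print(sync_diff(local, remote))
# {'create': ['add-negative'], 'update': ['add-simple'], 'remote_only': ['old-task']}
```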

Managing on Platform

Once synced, your taskset is live on hud.ai/evalsets. You can also create and edit tasks through the platform UI — useful for large-scale management, team collaboration, or one-off additions.
(Screenshot: creating tasks from scenarios on the platform.)
The platform UI and hud sync work together. Edit locally and sync up, or edit on the platform — both are valid. See Platform Tasksets for the full guide.

Running at Scale

Run your taskset remotely on HUD infrastructure:
# Run by taskset name
hud eval my-taskset claude --remote --full

# Multiple repeats for variance
hud eval my-taskset claude --remote --full --group-size 5
Or click Run Taskset on the platform UI — select models, group size, max steps, and launch. Both the agent and environment run remotely. No local compute, no local Docker. Results show up in real-time on the taskset’s Leaderboard tab. Monitor at hud.ai/jobs. See hud eval CLI reference for all options.
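Why repeats? Agent runs are stochastic, so a single score can mislead; averaging rewards across a group gives a more stable estimate. A stdlib sketch of aggregating hypothetical rewards from a --group-size 5 run (the reward values are made up for illustration):

```python
from statistics import mean, stdev

# Hypothetical rewards from five repeats of the same task.
rewards = [1.0, 0.0, 1.0, 1.0, 0.0]

print(f"mean reward: {mean(rewards):.2f}")  # 0.60
print(f"stdev:       {stdev(rewards):.2f}")
```

A high stdev relative to the mean suggests the task itself is flaky or underspecified, not just that the model is weak.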

Iterating

Once deployed and synced, use this to decide what to re-run:
| What you changed | Deploy? | Sync? | Notes |
| --- | --- | --- | --- |
| Task prompt or args | No | Yes | hud sync tasks --task <slug> |
| Added/removed a task | No | Yes | Existing remote tasks are never deleted |
| Grading logic, tools, env.py | Yes | No | Test locally first with hud dev |
| Dockerfile or system deps | Yes | No | Use --no-cache if layer caching causes stale state |
The fastest cycle when developing a single task:
  1. Edit scenario/grading locally
  2. hud build && hud dev env:env — verify it works
  3. Deploy only if env code changed: hud deploy
  4. Sync only if tasks changed: hud sync tasks --task <slug>
  5. hud eval "My Taskset" claude --remote --task-ids my_task
  6. Check the trace, repeat

Debugging

Build your image locally first, then debug:
hud build .
hud debug my-env:latest
Shows exactly which phase failed:
✓ Phase 1: Docker image exists
✓ Phase 2: MCP server responds to initialize
✗ Phase 3: Tool discovery failed
  → Error: Connection refused on port 8005
hud debug .                    # Build and debug current directory
hud debug . --max-phase 3      # Stop after phase 3

Debugging Zero Scores

When a task scores 0.0 remotely but works locally:
  1. Trace — Did the agent attempt the task or get stuck? If it never acted, the issue is the prompt or model, not grading.
  2. Logs — Check the LOGS and DEBUG tabs for container errors, missing deps, or grader failures.
  3. Grade locally — If it passes locally but fails remotely, the deployed environment diverged. Verify your latest build version matches .hud/deploy.json.

See Also

Best Practices

Patterns for good environments, evals, and grading

Deploy

Deploy your environment to the platform

Platform Tasksets

Full taskset management guide