Define tasks in code, test locally, deploy, sync to the platform, run at scale.

Defining Tasks

Tasks are instances of scenarios with specific arguments. Define them in a tasks.py file using scenario.task():
from env import count

strawberry_r = count.task(text="strawberry", letter="r")
strawberry_r.slug = "strawberry-r"

mississippi_s = count.task(text="mississippi", letter="s")
mississippi_s.slug = "mississippi-s"

abracadabra_a = count.task(text="abracadabra", letter="a")
abracadabra_a.slug = "abracadabra-a"
Each task binds a scenario to specific arguments. The slug is a stable identifier used for syncing and filtering.
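Conceptually, binding a scenario to arguments works like partial application of a function. A minimal stdlib-only sketch of the idea — the `count_letters` function here is illustrative, not part of the HUD SDK:

```python
from functools import partial

# An illustrative "scenario": how many times does a letter appear in text?
def count_letters(text: str, letter: str) -> int:
    return text.count(letter)

# A "task" binds the scenario to specific arguments, like count.task(...) above.
strawberry_r = partial(count_letters, text="strawberry", letter="r")

print(strawberry_r())  # "strawberry" contains 3 r's
```

The slug plays the role the variable name plays here: a stable handle for referring to one specific binding.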

Running Locally

Three modes depending on what you need:

Pure Python — hud eval

For environments that only need Python (no system deps, no databases, no browsers). Everything runs in-process:
# Run first task
hud eval tasks.py

# Run all tasks
hud eval tasks.py --full

# Run specific tasks by slug
hud eval tasks.py --task-ids strawberry-r,mississippi-s

# Run with a specific model
hud eval tasks.py claude --full
Or run a single task programmatically:
result = await env("count", text="strawberry", letter="r").run("claude-sonnet-4-5")
print(f"Reward: {result.reward}")
This is the fastest iteration loop — no Docker, no server, just Python.
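For a counting scenario like this, the reward typically comes from comparing the agent's answer against the true count. A hedged, stdlib-only sketch of such a grader — `grade` is a hypothetical helper for illustration, not HUD's actual grading API:

```python
def grade(answer: str, text: str, letter: str) -> float:
    """Return 1.0 if the agent's answer matches the true letter count, else 0.0."""
    expected = text.count(letter)
    try:
        return 1.0 if int(answer.strip()) == expected else 0.0
    except ValueError:
        return 0.0  # non-numeric answers score zero

print(grade("3", "strawberry", "r"))  # correct: 3 r's -> 1.0
print(grade("2", "strawberry", "r"))  # wrong count -> 0.0
```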

Interactive — hud dev

For fast iteration on a single scenario. Spawn your environment as an MCP server and connect from Cursor, Claude Code, or any MCP client:
hud dev env:env -w env.py -w tools/
Then in Cursor’s MCP settings:
{
  "my-dev-env": { "url": "http://localhost:8765/mcp" }
}
Your coding agent can call your tools directly. Edit, save, the server hot-reloads. Great for developing and debugging individual scenarios interactively.

Docker — hud build + connect_image

For environments that need system dependencies (PostgreSQL, browsers, VNC, GPU libraries). Build the image, then connect to it from a test script:
hud build .
from hud import Environment

env = Environment("my-env")
env.connect_image("my-env:latest")

result = await env("checkout", product="laptop").run("claude-sonnet-4-5")
connect_image spins up the container, connects via MCP, and tears it down when done. Your tools run inside the container where the system deps live; your test script runs outside.

Note: hud eval tasks.py imports your env.py directly (in-process). For Docker environments, write a separate test script that uses connect_image as shown above, or use hud dev for interactive Docker development:
hud dev env:env -w env.py    # builds + runs container, hot-reloads Python

When to Use What

| Mode | System deps? | Speed | Use case |
| --- | --- | --- | --- |
| hud eval / task.run() | No | Fastest | Pure Python envs, rapid iteration |
| hud dev | Optional | Fast (hot-reload) | Interactive development, single scenario |
| hud build + connect_image | Yes | Slower (container) | Databases, browsers, GPU, full integration |

Deploying

You must deploy your environment before syncing tasks or running remotely.
Deploy your environment to the platform:
hud deploy
This builds your Docker image, uploads it, and registers the environment. Your scenarios become available for task creation, and hud sync can link your local tasks to a platform taskset. See Deploy for full details on hud deploy and GitHub auto-deploy.

Syncing to Platform

Once deployed, sync your local tasks to a platform taskset:
# First sync — creates the taskset
hud sync tasks my-taskset

# Re-sync after changes — shows a diff
hud sync tasks my-taskset
#   create: add-negative (new)
#   update: add-simple (args changed)
#   remote-only: old-task (exists on platform but not locally)

# After first sync, bare `hud sync` reuses stored config
hud sync
hud sync diffs local tasks against the platform by slug. It creates new tasks, updates changed ones, and reports tasks that exist remotely but not locally (without deleting them).
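The diff semantics described above can be sketched in a few lines: compare local and remote tasks keyed by slug. This is an illustrative model of the behavior, not hud's implementation; the slugs and args are made up:

```python
def sync_diff(local: dict, remote: dict) -> dict:
    """Diff local tasks against remote ones by slug.

    local/remote map slug -> args. Remote-only tasks are reported, never deleted.
    """
    return {
        "create": sorted(local.keys() - remote.keys()),
        "update": sorted(s for s in local.keys() & remote.keys() if local[s] != remote[s]),
        "remote_only": sorted(remote.keys() - local.keys()),
    }

local = {"add-simple": {"a": 1, "b": 3}, "add-negative": {"a": -1, "b": 2}}
remote = {"add-simple": {"a": 1, "b": 2}, "old-task": {"x": 0}}
print(sync_diff(local, remote))
# {'create': ['add-negative'], 'update': ['add-simple'], 'remote_only': ['old-task']}
```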

Managing on Platform

Once synced, your taskset is live on hud.ai/evalsets. You can also create and edit tasks through the platform UI — useful for large-scale management, team collaboration, or one-off additions.
(Screenshot: creating tasks from scenarios on the platform.)
The platform UI and hud sync work together. Edit locally and sync up, or edit on the platform — both are valid. See Platform Tasksets for the full guide.

Running at Scale

Run your taskset remotely on HUD infrastructure:
# Run by taskset name
hud eval my-taskset claude --remote --full

# Multiple repeats for variance
hud eval my-taskset claude --remote --full --group-size 5
Or click Run Taskset on the platform UI — select models, group size, max steps, and launch. Both the agent and environment run remotely. No local compute, no local Docker. Results show up in real-time on the taskset’s Leaderboard tab. Monitor at hud.ai/jobs. See hud eval CLI reference for all options.
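Why repeats? Agent runs are stochastic, so a single score can mislead; averaging rewards across a group gives a more stable estimate. A stdlib sketch of aggregating hypothetical rewards from a --group-size 5 run (the reward values are made up for illustration):

```python
from statistics import mean, stdev

# Hypothetical rewards from five repeats of the same task.
rewards = [1.0, 0.0, 1.0, 1.0, 0.0]

print(f"mean reward: {mean(rewards):.2f}")  # 0.60
print(f"stdev:       {stdev(rewards):.2f}")
```

A high stdev relative to the mean suggests the task itself is flaky or underspecified, not just that the model is weak.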

Iterating

Once deployed and synced, use this to decide what to re-run:
| What you changed | Deploy? | Sync? | Notes |
| --- | --- | --- | --- |
| Task prompt or args | No | Yes | hud sync tasks --task <slug> |
| Added/removed a task | No | Yes | Existing remote tasks are never deleted |
| Grading logic, tools, env.py | Yes | No | Test locally first with hud dev |
| Dockerfile or system deps | Yes | No | Use --no-cache if layer caching causes stale state |
The fastest cycle when developing a single task:
  1. Edit scenario/grading locally
  2. hud build && hud dev env:env — verify it works
  3. Deploy only if env code changed: hud deploy
  4. Sync only if tasks changed: hud sync tasks --task <slug>
  5. hud eval "My Taskset" claude --remote --task-ids my_task
  6. Check the trace, repeat

Debugging

Build your image locally first, then debug:
hud build .
hud debug my-env:latest
Shows exactly which phase failed:
✓ Phase 1: Docker image exists
✓ Phase 2: MCP server responds to initialize
✗ Phase 3: Tool discovery failed
  → Error: Connection refused on port 8005
hud debug .                    # Build and debug current directory
hud debug . --max-phase 3      # Stop after phase 3

Debugging Zero Scores

When a task scores 0.0 remotely but works locally:
  1. Trace — Did the agent attempt the task or get stuck? If it never acted, the issue is the prompt or model, not grading.
  2. Logs — Check the LOGS and DEBUG tabs for container errors, missing deps, or grader failures.
  3. Grade locally — If it passes locally but fails remotely, the deployed environment diverged. Verify your latest build version matches .hud/deploy.json.

See Also

Best Practices

Patterns for good environments, evals, and grading

Deploy

Deploy your environment to the platform

Platform Tasksets

Full taskset management guide