Defining Tasks
Tasks are instances of scenarios with specific arguments. Define them in a `tasks.py` file using `scenario.task()`:
`slug` is a stable identifier used for syncing and filtering.
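The shape of this API can be sketched with stand-in classes (hypothetical — the real `Scenario` and `Task` types come from the HUD SDK and are not reproduced here; only the calling pattern is illustrated):

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    slug: str    # stable identifier used for syncing and filtering
    prompt: str
    args: dict = field(default_factory=dict)

@dataclass
class Scenario:
    name: str

    def task(self, slug: str, prompt: str, **args) -> Task:
        # Bind concrete arguments to this scenario, producing a task instance
        return Task(slug=slug, prompt=prompt, args=args)

# In tasks.py you would call the real scenario's .task() the same way:
checkout = Scenario("checkout")
task = checkout.task(
    slug="checkout-cheapest",
    prompt="Buy the cheapest laptop in the catalog",
    budget=500,
)
print(task.slug)  # checkout-cheapest
```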
Running Locally
Three modes, depending on what you need:

Pure Python — hud eval
For environments that only need Python (no system deps, no databases, no browsers). Everything runs in-process:
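As a sketch, running a tasks file in-process looks like this (the command form is taken from the note later in this section; any additional flags are not shown):

```shell
# Run every task defined in tasks.py in-process — no container involved
hud eval tasks.py
```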
Interactive — hud dev
For fast iteration on a single scenario. Spawn your environment as an MCP server and connect from Cursor, Claude Code, or any MCP client:
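A minimal invocation (the task argument matches the iteration checklist later on this page; other options are assumed to exist but are not shown):

```shell
# Serve the environment as a hot-reloading MCP server for a single scenario
hud dev my_task
```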
Docker — hud build + connect_image
For environments that need system dependencies (PostgreSQL, browsers, VNC, GPU libraries). Build the image, then connect to it from a test script:
`connect_image` spins up the container, connects via MCP, and tears it down when done. Your tools run inside the container where the system deps live; your test script runs outside.
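The lifecycle described above can be illustrated with a stand-in context manager (hypothetical — the real `connect_image` comes from the HUD SDK; this sketch only mirrors its spin-up/tear-down behavior):

```python
import contextlib

@contextlib.contextmanager
def connect_image(image: str):
    # Stand-in for the real connect_image: start the container, hand an MCP
    # session to the test script, and always tear the container down on exit.
    print(f"starting container from {image}")
    try:
        yield f"mcp-session:{image}"  # your tools run inside the container
    finally:
        print("tearing down container")  # runs even if the test script raises

# The test script itself runs outside the container:
with connect_image("my-env:latest") as session:  # image tag is illustrative
    print(f"running tasks over {session}")
```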
Note: `hud eval tasks.py` imports your `env.py` directly (in-process). For Docker environments, write a separate test script that uses `connect_image` as shown above, or use `hud dev` for interactive Docker development.
When to Use What
| Mode | System deps? | Speed | Use case |
|---|---|---|---|
| `hud eval` / `task.run()` | No | Fastest | Pure Python envs, rapid iteration |
| `hud dev` | Optional | Fast (hot-reload) | Interactive development, single scenario |
| `hud build` + `connect_image` | Yes | Slower (container) | Databases, browsers, GPU, full integration |
Deploying
Deploy your environment to the platform:

`hud sync` can link your local tasks to a platform taskset.
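The deploy step itself is a single command (a sketch; additional flags may exist):

```shell
# Deploy the environment to the platform
hud deploy
```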
See Deploy for full details on hud deploy and GitHub auto-deploy.
Syncing to Platform
Once deployed, sync your local tasks to a platform taskset:

`hud sync` diffs local tasks against the platform by slug. It creates new tasks, updates changed ones, and reports tasks that exist remotely but not locally (without deleting them).
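As a sketch (the `--task` flag appears in the iteration table on this page; syncing everything by omitting it is an assumption):

```shell
# Sync all local tasks to the platform taskset
hud sync tasks

# Sync a single task by slug
hud sync tasks --task <slug>
```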
Managing on Platform
Once synced, your taskset is live on hud.ai/evalsets. You can also create and edit tasks through the platform UI — useful for large-scale management, team collaboration, or one-off additions.
The platform UI and `hud sync` work together. Edit locally and sync up, or edit on the platform — both are valid. See Platform Tasksets for the full guide.
Running at Scale
Run your taskset remotely on HUD infrastructure:

See the `hud eval` CLI reference for all options.
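The remote run command, as it appears in the iteration checklist on this page:

```shell
# Run the taskset remotely with a chosen agent, filtered to one task
hud eval "My Taskset" claude --remote --task-ids my_task
```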
Iterating
Once deployed and synced, use this table to decide what to re-run:

| What you changed | Deploy? | Sync? | Notes |
|---|---|---|---|
| Task prompt or args | No | Yes | `hud sync tasks --task <slug>` |
| Added/removed a task | No | Yes | Existing remote tasks are never deleted |
| Grading logic, tools, `env.py` | Yes | No | Test locally first with `hud dev` |
| Dockerfile or system deps | Yes | No | Use `--no-cache` if layer caching causes stale state |
A typical loop:

- Edit scenario/grading locally
- `hud build && hud dev my_task` — verify it works
- Deploy only if env code changed: `hud deploy`
- Sync only if tasks changed: `hud sync tasks --task <slug>`
- Run remotely: `hud eval "My Taskset" claude --remote --task-ids my_task`
- Check the trace, repeat
Debugging
Build your image locally first, then debug.

Debugging Zero Scores
When a task scores 0.0 remotely but works locally:

- Trace — Did the agent attempt the task or get stuck? If it never acted, the issue is the prompt or model, not grading.
- Logs — Check the LOGS and DEBUG tabs for container errors, missing deps, or grader failures.
- Grade locally — If it passes locally but fails remotely, the deployed environment diverged. Verify your latest build version matches `.hud/deploy.json`.
See Also
- Best Practices: patterns for good environments, evals, and grading
- Deploy: deploy your environment to the platform
- Platform Tasksets: full taskset management guide