How to Build a Desktop AI Agent with Claude Computer Use API

Every legacy system in your company has a GUI. Very few have an API. Until now, that made them automation dead zones. On March 23, 2026, Anthropic launched screen control in Claude Cowork and Claude Code—and while the consumer path gets the headlines, the real story for developers is an API that lets Claude see screenshots, click, type, and navigate any desktop application through a structured agent loop. The underlying Computer Use API beta has been available to developers since October 2024.

This tutorial covers the developer path—the API you can deploy anywhere Docker runs. We’ll walk through the reference implementation, break down how the agent loop works, fix the coordinate scaling bug that will break your first project, and lock down security so you aren’t handing an AI unsupervised desktop access.

What the Claude Computer Use API Actually Does

The API doesn’t give Claude a browser extension. It gives Claude a pair of eyes and hands. You send a screenshot, Claude analyzes it and returns a structured action—click at coordinates (450, 320), type “quarterly report,” press Enter—and your application executes that action in a sandboxed environment. Then you send a new screenshot, and the cycle repeats.

This is the API-bypass angle that matters: every legacy GUI is now programmable without custom integrations or RPA licensing fees. A 20-step task costs $0.08–0.15 in API tokens at Sonnet 4.6 pricing ($3/$15 per million tokens). UiPath’s unattended robot licenses run around $420/month. The math writes itself for any workflow stuck behind a system that never got an API.

Claude follows a three-tier priority order: direct connectors first (Slack, Gmail), browser navigation second, and screen interaction as a last resort. Tool versions matter—computer_20251124 for Opus 4.6, Sonnet 4.6, and Opus 4.5 (adds zoom); computer_20250124 for older models including Sonnet 4.5, Haiku 4.5, and Sonnet 3.7. Get this wrong and the tool silently fails.
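
To guard against that silent-failure mode, it helps to centralize the model-to-tool-version mapping. A minimal sketch; the keys below are display names from the table above, not API model IDs, so map them to your actual model identifiers before use:

```python
# Tool versions by model family, as listed above; keys are display
# names, not API model IDs. Verify against current Anthropic docs.
COMPUTER_TOOL_VERSIONS = {
    "Opus 4.6": "computer_20251124",
    "Sonnet 4.6": "computer_20251124",
    "Opus 4.5": "computer_20251124",
    "Sonnet 4.5": "computer_20250124",
    "Haiku 4.5": "computer_20250124",
    "Sonnet 3.7": "computer_20250124",
}

def computer_tool_type(model_family: str) -> str:
    """Fail loudly instead of letting a wrong version fail silently."""
    try:
        return COMPUTER_TOOL_VERSIONS[model_family]
    except KeyError:
        raise ValueError(f"No computer tool version mapped for {model_family!r}")
```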

For competitive context: Claude Sonnet 4.6 scores 72.5% on OSWorld-Verified, near the 72.4% human baseline. GPT-5.4’s computer use approach leads at 75.0%—but Claude offers full desktop control while OpenAI’s Operator and Google’s Mariner focus primarily on web navigation.

Prerequisites and Quick Start with Docker

You need: Python 3.8+, Docker Desktop, an Anthropic API key with access to Claude Sonnet 4.6 or Opus 4.6, and about 10 minutes. Anthropic’s reference implementation on GitHub ships a complete sandboxed desktop—Xvfb virtual display, Mutter window manager, Firefox, LibreOffice, and a Python agent loop. This is the fastest path from zero to working agent.

export ANTHROPIC_API_KEY=your-key-here

docker run \
    -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY \
    -v $HOME/.anthropic:/home/user/.anthropic \
    -p 5900:5900 \
    -p 8501:8501 \
    -p 6080:6080 \
    -p 8080:8080 \
    -it ghcr.io/anthropics/anthropic-quickstarts:computer-use-demo-latest

Port 5900 is VNC, 6080 is the noVNC web view, 8501 is the Streamlit interface, and 8080 serves the combined view. Navigate to localhost:8080 in your browser. You’ll see a Streamlit chat interface on the left and a live view of the virtual desktop on the right. Type a task—“Open Firefox and search for the latest AI news”—and watch Claude work.

Set the resolution to XGA 1024×768. Higher resolutions get downsampled before Claude sees them (more on why in the next section), and lower ones get padded. 1024×768 is the sweet spot Anthropic optimized for. The implementation supports the Claude API directly, AWS Bedrock, and Google Vertex AI—useful if you’re already inside one of those ecosystems.

How the Agent Loop Works

The core pattern behind every computer use task is a sampling loop. Understanding it is the difference between using the reference implementation and building your own.

  • Step 1: Send your task plus the computer tool definition to the Claude API.
  • Step 2: Claude returns a tool_use block with a structured action—screenshot, left_click with x/y coordinates, type with text, scroll, key press, or drag.
  • Step 3: Your application executes the action in the sandboxed environment.
  • Step 4: Return the result (almost always a fresh screenshot) back to Claude.
  • Step 5: Repeat until Claude returns a stop_reason of end_turn or you hit max_iterations.
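
The five steps reduce to a short loop. This is a minimal sketch, not the reference loop.py: `client` stands in for a thin wrapper around the Messages API call (SDK setup, beta flags, and the tool definition are elided), `execute_action` is whatever drives your sandbox, and the model ID is illustrative.

```python
def run_agent_loop(client, task, execute_action, max_iterations=50,
                   model="claude-sonnet-4-6"):
    """Steps 1-5 as code: send task, execute each returned action in the
    sandbox, feed results back, stop on end_turn or the iteration cap."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_iterations):
        # Steps 1/4: call the API with the conversation so far
        response = client.create(model=model, messages=messages)
        messages.append({"role": "assistant", "content": response["content"]})
        # Step 2: collect any structured actions Claude returned
        tool_uses = [b for b in response["content"] if b["type"] == "tool_use"]
        if response["stop_reason"] == "end_turn" or not tool_uses:
            return messages  # Step 5: Claude is done
        # Step 3: execute each action and package the results
        results = [{"type": "tool_result", "tool_use_id": b["id"],
                    "content": execute_action(b["input"])} for b in tool_uses]
        messages.append({"role": "user", "content": results})
    raise RuntimeError("max_iterations reached; agent may be stuck")
```

The `max_iterations` cap doubles as the runaway-cost guard discussed below.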

The tool definition is schema-less. Unlike regular tools where you define an input schema, the computer use schema is built into Claude’s model. You only provide display_width_px, display_height_px, and optionally display_number and enable_zoom. Claude already knows what actions are available.
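
Concretely, the tool entry in a request carries only those display parameters. Field names below follow Anthropic's documented computer tool; the version string must match your model, and `enable_zoom` applies only to tool versions that support it:

```python
# No input_schema: the action space is baked into the model.
computer_tool = {
    "type": "computer_20250124",  # tool version must match the model
    "name": "computer",
    "display_width_px": 1024,
    "display_height_px": 768,
    "display_number": 1,          # optional, for X11 multi-display setups
}
```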

Token overhead adds up fast. The official Computer Use documentation specifies ~2,700 tokens in fixed overhead before your first message (466–499 for the system prompt, plus 735 per tool definition across the standard three tools). Each 1024×768 screenshot costs approximately 1,048 input tokens. A 20-step task with screenshots at each step runs $0.08–0.15 at Sonnet 4.6 pricing.
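
Those numbers are easy to sanity-check. A back-of-envelope sketch: `output_tokens_per_step` is a guess, and this ignores the conversation history each iteration re-sends (which prompt caching mitigates).

```python
def estimate_task_cost(steps, screenshot_tokens=1048, fixed_overhead=2700,
                       output_tokens_per_step=200,
                       input_price=3.0, output_price=15.0):
    """Rough USD cost for a computer-use task; prices are per million tokens."""
    input_tokens = fixed_overhead + steps * screenshot_tokens
    output_tokens = steps * output_tokens_per_step
    return (input_tokens * input_price + output_tokens * output_price) / 1e6

cost = estimate_task_cost(20)  # about $0.13, inside the $0.08-0.15 range above
```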

Add a max_iterations guard—typically 50–100—to prevent runaway costs if Claude gets stuck in a loop. The reference implementation’s loop.py handles this; study it before building your own. For extended sessions where screenshots accumulate fast, Claude’s Compaction API can manage context in long-running sessions.

One prompting pattern from Anthropic’s docs is worth memorizing:

“After each step, take a screenshot and carefully evaluate if you have achieved the right outcome. Explicitly show your thinking: ‘I have evaluated step X…’ If not correct, try again.”

This step-verify pattern prevents Claude from assuming success without checking. Without it, Claude will sometimes move to step 5 while step 3 actually failed—and you’ll spend more in wasted tokens debugging than you saved by skipping verification.

The Coordinate Scaling Gotcha

The API constrains images to 1,568 pixels on the longest edge and ~1.15 megapixels total. A 1512×982 Retina display gets downsampled to roughly 1330×864 before Claude analyzes it. Claude then returns click coordinates in that scaled space—but your tool executes those clicks in the original, larger resolution. Without scaling, every click misses its target.

The fix: resize screenshots before sending them to the API, then scale Claude’s returned coordinates back up.

from math import sqrt

# Scale so the longest edge fits in 1,568 px and total area in ~1.15 MP
scale_factor = min(1.0, 1568 / long_edge, sqrt(1_150_000 / total_pixels))

# Claude returns coordinates in the scaled (downsampled) space;
# divide by scale_factor to map them back to original screen coordinates
original_x = claude_x / scale_factor
original_y = claude_y / scale_factor

The reference implementation sidesteps this entirely by running at 1024×768—no downsampling required. If you deviate from that resolution, implement the scaling math or your agent will miss every target.
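
You can verify the Retina numbers above with the same formula. `scaled_size` is a hypothetical helper, not part of the reference implementation:

```python
from math import sqrt

def scaled_size(width, height, max_edge=1568, max_pixels=1_150_000):
    """Size the API will downsample a screenshot to, plus the scale factor."""
    scale = min(1.0, max_edge / max(width, height),
                sqrt(max_pixels / (width * height)))
    return round(width * scale), round(height * scale), scale

w, h, scale = scaled_size(1512, 982)  # a 1512x982 Retina display
# w, h land near the 1330x864 quoted above; a click Claude reports at
# (665, 432) in scaled space maps back to roughly (756, 491) on screen.

# At the reference resolution, no downsampling occurs:
# scaled_size(1024, 768) returns (1024, 768, 1.0)
```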


Security: Don’t Skip This Section

Anthropic ships a production-grade Docker kit with Bedrock/Vertex AI support and VNC remote access. The “research preview” label is liability management, not a description of readiness. The security risks are equally production-grade.

The critical threat is prompt injection. Claude follows instructions found in any content it sees on screen—a malicious webpage, a crafted email, a doctored image. Anthropic has added automatic classifiers that flag potential injections in screenshots, but their help center documentation is blunt: “These guardrails are part of how Claude is trained and instructed, but they aren’t absolute.”

Required mitigations if you’re running this outside a demo:

  • Isolation: Always run in a dedicated Docker container or VM with minimal privileges. Never on your primary machine.
  • Credential separation: No access to sensitive credentials, financial accounts, or production systems.
  • Domain allowlisting: Whitelist only the domains the task requires for any internet access.
  • Human-in-the-loop: Require confirmation for consequential actions—file deletion, form submission, agreeing to terms of service.
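
The domain-allowlisting mitigation is simple to enforce at the proxy or executor layer. A minimal sketch; the helper name and domain list are illustrative:

```python
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"example.com", "docs.example.com"}  # task-specific allowlist

def is_allowed(url: str) -> bool:
    """Permit only exact matches or true subdomains of allowlisted hosts."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)
```

Matching on the parsed hostname (rather than substring-searching the URL) blocks lookalike tricks such as `example.com.evil.com`.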

Anthropic blocks certain actions by default—stock trading, fund transfers, facial image scraping—but given the “aren’t absolute” caveat, plan accordingly. The OWASP AI Agent Security Cheat Sheet covers the full attack surface for production deployments. And close your password manager before enabling computer use in any environment where Claude has screen access.

Where to Take This Next

MindStudio documented savings of $15,000–30,000+ annually per person for legacy data entry alone. Three use cases worth building first for a realistic ROI test:

  • Legacy system data entry: Transfer spreadsheet rows into a form-based system in bulk—read row, fill form, submit, repeat.
  • UI/QA testing: Execute test plans across browser states and capture screenshots of failures. Claude describes what went wrong, so structured reports come free.
  • Competitive intelligence: Track pricing or product changes across competitor sites on a schedule.

Know the limits: a form fill takes 30–60 seconds, complex tasks take minutes, and anything with 2FA or CAPTCHAs on the critical path will stall. This is a precision tool, not a speed tool. If you’re in a TypeScript stack, the Vercel AI SDK’s computer use integration offers a typed abstraction layer. For multi-agent workflows, computer use agents slot in as tools within a larger pipeline—see our tutorial on multi-agent orchestration patterns.

The Selenium Moment

The developers learning these primitives today are doing the equivalent of writing raw Selenium tests in 2004. In two years, abstraction layers will hide the coordinate scaling, the token math, the screenshot loops. But the people who understand what’s underneath will be the ones building those layers. As SiliconANGLE reported, Constellation Research analyst Holger Mueller put it plainly: “In combo with Claude Dispatch, it can enable new levels of productivity, especially for developers.”

Claude’s OSWorld trajectory—14.9% to 72.5% in 16 months—suggests the next model release could push past GPT-5.4’s 75.0% threshold. The question isn’t whether computer use gets fast enough to replace RPA. It’s whether the agent replaces the application entirely, rather than just automating it.
