Bytebot: Open-Source AI Desktop Agent

An AI that has its own computer to complete tasks for you

ℹ️ A Holo-based variant is also in progress: Bytebot Hawkeye Holo, powered by the Holo 1.5 7B model.

Resources & Translations

🌐 Website • 📚 Documentation • 💬 Discord • 𝕏 Twitter

Hawkeye Fork Enhancements

Hawkeye layers precision tooling on top of upstream Bytebot so the agent can land clicks with far greater reliability:

Capability	Hawkeye	Upstream Bytebot
Grid overlay guidance	Always-on 100 px grid with labeled axes and optional debug overlays toggled via `BYTEBOT_GRID_OVERLAY`/`BYTEBOT_GRID_DEBUG`, plus a live preview in the overlay capture.	No persistent spatial scaffolding; relies on raw screenshots.
Smart Focus targeting	Three-stage coarse→focus→click workflow with tunable grids and prompts described in Smart Focus System.	Single-shot click reasoning without structured zoom or guardrails.
Progressive zoom capture	Deterministic zoom ladder with cyan micro-grids that map local→global coordinates; see zoom samples.	Manual zoom commands with no coordinate reconciliation.
Coordinate telemetry & accuracy	Telemetry pipeline with `BYTEBOT_COORDINATE_METRICS` and `BYTEBOT_COORDINATE_DEBUG` for measuring click accuracy, documented in COORDINATE_ACCURACY_IMPROVEMENTS.md.	No automated accuracy measurement or debug dataset.
Universal coordinate mapping	Shared lookup in `config/universal-coordinates.yaml` bundled in repo and `@bytebot/shared`, auto-discovered without extra configuration.	Requires custom configuration for consistent coordinate frames.
Universal element detection	CV pipeline merges visual heuristics, OCR enrichments, and semantic roles to emit consistent `UniversalUIElement` metadata for buttons, inputs, and clickable controls.	LLM prompts must infer UI semantics from raw OCR spans and manually chosen click targets.
Enhanced OpenCV 4.6.0+ pipeline	Multi-method detection with template matching, feature detection (ORB/AKAZE), contour analysis, and advanced OCR preprocessing with morphological operations and CLAHE enhancement.	Basic screenshot analysis without advanced computer vision techniques.
Real-time CV activity monitoring	Live tracking of active computer vision methods with performance metrics, success rates, and UI indicators showing which CV methods are processing. API endpoints and SSE streams for real-time visibility.	No visibility into which detection methods are active or their performance characteristics.
Accessible UI theming	Header theme toggle powered by Next.js theme switching delivers high-contrast light/dark palettes so operators can pick the most legible view.	Single default theme without in-app toggles.
Active Model desktop telemetry	The desktop dashboard's Active Model card (under `/desktop`) continuously surfaces the agent's current provider, model alias, and streaming heartbeat so you can spot token stalls before they derail long-running sessions.	No dedicated real-time status card—operators must tail logs to confirm the active model.

Flip individual systems off by setting the corresponding environment variables—BYTEBOT_UNIVERSAL_TEACHING, BYTEBOT_ADAPTIVE_CALIBRATION, BYTEBOT_ZOOM_REFINEMENT, or BYTEBOT_COORDINATE_METRICS—to false (default true). Enable deep-dive logs with BYTEBOT_COORDINATE_DEBUG=true when troubleshooting. Visit the /desktop route (see the screenshot above) to monitor the Active Model card while long-running tasks execute.
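As a rough illustration of the "default true" behaviour described above, the sketch below reads the four toggles from an environment map. The helper names (`isEnabled`, `readFlags`) are hypothetical and the parsing rule (only the literal string "false" disables a flag) is an assumption; the actual Hawkeye services may interpret values differently.

```typescript
// Hypothetical helper: illustrates default-on flags, not Hawkeye's real parser.
type Env = Record<string, string | undefined>;

// Toggles named in this README that are enabled unless explicitly disabled.
const DEFAULT_ON_FLAGS = [
  "BYTEBOT_UNIVERSAL_TEACHING",
  "BYTEBOT_ADAPTIVE_CALIBRATION",
  "BYTEBOT_ZOOM_REFINEMENT",
  "BYTEBOT_COORDINATE_METRICS",
];

// Assumption: an unset or empty value falls back to the default,
// and only the string "false" (any case) turns a flag off.
function isEnabled(env: Env, flagName: string, defaultValue = true): boolean {
  const raw = env[flagName];
  if (raw === undefined || raw.trim() === "") return defaultValue;
  return raw.trim().toLowerCase() !== "false";
}

// Resolve every default-on flag against the given environment.
function readFlags(env: Env): Record<string, boolean> {
  const resolved: Record<string, boolean> = {};
  for (const flagName of DEFAULT_ON_FLAGS) {
    resolved[flagName] = isEnabled(env, flagName);
  }
  return resolved;
}

// Example: only zoom refinement is switched off.
const flags = readFlags({ BYTEBOT_ZOOM_REFINEMENT: "false" });
console.log(flags);
```

In a Node.js service the same helper could be called with `process.env`; an opt-in flag such as BYTEBOT_COORDINATE_DEBUG would pass `false` as the default.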

Smart Focus Targeting (Hawkeye Exclusive)

The fork’s Smart Focus workflow narrows attention in three deliberate passes—coarse region selection, focused capture, and final click—so the agent can reason about targets instead of guessing. Enable or tune it with BYTEBOT_SMART_FOCUS, BYTEBOT_OVERVIEW_GRID, BYTEBOT_REGION_GRID, BYTEBOT_FOCUSED_GRID, and related knobs documented in docs/SMART_FOCUS_SYSTEM.md.
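The arithmetic behind the three passes can be sketched as follows. This is a minimal illustration, assuming a square overview grid, a uniform zoom factor, and a rough guess that lies inside the screen; `coarseRegion`, `localToGlobal`, and the sample numbers are hypothetical and not taken from the Hawkeye source.

```typescript
interface Point { x: number; y: number; }
interface Region { x: number; y: number; width: number; height: number; }
interface Size { width: number; height: number; }

// Pass 1 (coarse): snap a rough guess to the overview grid cell that contains it.
function coarseRegion(guess: Point, cellSize: number, display: Size): Region {
  const x = Math.floor(guess.x / cellSize) * cellSize;
  const y = Math.floor(guess.y / cellSize) * cellSize;
  return {
    x,
    y,
    // Cells on the right and bottom edges may be clipped by the screen bounds.
    width: Math.min(cellSize, display.width - x),
    height: Math.min(cellSize, display.height - y),
  };
}

// Passes 2 and 3 (focus, click): a point picked on the zoomed capture is
// scaled back down and offset by the region origin to get screen coordinates.
function localToGlobal(region: Region, zoom: number, local: Point): Point {
  return {
    x: Math.round(region.x + local.x / zoom),
    y: Math.round(region.y + local.y / zoom),
  };
}

const displaySize: Size = { width: 1920, height: 1080 };
const focusRegion = coarseRegion({ x: 850, y: 430 }, 200, displaySize);
const globalClick = localToGlobal(focusRegion, 4, { x: 212, y: 96 });
console.log(focusRegion, globalClick); // region origin (800, 400), click (853, 424)
```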

Desktop Accuracy Drawer

The /desktop dashboard now ships with a Desktop Accuracy drawer that exposes the fork’s adaptive telemetry at a glance. The panel streams live stats for the currently selected session, lets operators jump between historical sessions with the session selector, and provides reset controls so you can zero out a learning run before capturing a new benchmark. Use the reset button to clear the in-memory metrics without restarting the daemon when you want a clean baseline for regression tests or demonstrations.

Learning Metrics Explained

To help you interpret the drawer’s live readouts, Hawkeye surfaces several learning metrics that highlight how the desktop agent is adapting:

Attempt count — The number of clicks evaluated during the active session. Use it to gauge how large the sample is before trusting the aggregate metrics.
Success rate — Percentage of attempts that landed within the configured smart click success radius. This reflects real-time precision as the agent iterates on a task.
Weighted offsets — Average X/Y drift in pixels, weighted by recency so the panel emphasizes the most recent behavior. Watch this to see whether recent tuning is nudging the cursor closer to targets.
Convergence — A decay-weighted score that trends toward 1.0 as the agent stops overshooting targets, signaling that the current calibration has stabilized.
Hotspots — Highlighted regions where misses cluster, helping you identify UI zones that need larger affordances, different prompts, or manual overrides.

Together, these metrics give you continuous feedback on how Hawkeye’s coordinate calibration improves over time and whether additional guardrails are necessary for stubborn workflows.
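To make the first three readouts concrete, here is a small sketch that derives attempt count, success rate, and recency-weighted offsets from a list of attempts. The `Attempt` shape, the `summarize` function, and the exponential decay factor are assumptions for illustration; the README does not specify the weighting Hawkeye actually uses.

```typescript
interface Attempt {
  dx: number;        // actual.x - target.x, in pixels
  dy: number;        // actual.y - target.y, in pixels
  success: boolean;  // whether the click landed inside the success radius
}

interface Summary {
  attemptCount: number;
  successRate: number;                      // percentage, 0 to 100
  weightedOffset: { x: number; y: number }; // recency-weighted mean drift
}

// Assumption: the newest attempt has weight 1 and each older attempt is
// multiplied by `decay`, so recent behaviour dominates the average.
function summarize(attempts: Attempt[], decay = 0.8): Summary {
  let weightSum = 0;
  let weightedX = 0;
  let weightedY = 0;
  let successes = 0;
  attempts.forEach((attempt, index) => {
    const weight = Math.pow(decay, attempts.length - 1 - index);
    weightSum += weight;
    weightedX += weight * attempt.dx;
    weightedY += weight * attempt.dy;
    if (attempt.success) successes += 1;
  });
  const count = attempts.length;
  return {
    attemptCount: count,
    successRate: count === 0 ? 0 : (successes / count) * 100,
    weightedOffset:
      count === 0 ? { x: 0, y: 0 } : { x: weightedX / weightSum, y: weightedY / weightSum },
  };
}

// An early miss (10 px right) followed by a near hit (2 px right).
const summary = summarize(
  [
    { dx: 10, dy: 0, success: false },
    { dx: 2, dy: 0, success: true },
  ],
  0.5,
);
console.log(summary); // successRate 50, weightedOffset.x = 14/3 (about 4.67)
```

With a decay of 0.5 the older miss counts half as much as the latest click, which is why the weighted X offset sits closer to 2 than the plain average of 6.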

Enhanced Computer Vision Pipeline (OpenCV 4.6.0+)

Hawkeye uses an OpenCV 4.6.0+ computer vision pipeline that combines several detection methods to improve UI automation accuracy:

Multi-Method Element Detection

Template Matching (TemplateMatcherService) - Multi-scale template matching with confidence scoring for pixel-perfect UI element matching
Feature Detection (FeatureMatcherService) - ORB and AKAZE feature matching with homography-based element localization, robust to UI variations
Contour Analysis (ContourDetectorService) - Shape-based detection for buttons, input fields, and icons using advanced morphological operations
Enhanced OCR Pipeline - Upgraded Tesseract preprocessing with morphological gradients, bilateral filtering, and CLAHE contrast enhancement

Intelligent Detection Orchestration

The EnhancedVisualDetectorService combines all CV methods intelligently:

// Comprehensive detection using all available methods
const result = await enhancedDetector.detectElements(screenshot, template, {
  useTemplateMatching: true,   // For exact UI matches
  useFeatureMatching: true,    // For robust element detection
  useContourDetection: true,   // For shape-based detection
  useOCR: true,               // For text-based elements
  combineResults: true        // Merge overlapping detections
});
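How `combineResults` merges overlapping detections is not spelled out here. One common approach is greedy non-maximum suppression on intersection-over-union (IoU), sketched below; the `Detection` shape, the `mergeDetections` function, and the 0.5 threshold are illustrative assumptions rather than the EnhancedVisualDetectorService implementation.

```typescript
interface Detection {
  method: string;     // e.g. "template", "feature", "contour", "ocr"
  x: number;
  y: number;
  width: number;
  height: number;
  confidence: number; // 0 to 1
}

// Intersection-over-union of two axis-aligned boxes.
function iou(a: Detection, b: Detection): number {
  const left = Math.max(a.x, b.x);
  const top = Math.max(a.y, b.y);
  const right = Math.min(a.x + a.width, b.x + b.width);
  const bottom = Math.min(a.y + a.height, b.y + b.height);
  const intersection = Math.max(0, right - left) * Math.max(0, bottom - top);
  const union = a.width * a.height + b.width * b.height - intersection;
  return union === 0 ? 0 : intersection / union;
}

// Greedy non-maximum suppression: keep the most confident box and drop any
// lower-confidence box that overlaps an already kept one too strongly.
function mergeDetections(detections: Detection[], threshold = 0.5): Detection[] {
  const sorted = detections.slice().sort((a, b) => b.confidence - a.confidence);
  const kept: Detection[] = [];
  for (const candidate of sorted) {
    const overlapsKept = kept.some((k) => iou(k, candidate) >= threshold);
    if (!overlapsKept) kept.push(candidate);
  }
  return kept;
}

const merged = mergeDetections([
  { method: "template", x: 100, y: 100, width: 80, height: 30, confidence: 0.9 },
  { method: "contour", x: 104, y: 102, width: 80, height: 30, confidence: 0.6 },
  { method: "ocr", x: 400, y: 300, width: 50, height: 20, confidence: 0.8 },
]);
console.log(merged.map((d) => d.method)); // ["template", "ocr"]
```

The contour box overlaps the template box with an IoU of roughly 0.8, so only the higher-confidence template detection survives for that element.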

Real-Time CV Activity Monitoring

Live Method Tracking - CVActivityIndicatorService tracks which CV methods are actively processing
Performance Metrics - Real-time success rates, processing times, and execution statistics
UI Integration - Server-Sent Events and REST endpoints for displaying active CV methods in the web interface
Debug Telemetry - Comprehensive method execution history for optimization

API Endpoints for CV Visibility

GET /cv-activity/status     # Current active methods snapshot
GET /cv-activity/active     # Quick active/inactive check
SSE /cv-activity/stream     # Real-time updates via Server-Sent Events
GET /cv-activity/performance # Method performance statistics
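A client of the stream endpoint would receive text in the standard Server-Sent Events framing (`data:` lines separated by blank lines). The sketch below parses such a chunk; the JSON payload and its `activeMethods` field are hypothetical, since the actual response schema is not documented in this README.

```typescript
interface SseEvent {
  eventName: string; // defaults to "message" when no event: line is present
  data: string;      // data: lines of one event joined with newlines
}

// Minimal parser for the Server-Sent Events text format.
function parseSse(chunk: string): SseEvent[] {
  const events: SseEvent[] = [];
  for (const block of chunk.split(/\r?\n\r?\n/)) {
    let eventName = "message";
    const dataLines: string[] = [];
    for (const line of block.split(/\r?\n/)) {
      if (line.indexOf("event:") === 0) {
        eventName = line.slice(6).trim();
      } else if (line.indexOf("data:") === 0) {
        // A single leading space after the colon is not part of the payload.
        dataLines.push(line.slice(5).replace(/^ /, ""));
      }
    }
    if (dataLines.length > 0) {
      events.push({ eventName, data: dataLines.join("\n") });
    }
  }
  return events;
}

// Hypothetical payloads: the real /cv-activity/stream schema may differ.
const sampleStream =
  'data: {"activeMethods":["template-matching"]}\n\n' +
  'data: {"activeMethods":[]}\n\n';

const updates = parseSse(sampleStream).map((e) => JSON.parse(e.data));
console.log(updates);
```

In a browser, `new EventSource(url)` performs this framing automatically and delivers each payload through its `onmessage` handler.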

Universal Element Detection Pipeline

The enhanced system outputs structured UniversalUIElement objects by fusing:

Visual pattern detection (VisualPatternDetectorService) with OpenCV edge detection, CLAHE, and morphological operations
OCR enrichment through ElementDetectorService.detectElementsUniversal with advanced preprocessing techniques
Semantic analysis (TextSemanticAnalyzerService) for intent-based reasoning over raw UI text
Multi-method fusion that combines template, feature, contour, and OCR detections for maximum reliability

OpenCV 4.6.0 Capability Matrix

✅ Active Methods: Template matching, Feature detection (ORB/AKAZE), Morphological operations, CLAHE, Gaussian blur, Bilateral filtering, Canny edge detection, Contour analysis, Adaptive thresholding

✅ Preprocessing Pipeline: Multi-scale image processing, Noise reduction, Contrast enhancement, Edge enhancement, Shape analysis

✅ UI Automation Features: Button detection, Input field identification, Icon recognition, Text extraction, Clickable element mapping

Keyboard & Shortcut Reliability

NutService on the desktop daemon parses compound shortcuts such as ctrl+shift+p, mixed-case modifiers, and platform aliases (cmd, option, win). Legacy arrays like ['Control', 'Shift', 'X'] continue to work, but LLM tool calls can now emit compact strings and rely on the daemon to normalize, validate, and execute the correct nut-js sequence.
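The normalization step can be pictured with the sketch below, which accepts either a compact string or a legacy array and returns canonical key names. The alias table, the canonical names ("Control", "Alt", "Meta"), and the `normalizeShortcut` function are assumptions for illustration; NutService's real mapping to nut-js keys is not shown in this README.

```typescript
// Hypothetical alias table: lower-cased modifier spellings to canonical names.
const MODIFIER_ALIASES: Record<string, string> = {
  ctrl: "Control",
  control: "Control",
  shift: "Shift",
  alt: "Alt",
  option: "Alt",
  cmd: "Meta",
  command: "Meta",
  win: "Meta",
  meta: "Meta",
};

// Accepts "ctrl+shift+p" or ["Control", "Shift", "X"] and returns an ordered,
// de-duplicated list of canonical key names.
function normalizeShortcut(input: string | string[]): string[] {
  const parts = Array.isArray(input) ? input : input.split("+");
  const keys: string[] = [];
  for (const part of parts) {
    const token = part.trim();
    if (token === "") {
      throw new Error("Empty key in shortcut: " + JSON.stringify(input));
    }
    const alias: string | undefined = MODIFIER_ALIASES[token.toLowerCase()];
    // Single characters are upper-cased; other key names pass through unchanged.
    const key = alias !== undefined ? alias : token.length === 1 ? token.toUpperCase() : token;
    if (keys.indexOf(key) === -1) keys.push(key);
  }
  return keys;
}

console.log(normalizeShortcut("Ctrl+Shift+p"));            // ["Control", "Shift", "P"]
console.log(normalizeShortcut(["Control", "Shift", "X"])); // ["Control", "Shift", "X"]
console.log(normalizeShortcut("cmd+option+k"));            // ["Meta", "Alt", "K"]
```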

Running the Default Compose Stack

To boot the Hawkeye desktop, agent, UI, and Postgres services without the LiteLLM proxy, use the standard compose topology:

docker compose -f docker/docker-compose.yml build
docker compose -f docker/docker-compose.yml up -d

Default endpoints:

Agent API – http://localhost:9991
Web UI – http://localhost:9992
Desktop noVNC – http://localhost:9990

If the UI reports ECONNREFUSED to 9991, ensure the agent container is healthy (docker compose -f docker/docker-compose.yml ps bytebot-agent) and inspect its logs (docker compose -f docker/docker-compose.yml logs bytebot-agent) for Prisma or DI failures.

Quick Start: Proxy Compose Stack

The fastest way to try Hawkeye is the proxy-enabled Docker Compose stack—it starts the desktop, agent, UI, Postgres, and LiteLLM proxy with every precision upgrade flipped on. Populate docker/.env with your model keys and the Hawkeye-specific toggles before you launch. OpenRouter and LMStudio are first-class in the default LiteLLM config, so set the matching environment variables and make sure the aliases in packages/bytebot-llm-proxy/litellm-config.yaml point to models you can reach:

OPENROUTER_API_KEY powers the openrouter-* aliases like openrouter-claude-3.7-sonnet.
LMStudio examples such as local-lmstudio-gemma-3-27b expect your local server’s api_base to match the running LMStudio instance.
BYTEBOT_GRID_OVERLAY=true keeps the labeled coordinate grid on every capture.
BYTEBOT_PROGRESSIVE_ZOOM_USE_AI=true enables the multi-zoom screenshot refinement.
BYTEBOT_SMART_FOCUS=true and BYTEBOT_SMART_FOCUS_MODEL=<litellm-alias> route Smart Focus through the proxy model you configure in the LiteLLM config.
BYTEBOT_COORDINATE_METRICS=true (plus optional BYTEBOT_COORDINATE_DEBUG=true) records the click accuracy telemetry that distinguishes the fork.
BYTEBOT_CV_ENHANCED_DETECTION=true enables the comprehensive OpenCV 4.6.0+ pipeline with template, feature, and contour detection.
BYTEBOT_CV_ACTIVITY_TRACKING=true activates real-time CV method monitoring and performance metrics for UI visibility.

cat <<'EOF' > docker/.env
# Provider keys for LiteLLM
OPENAI_API_KEY=sk-your-key
ANTHROPIC_API_KEY=...
OPENROUTER_API_KEY=...

# Hawkeye precision defaults
BYTEBOT_GRID_OVERLAY=true
BYTEBOT_PROGRESSIVE_ZOOM_USE_AI=true
BYTEBOT_SMART_FOCUS=true
BYTEBOT_SMART_FOCUS_MODEL=gpt-4o-mini
BYTEBOT_COORDINATE_METRICS=true

# Enhanced Computer Vision (OpenCV 4.6.0+)
BYTEBOT_CV_ENHANCED_DETECTION=true
BYTEBOT_CV_ACTIVITY_TRACKING=true
EOF

# Build images locally (default)
docker compose -f docker/docker-compose.proxy.yml up -d --build

# Prefer a different Postgres registry?
# export BYTEBOT_POSTGRES_IMAGE=postgres:16-alpine

# Or use pre-built registry images after authenticating:
# export BYTEBOT_DESKTOP_IMAGE=ghcr.io/bytebot-ai/bytebot-desktop:edge
# export BYTEBOT_AGENT_IMAGE=ghcr.io/bytebot-ai/bytebot-agent:edge
# export BYTEBOT_UI_IMAGE=ghcr.io/bytebot-ai/bytebot-ui:edge
# docker compose -f docker/docker-compose.proxy.yml up -d

Before you start the stack, edit packages/bytebot-llm-proxy/litellm-config.yaml so each alias maps to the OpenRouter endpoints or LMStudio bases you control. If you change the file while the stack is already running, restart the bytebot-llm-proxy container (docker compose -f docker/docker-compose.proxy.yml restart bytebot-llm-proxy) to reload the updated routing.

Alternative Deployments

Looking for a different hosting environment? Follow the upstream guides for the full walkthroughs.

Stay in Sync with Upstream Bytebot

For a full tour of the core desktop agent, installation options, and API surface, follow the upstream README and docs. Hawkeye inherits everything there—virtual desktop orchestration, task APIs, and deployment guides—so this fork focuses documentation on the precision tooling and measurement upgrades described above.

Operations & Tuning

Smart Click Success Radius

Smart click telemetry now records the real cursor landing position. Tune the pass/fail threshold by setting an environment variable on the desktop daemon:

export BYTEBOT_SMART_CLICK_SUCCESS_RADIUS=12  # pixels of acceptable drift

Increase the value if the VNC stream or hardware introduces more cursor drift, or decrease it to tighten the definition of a successful AI-guided click.
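The effect of the radius on pass/fail classification can be sketched as below. This assumes the daemon compares the Euclidean distance between target and landing position against the radius; the function names and the fallback value of 12 (borrowed from the example above) are hypothetical.

```typescript
interface ClickPoint { x: number; y: number; }

// Read the radius from an environment map, falling back when the variable is
// unset, empty, non-numeric, or not positive. The fallback is an assumption.
function readSuccessRadius(env: Record<string, string | undefined>, fallback = 12): number {
  const parsed = Number(env["BYTEBOT_SMART_CLICK_SUCCESS_RADIUS"]);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : fallback;
}

// Assumption: a click succeeds when it lands within `radius` pixels of the
// target, measured as straight-line distance.
function isSuccessfulClick(target: ClickPoint, actual: ClickPoint, radius: number): boolean {
  return Math.hypot(actual.x - target.x, actual.y - target.y) <= radius;
}

const target: ClickPoint = { x: 500, y: 300 };
const landed: ClickPoint = { x: 509, y: 307 }; // about 11.4 px from the target

const looseRadius = readSuccessRadius({ BYTEBOT_SMART_CLICK_SUCCESS_RADIUS: "12" });
const tightRadius = readSuccessRadius({ BYTEBOT_SMART_CLICK_SUCCESS_RADIUS: "8" });

console.log(isSuccessfulClick(target, landed, looseRadius)); // true
console.log(isSuccessfulClick(target, landed, tightRadius)); // false
```

The same click flips from success to failure as the radius tightens, so changing this value also changes the success rate reported by the telemetry.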
