Skip to content

zhound420/bytebot-hawkeye

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

865 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Bytebot Logo

Bytebot: Open-Source AI Desktop Agent

An AI that has its own computer to complete tasks for you

ℹ️ A Holo-based variant is also in progress: Bytebot Hawkeye Holo, powered by the Holo 1.5 7B model.

Deploy on Railway

Resources & Translations

Hawkeye Fork Enhancements

Hawkeye layers precision tooling on top of upstream Bytebot so the agent can land clicks with far greater reliability:

Capability Hawkeye Upstream Bytebot
Grid overlay guidance Always-on 100 px grid with labeled axes and optional debug overlays toggled via BYTEBOT_GRID_OVERLAY/BYTEBOT_GRID_DEBUG, plus a live preview in the overlay capture. No persistent spatial scaffolding; relies on raw screenshots.
Smart Focus targeting Three-stage coarse→focus→click workflow with tunable grids and prompts described in Smart Focus System. Single-shot click reasoning without structured zoom or guardrails.
Progressive zoom capture Deterministic zoom ladder with cyan micro-grids that map local→global coordinates; see zoom samples. Manual zoom commands with no coordinate reconciliation.
Coordinate telemetry & accuracy Telemetry pipeline with BYTEBOT_COORDINATE_METRICS and BYTEBOT_COORDINATE_DEBUG, an attempt towards accuracy.(COORDINATE_ACCURACY_IMPROVEMENTS.md). No automated accuracy measurement or debug dataset.
Universal coordinate mapping Shared lookup in config/universal-coordinates.yaml bundled in repo and @bytebot/shared, auto-discovered without extra configuration. Requires custom configuration for consistent coordinate frames.
Universal element detection CV pipeline merges visual heuristics, OCR enrichments, and semantic roles to emit consistent UniversalUIElement metadata for buttons, inputs, and clickable controls. LLM prompts must infer UI semantics from raw OCR spans and manually chosen click targets.
Enhanced OpenCV 4.6.0+ pipeline Multi-method detection with template matching, feature detection (ORB/AKAZE), contour analysis, and advanced OCR preprocessing with morphological operations and CLAHE enhancement. Basic screenshot analysis without advanced computer vision techniques.
Real-time CV activity monitoring Live tracking of active computer vision methods with performance metrics, success rates, and UI indicators showing which CV methods are processing. API endpoints and SSE streams for real-time visibility. No visibility into which detection methods are active or their performance characteristics.
Accessible UI theming Header theme toggle powered by Next.js theme switching delivers high-contrast light/dark palettes so operators can pick the most legible view. Single default theme without in-app toggles.
Active Model desktop telemetry The desktop dashboard's Active Model card (under /desktop) continuously surfaces the agent's current provider, model alias, and streaming heartbeat so you can spot token stalls before they derail long-running sessions. No dedicated real-time status card—operators must tail logs to confirm the active model.

Flip individual systems off by setting the corresponding environment variables—BYTEBOT_UNIVERSAL_TEACHING, BYTEBOT_ADAPTIVE_CALIBRATION, BYTEBOT_ZOOM_REFINEMENT, or BYTEBOT_COORDINATE_METRICS—to false (default true). Enable deep-dive logs with BYTEBOT_COORDINATE_DEBUG=true when troubleshooting. Visit the /desktop route (see the screenshot above) to monitor the Active Model card while long-running tasks execute.

Smart Focus Targeting (Hawkeye Exclusive)

The fork’s Smart Focus workflow narrows attention in three deliberate passes—coarse region selection, focused capture, and final click—so the agent can reason about targets instead of guessing. Enable or tune it with BYTEBOT_SMART_FOCUS, BYTEBOT_OVERVIEW_GRID, BYTEBOT_REGION_GRID, BYTEBOT_FOCUSED_GRID, and related knobs documented in docs/SMART_FOCUS_SYSTEM.md.

Desktop accuracy overlay

Desktop Accuracy Drawer

The /desktop dashboard now ships with a Desktop Accuracy drawer that exposes the fork’s adaptive telemetry at a glance. The panel streams live stats for the currently selected session, lets operators jump between historical sessions with the session selector, and provides reset controls so you can zero out a learning run before capturing a new benchmark. Use the reset button to clear the in-memory metrics without restarting the daemon when you want a clean baseline for regression tests or demonstrations.

Desktop accuracy overlay

Learning Metrics Explained

To help you interpret the drawer’s live readouts, Hawkeye surfaces several learning metrics that highlight how the desktop agent is adapting:

  • Attempt count — The number of clicks evaluated during the active session. Use it to gauge how large the sample is before trusting the aggregate metrics.
  • Success rate — Percentage of attempts that landed within the configured smart click success radius. This reflects real-time precision as the agent iterates on a task.
  • Weighted offsets — Average X/Y drift in pixels, weighted by recency so the panel emphasizes the most recent behavior. Watch this to see whether recent tuning is nudging the cursor closer to targets.
  • Convergence — A decay-weighted score that trends toward 1.0 as the agent stops overshooting targets, signaling that the current calibration has stabilized.
  • Hotspots — Highlighted regions where misses cluster, helping you identify UI zones that need larger affordances, different prompts, or manual overrides.

Together, these metrics give you continuous feedback on how Hawkeye’s coordinate calibration improves over time and whether additional guardrails are necessary for stubborn workflows.

Enhanced Computer Vision Pipeline (OpenCV 4.6.0+)

Hawkeye leverages a comprehensive OpenCV 4.6.0 computer vision pipeline that dramatically improves UI automation accuracy through multiple detection methods:

Multi-Method Element Detection

  • Template Matching (TemplateMatcherService) - Multi-scale template matching with confidence scoring for pixel-perfect UI element matching
  • Feature Detection (FeatureMatcherService) - ORB and AKAZE feature matching with homography-based element localization, robust to UI variations
  • Contour Analysis (ContourDetectorService) - Shape-based detection for buttons, input fields, and icons using advanced morphological operations
  • Enhanced OCR Pipeline - Upgraded Tesseract preprocessing with morphological gradients, bilateral filtering, and CLAHE contrast enhancement

Intelligent Detection Orchestration

The EnhancedVisualDetectorService combines all CV methods intelligently:

// Comprehensive detection using all available methods
const result = await enhancedDetector.detectElements(screenshot, template, {
  useTemplateMatching: true,   // For exact UI matches
  useFeatureMatching: true,    // For robust element detection
  useContourDetection: true,   // For shape-based detection
  useOCR: true,               // For text-based elements
  combineResults: true        // Merge overlapping detections
});

Real-Time CV Activity Monitoring

  • Live Method Tracking - CVActivityIndicatorService tracks which CV methods are actively processing
  • Performance Metrics - Real-time success rates, processing times, and execution statistics
  • UI Integration - Server-Sent Events and REST endpoints for displaying active CV methods in the web interface
  • Debug Telemetry - Comprehensive method execution history for optimization

API Endpoints for CV Visibility

GET /cv-activity/status     # Current active methods snapshot
GET /cv-activity/active     # Quick active/inactive check
SSE /cv-activity/stream     # Real-time updates via Server-Sent Events
GET /cv-activity/performance # Method performance statistics

Universal Element Detection Pipeline

The enhanced system outputs structured UniversalUIElement objects by fusing:

  • Visual pattern detection (VisualPatternDetectorService) with OpenCV edge detection, CLAHE, and morphological operations
  • OCR enrichment through ElementDetectorService.detectElementsUniversal with advanced preprocessing techniques
  • Semantic analysis (TextSemanticAnalyzerService) for intent-based reasoning over raw UI text
  • Multi-method fusion that combines template, feature, contour, and OCR detections for maximum reliability

OpenCV 4.6.0 Capability Matrix

Active Methods: Template matching, Feature detection (ORB/AKAZE), Morphological operations, CLAHE, Gaussian blur, Bilateral filtering, Canny edge detection, Contour analysis, Adaptive thresholding

Preprocessing Pipeline: Multi-scale image processing, Noise reduction, Contrast enhancement, Edge enhancement, Shape analysis

UI Automation Features: Button detection, Input field identification, Icon recognition, Text extraction, Clickable element mapping

Keyboard & Shortcut Reliability

NutService on the desktop daemon parses compound shortcuts such as ctrl+shift+p, mixed-case modifiers, and platform aliases (cmd, option, win). Legacy arrays like ['Control', 'Shift', 'X'] continue to work, but LLM tool calls can now emit compact strings and rely on the daemon to normalize, validate, and execute the correct nut-js sequence.

Running the Default Compose Stack

To boot the Hawkeye desktop, agent, UI, and Postgres services without the LiteLLM proxy, use the standard compose topology:

docker compose -f docker/docker-compose.yml build
docker compose -f docker/docker-compose.yml up -d

Default endpoints:

  • Agent API – http://localhost:9991
  • Web UI – http://localhost:9992
  • Desktop noVNC – http://localhost:9990

If the UI reports ECONNREFUSED to 9991, ensure the agent container is healthy (docker compose ps bytebot-agent) and inspect its logs (docker compose logs bytebot-agent) for Prisma or DI failures.

Quick Start: Proxy Compose Stack

Desktop accuracy overlay

The fastest way to try Hawkeye is the proxy-enabled Docker Compose stack—it starts the desktop, agent, UI, Postgres, and LiteLLM proxy with every precision upgrade flipped on. Populate docker/.env with your model keys and the Hawkeye-specific toggles before you launch. OpenRouter and LMStudio are first-class in the default LiteLLM config, so set the matching environment variables and make sure the aliases in packages/bytebot-llm-proxy/litellm-config.yaml point to models you can reach:

  • OPENROUTER_API_KEY powers the openrouter-* aliases like openrouter-claude-3.7-sonnet.
  • LMStudio examples such as local-lmstudio-gemma-3-27b expect your local server’s api_base to match the running LMStudio instance.
  • BYTEBOT_GRID_OVERLAY=true keeps the labeled coordinate grid on every capture.
  • BYTEBOT_PROGRESSIVE_ZOOM_USE_AI=true enables the multi-zoom screenshot refinement.
  • BYTEBOT_SMART_FOCUS=true and BYTEBOT_SMART_FOCUS_MODEL=<litellm-alias> route Smart Focus through the proxy model you configure in the LiteLLM config.
  • BYTEBOT_COORDINATE_METRICS=true (plus optional BYTEBOT_COORDINATE_DEBUG=true) records the click accuracy telemetry that distinguishes the fork.
  • BYTEBOT_CV_ENHANCED_DETECTION=true enables the comprehensive OpenCV 4.6.0+ pipeline with template, feature, and contour detection.
  • BYTEBOT_CV_ACTIVITY_TRACKING=true activates real-time CV method monitoring and performance metrics for UI visibility.
cat <<'EOF' > docker/.env
# Provider keys for LiteLLM
OPENAI_API_KEY=sk-your-key
ANTHROPIC_API_KEY=...
OPENROUTER_API_KEY=...

# Hawkeye precision defaults
BYTEBOT_GRID_OVERLAY=true
BYTEBOT_PROGRESSIVE_ZOOM_USE_AI=true
BYTEBOT_SMART_FOCUS=true
BYTEBOT_SMART_FOCUS_MODEL=gpt-4o-mini
BYTEBOT_COORDINATE_METRICS=true

# Enhanced Computer Vision (OpenCV 4.6.0+)
BYTEBOT_CV_ENHANCED_DETECTION=true
BYTEBOT_CV_ACTIVITY_TRACKING=true
EOF

# Build images locally (default)
docker compose -f docker/docker-compose.proxy.yml up -d --build

# Prefer a different Postgres registry?
# export BYTEBOT_POSTGRES_IMAGE=postgres:16-alpine

# Or use pre-built registry images after authenticating:
# export BYTEBOT_DESKTOP_IMAGE=ghcr.io/bytebot-ai/bytebot-desktop:edge
# export BYTEBOT_AGENT_IMAGE=ghcr.io/bytebot-ai/bytebot-agent:edge
# export BYTEBOT_UI_IMAGE=ghcr.io/bytebot-ai/bytebot-ui:edge
# docker compose -f docker/docker-compose.proxy.yml up -d

Before you start the stack, edit packages/bytebot-llm-proxy/litellm-config.yaml so each alias maps to the OpenRouter endpoints or LMStudio bases you control. After saving changes, restart the bytebot-llm-proxy container (docker compose restart bytebot-llm-proxy) to reload the updated routing.

Alternative Deployments

Looking for a different hosting environment? Follow the upstream guides for the full walkthroughs:

Stay in Sync with Upstream Bytebot

For a full tour of the core desktop agent, installation options, and API surface, follow the upstream README and docs. Hawkeye inherits everything there—virtual desktop orchestration, task APIs, and deployment guides—so this fork focuses documentation on the precision tooling and measurement upgrades described above.

Further Reading

Operations & Tuning

Smart Click Success Radius

Smart click telemetry now records the real cursor landing position. Tune the pass/fail threshold by setting an environment variable on the desktop daemon:

export BYTEBOT_SMART_CLICK_SUCCESS_RADIUS=12  # pixels of acceptable drift

Increase the value if the VNC stream or hardware introduces more cursor drift, or decrease it to tighten the definition of a successful AI-guided click.

About

Bytebot is a self-hosted AI desktop agent that automates computer tasks through natural language commands, operating within a containerized Linux desktop environment.

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages

  • TypeScript 81.6%
  • JavaScript 12.7%
  • Dockerfile 1.6%
  • Shell 1.5%
  • CSS 0.7%
  • Go Template 0.5%
  • Other 1.4%