An AI that has its own computer to complete tasks for you
ℹ️ A Holo-based variant is also in progress: Bytebot Hawkeye Holo, powered by the Holo 1.5 7B model.
Hawkeye layers precision tooling on top of upstream Bytebot so the agent can land clicks with far greater reliability:
| Capability | Hawkeye | Upstream Bytebot |
|---|---|---|
| Grid overlay guidance | Always-on 100 px grid with labeled axes and optional debug overlays toggled via `BYTEBOT_GRID_OVERLAY`/`BYTEBOT_GRID_DEBUG`, plus a live preview in the overlay capture. | No persistent spatial scaffolding; relies on raw screenshots. |
| Smart Focus targeting | Three-stage coarse→focus→click workflow with tunable grids and prompts described in Smart Focus System. | Single-shot click reasoning without structured zoom or guardrails. |
| Progressive zoom capture | Deterministic zoom ladder with cyan micro-grids that map local→global coordinates; see zoom samples. | Manual zoom commands with no coordinate reconciliation. |
| Coordinate telemetry & accuracy | Telemetry pipeline gated by `BYTEBOT_COORDINATE_METRICS` and `BYTEBOT_COORDINATE_DEBUG` that measures click accuracy (see COORDINATE_ACCURACY_IMPROVEMENTS.md). | No automated accuracy measurement or debug dataset. |
| Universal coordinate mapping | Shared lookup in `config/universal-coordinates.yaml`, bundled in the repo and `@bytebot/shared`, auto-discovered without extra configuration. | Requires custom configuration for consistent coordinate frames. |
| Universal element detection | CV pipeline merges visual heuristics, OCR enrichments, and semantic roles to emit consistent UniversalUIElement metadata for buttons, inputs, and clickable controls. | LLM prompts must infer UI semantics from raw OCR spans and manually chosen click targets. |
| Enhanced OpenCV 4.6.0+ pipeline | Multi-method detection with template matching, feature detection (ORB/AKAZE), contour analysis, and advanced OCR preprocessing with morphological operations and CLAHE enhancement. | Basic screenshot analysis without advanced computer vision techniques. |
| Real-time CV activity monitoring | Live tracking of active computer vision methods with performance metrics, success rates, and UI indicators showing which CV methods are processing. API endpoints and SSE streams for real-time visibility. | No visibility into which detection methods are active or their performance characteristics. |
| Accessible UI theming | Header theme toggle powered by Next.js theme switching delivers high-contrast light/dark palettes so operators can pick the most legible view. | Single default theme without in-app toggles. |
| Active Model desktop telemetry | The desktop dashboard's Active Model card (under `/desktop`) continuously surfaces the agent's current provider, model alias, and streaming heartbeat so you can spot token stalls before they derail long-running sessions. | No dedicated real-time status card; operators must tail logs to confirm the active model. |
Flip individual systems off by setting the corresponding environment variables (`BYTEBOT_UNIVERSAL_TEACHING`, `BYTEBOT_ADAPTIVE_CALIBRATION`, `BYTEBOT_ZOOM_REFINEMENT`, or `BYTEBOT_COORDINATE_METRICS`) to `false` (default `true`). Enable deep-dive logs with `BYTEBOT_COORDINATE_DEBUG=true` when troubleshooting. Visit the `/desktop` route (see the screenshot above) to monitor the Active Model card while long-running tasks execute.
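For instance, a minimal override that disables adaptive calibration and zoom refinement while keeping telemetry and verbose debug logging on might look like this (values illustrative):

```shell
# Turn off selected Hawkeye subsystems (all default to true)
export BYTEBOT_ADAPTIVE_CALIBRATION=false
export BYTEBOT_ZOOM_REFINEMENT=false

# Keep coordinate telemetry and enable deep-dive debug logs
export BYTEBOT_COORDINATE_METRICS=true
export BYTEBOT_COORDINATE_DEBUG=true
```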
The fork’s Smart Focus workflow narrows attention in three deliberate passes (coarse region selection, focused capture, and final click) so the agent can reason about targets instead of guessing. Enable or tune it with `BYTEBOT_SMART_FOCUS`, `BYTEBOT_OVERVIEW_GRID`, `BYTEBOT_REGION_GRID`, `BYTEBOT_FOCUSED_GRID`, and related knobs documented in `docs/SMART_FOCUS_SYSTEM.md`.
The /desktop dashboard now ships with a Desktop Accuracy drawer that exposes the fork’s adaptive telemetry at a glance. The panel streams live stats for the currently selected session, lets operators jump between historical sessions with the session selector, and provides reset controls so you can zero out a learning run before capturing a new benchmark. Use the reset button to clear the in-memory metrics without restarting the daemon when you want a clean baseline for regression tests or demonstrations.
To help you interpret the drawer’s live readouts, Hawkeye surfaces several learning metrics that highlight how the desktop agent is adapting:
- Attempt count — The number of clicks evaluated during the active session. Use it to gauge how large the sample is before trusting the aggregate metrics.
- Success rate — Percentage of attempts that landed within the configured smart click success radius. This reflects real-time precision as the agent iterates on a task.
- Weighted offsets — Average X/Y drift in pixels, weighted by recency so the panel emphasizes the most recent behavior. Watch this to see whether recent tuning is nudging the cursor closer to targets.
- Convergence — A decay-weighted score that trends toward 1.0 as the agent stops overshooting targets, signaling that the current calibration has stabilized.
- Hotspots — Highlighted regions where misses cluster, helping you identify UI zones that need larger affordances, different prompts, or manual overrides.
Together, these metrics give you continuous feedback on how Hawkeye’s coordinate calibration improves over time and whether additional guardrails are necessary for stubborn workflows.
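Hawkeye's exact formulas are internal, but the general technique behind the weighted-offset and convergence readouts is decay-weighted averaging, where recent attempts count more than old ones. A hypothetical sketch (function names, the decay factor, and the 12 px radius are illustrative assumptions, not the fork's actual code):

```typescript
// Hypothetical sketch of recency-weighted click telemetry.
// Newer attempts receive exponentially larger weights (decay^0 = 1 for the latest).
interface ClickAttempt {
  dx: number; // horizontal miss in pixels (landing - target)
  dy: number; // vertical miss in pixels
}

function weightedOffsets(attempts: ClickAttempt[], decay = 0.8) {
  let wSum = 0, xSum = 0, ySum = 0;
  attempts.forEach((a, i) => {
    const w = Math.pow(decay, attempts.length - 1 - i); // oldest gets smallest weight
    wSum += w;
    xSum += w * a.dx;
    ySum += w * a.dy;
  });
  return { x: xSum / wSum, y: ySum / wSum };
}

// Convergence trends toward 1.0 as recent misses shrink relative to a
// success radius (12 px here, mirroring the smart click default mentioned below).
function convergence(attempts: ClickAttempt[], radius = 12, decay = 0.8): number {
  let wSum = 0, score = 0;
  attempts.forEach((a, i) => {
    const w = Math.pow(decay, attempts.length - 1 - i);
    const dist = Math.hypot(a.dx, a.dy);
    wSum += w;
    score += w * Math.max(0, 1 - dist / radius); // 1 = dead center, 0 = outside radius
  });
  return wSum === 0 ? 0 : score / wSum;
}
```

With this shape, a session whose recent misses are shrinking produces a rising convergence score even if early attempts were wild, which matches the drawer's emphasis on recent behavior.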
Hawkeye leverages a comprehensive OpenCV 4.6.0 computer vision pipeline that dramatically improves UI automation accuracy through multiple detection methods:
- Template Matching (`TemplateMatcherService`) - Multi-scale template matching with confidence scoring for pixel-perfect UI element matching
- Feature Detection (`FeatureMatcherService`) - ORB and AKAZE feature matching with homography-based element localization, robust to UI variations
- Contour Analysis (`ContourDetectorService`) - Shape-based detection for buttons, input fields, and icons using advanced morphological operations
- Enhanced OCR Pipeline - Upgraded Tesseract preprocessing with morphological gradients, bilateral filtering, and CLAHE contrast enhancement
The EnhancedVisualDetectorService combines all CV methods intelligently:
```typescript
// Comprehensive detection using all available methods
const result = await enhancedDetector.detectElements(screenshot, template, {
  useTemplateMatching: true, // For exact UI matches
  useFeatureMatching: true,  // For robust element detection
  useContourDetection: true, // For shape-based detection
  useOCR: true,              // For text-based elements
  combineResults: true,      // Merge overlapping detections
});
```

- Live Method Tracking - `CVActivityIndicatorService` tracks which CV methods are actively processing
- Performance Metrics - Real-time success rates, processing times, and execution statistics
- UI Integration - Server-Sent Events and REST endpoints for displaying active CV methods in the web interface
- Debug Telemetry - Comprehensive method execution history for optimization
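Once the stack is running, the monitoring endpoints listed below answer plain HTTP requests; a quick smoke test with curl, assuming the agent API is on its default port 9991:

```shell
# Snapshot of currently active CV methods
curl -s http://localhost:9991/cv-activity/status

# Quick active/inactive check
curl -s http://localhost:9991/cv-activity/active

# Stream live updates via Server-Sent Events (-N disables output buffering)
curl -sN http://localhost:9991/cv-activity/stream
```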
```
GET /cv-activity/status       # Current active methods snapshot
GET /cv-activity/active       # Quick active/inactive check
SSE /cv-activity/stream       # Real-time updates via Server-Sent Events
GET /cv-activity/performance  # Method performance statistics
```

The enhanced system outputs structured UniversalUIElement objects by fusing:
- Visual pattern detection (`VisualPatternDetectorService`) with OpenCV edge detection, CLAHE, and morphological operations
- OCR enrichment through `ElementDetectorService.detectElementsUniversal` with advanced preprocessing techniques
- Semantic analysis (`TextSemanticAnalyzerService`) for intent-based reasoning over raw UI text
- Multi-method fusion that combines template, feature, contour, and OCR detections for maximum reliability
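The authoritative UniversalUIElement schema lives in `@bytebot/shared`; the sketch below is a plausible minimal shape shown purely for illustration (every field name here is an assumption, not the published contract):

```typescript
// Illustrative shape only; consult @bytebot/shared for the real definition.
interface UniversalUIElement {
  role: "button" | "input" | "clickable"; // semantic role inferred by the pipeline
  text?: string;                          // OCR-enriched label, when available
  bounds: { x: number; y: number; width: number; height: number };
  confidence: number;                     // fused detection confidence in [0, 1]
  sources: string[];                      // which CV methods contributed
}

const example: UniversalUIElement = {
  role: "button",
  text: "Submit",
  bounds: { x: 412, y: 300, width: 96, height: 32 },
  confidence: 0.93,
  sources: ["template", "ocr"],
};

// A natural click target is the element's center point.
const center = {
  x: example.bounds.x + example.bounds.width / 2,
  y: example.bounds.y + example.bounds.height / 2,
};
```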
✅ Active Methods: Template matching, Feature detection (ORB/AKAZE), Morphological operations, CLAHE, Gaussian blur, Bilateral filtering, Canny edge detection, Contour analysis, Adaptive thresholding
✅ Preprocessing Pipeline: Multi-scale image processing, Noise reduction, Contrast enhancement, Edge enhancement, Shape analysis
✅ UI Automation Features: Button detection, Input field identification, Icon recognition, Text extraction, Clickable element mapping
NutService on the desktop daemon parses compound shortcuts such as `ctrl+shift+p`, mixed-case modifiers, and platform aliases (`cmd`, `option`, `win`). Legacy arrays like `['Control', 'Shift', 'X']` continue to work, but LLM tool calls can now emit compact strings and rely on the daemon to normalize, validate, and execute the correct nut-js sequence.
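NutService's parser is internal to the daemon, but the normalization step can be sketched roughly as follows (the alias table and function name are illustrative, not the daemon's actual code):

```typescript
// Illustrative normalizer: maps compact shortcut strings like "ctrl+shift+p"
// to canonical modifier/key names of the kind legacy arrays already use.
const ALIASES: Record<string, string> = {
  ctrl: "Control", control: "Control",
  shift: "Shift",
  alt: "Alt", option: "Alt",
  cmd: "Meta", command: "Meta", win: "Meta", meta: "Meta",
};

function parseShortcut(shortcut: string): string[] {
  return shortcut.split("+").map((part) => {
    const key = part.trim().toLowerCase();
    if (ALIASES[key]) return ALIASES[key]; // normalize platform aliases
    return key.length === 1 ? key.toUpperCase() : key; // single chars upper-case
  });
}
```

For example, `parseShortcut("cmd+option+x")` and the legacy array form resolve to the same canonical sequence, which is what lets LLM tool calls emit whichever is more compact.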
To boot the Hawkeye desktop, agent, UI, and Postgres services without the LiteLLM proxy, use the standard compose topology:
```
docker compose -f docker/docker-compose.yml build
docker compose -f docker/docker-compose.yml up -d
```

Default endpoints:

- Agent API – http://localhost:9991
- Web UI – http://localhost:9992
- Desktop noVNC – http://localhost:9990
If the UI reports `ECONNREFUSED` to 9991, ensure the agent container is healthy (`docker compose ps bytebot-agent`) and inspect its logs (`docker compose logs bytebot-agent`) for Prisma or DI failures.
The fastest way to try Hawkeye is the proxy-enabled Docker Compose stack: it starts the desktop, agent, UI, Postgres, and LiteLLM proxy with every precision upgrade flipped on. Populate `docker/.env` with your model keys and the Hawkeye-specific toggles before you launch. OpenRouter and LMStudio are first-class in the default LiteLLM config, so set the matching environment variables and make sure the aliases in `packages/bytebot-llm-proxy/litellm-config.yaml` point to models you can reach:
- `OPENROUTER_API_KEY` powers the `openrouter-*` aliases like `openrouter-claude-3.7-sonnet`.
- LMStudio examples such as `local-lmstudio-gemma-3-27b` expect your local server's `api_base` to match the running LMStudio instance.
- `BYTEBOT_GRID_OVERLAY=true` keeps the labeled coordinate grid on every capture.
- `BYTEBOT_PROGRESSIVE_ZOOM_USE_AI=true` enables the multi-zoom screenshot refinement.
- `BYTEBOT_SMART_FOCUS=true` and `BYTEBOT_SMART_FOCUS_MODEL=<litellm-alias>` route Smart Focus through the proxy model you configure in the LiteLLM config.
- `BYTEBOT_COORDINATE_METRICS=true` (plus optional `BYTEBOT_COORDINATE_DEBUG=true`) records the click accuracy telemetry that distinguishes the fork.
- `BYTEBOT_CV_ENHANCED_DETECTION=true` enables the comprehensive OpenCV 4.6.0+ pipeline with template, feature, and contour detection.
- `BYTEBOT_CV_ACTIVITY_TRACKING=true` activates real-time CV method monitoring and performance metrics for UI visibility.
```
cat <<'EOF' > docker/.env
# Provider keys for LiteLLM
OPENAI_API_KEY=sk-your-key
ANTHROPIC_API_KEY=...
OPENROUTER_API_KEY=...

# Hawkeye precision defaults
BYTEBOT_GRID_OVERLAY=true
BYTEBOT_PROGRESSIVE_ZOOM_USE_AI=true
BYTEBOT_SMART_FOCUS=true
BYTEBOT_SMART_FOCUS_MODEL=gpt-4o-mini
BYTEBOT_COORDINATE_METRICS=true

# Enhanced Computer Vision (OpenCV 4.6.0+)
BYTEBOT_CV_ENHANCED_DETECTION=true
BYTEBOT_CV_ACTIVITY_TRACKING=true
EOF
```

```
# Build images locally (default)
docker compose -f docker/docker-compose.proxy.yml up -d --build

# Prefer a different Postgres registry?
# export BYTEBOT_POSTGRES_IMAGE=postgres:16-alpine

# Or use pre-built registry images after authenticating:
# export BYTEBOT_DESKTOP_IMAGE=ghcr.io/bytebot-ai/bytebot-desktop:edge
# export BYTEBOT_AGENT_IMAGE=ghcr.io/bytebot-ai/bytebot-agent:edge
# export BYTEBOT_UI_IMAGE=ghcr.io/bytebot-ai/bytebot-ui:edge
# docker compose -f docker/docker-compose.proxy.yml up -d
```

Before you start the stack, edit `packages/bytebot-llm-proxy/litellm-config.yaml` so each alias maps to the OpenRouter endpoints or LMStudio bases you control. After saving changes, restart the bytebot-llm-proxy container (`docker compose restart bytebot-llm-proxy`) to reload the updated routing.
Looking for a different hosting environment? Follow the upstream guides for the full walkthroughs.
For a full tour of the core desktop agent, installation options, and API surface, follow the upstream README and docs. Hawkeye inherits everything there—virtual desktop orchestration, task APIs, and deployment guides—so this fork focuses documentation on the precision tooling and measurement upgrades described above.
Smart click telemetry now records the real cursor landing position. Tune the pass/fail threshold by setting an environment variable on the desktop daemon:
```
export BYTEBOT_SMART_CLICK_SUCCESS_RADIUS=12  # pixels of acceptable drift
```

Increase the value if the VNC stream or hardware introduces more cursor drift, or decrease it to tighten the definition of a successful AI-guided click.
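The pass/fail judgment implied by a radius threshold is presumably a simple Euclidean distance check between the intended target and the recorded landing position; a minimal sketch (function name hypothetical):

```typescript
// Hypothetical sketch: a click passes if the actual cursor landing position
// falls within BYTEBOT_SMART_CLICK_SUCCESS_RADIUS pixels of the target.
function isSuccessfulClick(
  target: { x: number; y: number },
  landed: { x: number; y: number },
  radiusPx = 12, // default mirrors the example above
): boolean {
  return Math.hypot(landed.x - target.x, landed.y - target.y) <= radiusPx;
}
```

Under this model, a wider radius accepts more drift from the VNC stream, while a tighter one demands near-pixel-perfect landings before telemetry counts the attempt as a success.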


