@tkattkat tkattkat commented Dec 29, 2025

Why

Currently, we only provide screenshots to the agent when it explicitly calls the screenshot tool.

This wastes both tokens and latency: the LLM has to make an additional tool call to get a new screenshot in situations where we can reliably assume it will need one, such as after clicking, typing, scrolling, or waiting for content to load.

What Changed

Screenshot Feedback After Vision Actions

  • click, type, dragAndDrop, fillFormVision, and scroll tools now automatically return a screenshot alongside the tool result
  • wait tool returns a screenshot in hybrid mode only (since it's the mode that relies on visual feedback)
  • added screenshotHandler which abstracts handling delay + screenshot to a shared util function
  • updated action handler to strip base64 screenshot data from stored actions

Standardized Post Action Delay

  • Removed Google specific delay logic
  • Added a universal 500ms delay before capturing screenshots, giving the page time to settle after actions (some tools override this with their own delay, typically 0, because no wait is needed after those actions)
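
To make the action, delay, and capture flow concrete, here is a minimal sketch. It is not the PR's code: `PageLike`, `captureBase64`, `VisionActionResult`, and `runVisionAction` are hypothetical names chosen for illustration. Only the 500ms default, the per-tool override (typically 0), and the idea that a failed capture should not fail the action come from this PR.

```typescript
// Hypothetical minimal page surface for illustration; not the project's Page type.
interface PageLike {
  waitForTimeout(ms: number): Promise<void>;
  captureBase64(): Promise<string>;
}

// Hypothetical result shape: the screenshot is optional because capture may fail.
interface VisionActionResult {
  success: boolean;
  error?: string;
  screenshotBase64?: string;
}

const DEFAULT_DELAY_MS = 500; // universal post-action delay from the description

async function runVisionAction(
  page: PageLike,
  action: () => Promise<void>,
  delayMs: number = DEFAULT_DELAY_MS,
): Promise<VisionActionResult> {
  try {
    await action();
  } catch (err) {
    return { success: false, error: String(err) };
  }

  // A tool that needs no settling time passes 0 and skips the wait entirely.
  if (delayMs > 0) {
    await page.waitForTimeout(delayMs);
  }

  try {
    return { success: true, screenshotBase64: await page.captureBase64() };
  } catch {
    // A failed capture must not turn a successful action into a failure.
    return { success: true };
  }
}
```

A tool that does not need the page to settle would call `runVisionAction(page, action, 0)`; every other tool relies on the default.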

Message Compression Updates

  • All vision action tool screenshots now share the same "2 most recent" compression pool as the screenshot tool
  • Older screenshots are stripped from message history while preserving text results
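
The shared "2 most recent" pool can be sketched roughly as below. The message and part shapes (`ToolMessage`, `Part`) and the function name `compressVisionMessages` are assumptions made for this example, not the structures used in messageProcessing.ts; the tool names are the ones listed in this PR.

```typescript
// Hypothetical, simplified message shapes for illustration only.
type Part =
  | { type: "text"; text: string }
  | { type: "image"; data: string };

interface ToolMessage {
  toolName: string;
  parts: Part[];
}

// Tools whose screenshots share one compression pool (names from this PR).
const VISION_TOOLS = new Set([
  "screenshot", "click", "type", "dragAndDrop", "fillFormVision", "scroll", "wait",
]);

// Keep images only on the `keep` most recent vision results; older results
// lose their image parts but keep their text parts.
function compressVisionMessages(messages: ToolMessage[], keep = 2): ToolMessage[] {
  const imageIdx: number[] = [];
  messages.forEach((m, i) => {
    if (VISION_TOOLS.has(m.toolName) && m.parts.some((p) => p.type === "image")) {
      imageIdx.push(i);
    }
  });
  // slice(-0) would return everything, so guard the keep === 0 case.
  const keepIdx = new Set(keep > 0 ? imageIdx.slice(-keep) : []);

  return messages.map((m, i) =>
    imageIdx.includes(i) && !keepIdx.has(i)
      ? { ...m, parts: m.parts.filter((p) => p.type !== "image") }
      : m,
  );
}
```

The function returns new message objects for stripped entries and leaves the input array untouched, which keeps it safe to call on every turn.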

Test Plan

  • Tested locally with hybrid mode agent
  • Verified screenshots are returned after each vision action
  • Confirmed message compression correctly keeps only the 2 most recent screenshots across all vision tools

Summary by cubic

Automatically returns a screenshot after vision actions so the agent gets instant visual feedback without extra screenshot tool calls, reducing tokens and latency.

  • New Features

    • click, type, dragAndDrop, scroll, and fillFormVision now return a screenshot; wait returns one in hybrid mode.
    • Message compression keeps only the 2 most recent vision items (screenshots or vision-action outputs); older images are dropped.
    • Tools emit media via toModelOutput so the model receives image + text together.
  • Refactors

    • Added waitAndCaptureScreenshot with a default 500ms post-action delay; removed Google-specific delays.
    • Excluded screenshotBase64 from recorded actions; updated action/message mapping to strip large fields.
    • Introduced typed tool results and wired wait(v3, mode) in createAgentTools for hybrid-only screenshot behavior.
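
As an illustration of emitting image and text together, the sketch below builds model output for a click result and branches on `success` rather than on whether a screenshot is present, so a failed capture does not drop the text fields. The `ClickResult` and `OutputPart` shapes and the `image/png` media type are assumptions for this example; the real `toModelOutput` return type is not shown in this conversation.

```typescript
// Hypothetical result and output shapes; not the PR's actual types.
interface ClickResult {
  success: boolean;
  describe?: string;
  coordinates?: [number, number];
  error?: string;
  screenshotBase64?: string;
}

type OutputPart =
  | { type: "text"; text: string }
  | { type: "media"; mediaType: string; data: string };

function toModelOutput(result: ClickResult): OutputPart[] {
  // Branch on success, not on screenshot presence.
  if (!result.success) {
    return [{ type: "text", text: JSON.stringify({ success: false, error: result.error }) }];
  }

  const parts: OutputPart[] = [
    {
      type: "text",
      text: JSON.stringify({
        success: true,
        describe: result.describe,
        coordinates: result.coordinates,
      }),
    },
  ];

  // The image is attached only when capture succeeded (media type assumed).
  if (result.screenshotBase64) {
    parts.push({ type: "media", mediaType: "image/png", data: result.screenshotBase64 });
  }
  return parts;
}
```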

Written for commit 93faeb8. Summary will update automatically on new commits.

changeset-bot bot commented Dec 29, 2025

🦋 Changeset detected

Latest commit: 93faeb8

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 3 packages
Name                              Type
@browserbasehq/stagehand          Patch
@browserbasehq/stagehand-evals    Patch
@browserbasehq/stagehand-server   Patch

greptile-apps bot commented Dec 29, 2025

Greptile Summary

This PR implements automatic screenshot feedback after vision-based agent actions to reduce latency and token usage by eliminating the need for explicit screenshot tool calls.

Key Changes:

  • Vision action tools (click, type, dragAndDrop, fillFormVision, scroll) now automatically return screenshots alongside tool results
  • wait tool returns screenshots only in hybrid mode where visual feedback is needed
  • Created centralized screenshotHandler utility with error handling and configurable delays
  • Standardized 500ms post-action delay (with overrides for specific tools) to allow pages to settle
  • Extended message compression to treat vision action screenshots the same as explicit screenshot tool calls, keeping only the 2 most recent
  • Added stripExcludedKeys in action mapping to prevent screenshotBase64 from bloating the actions array
  • Removed Google-specific delay logic in favor of universal approach
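
For the action-mapping change, a key-stripping helper could look like the sketch below. The name `stripExcludedKeys` comes from the summary above, but the body, the signature, and the `EXCLUDED_KEYS` constant are guesses made for illustration, not the code in actionMapping.ts.

```typescript
// Assumed exclusion list; the summary only mentions screenshotBase64.
const EXCLUDED_KEYS: ReadonlySet<string> = new Set(["screenshotBase64"]);

// Return a shallow copy of a tool output without the excluded keys, so large
// base64 payloads never reach the recorded actions array.
function stripExcludedKeys(
  output: Record<string, unknown>,
  excluded: ReadonlySet<string> = EXCLUDED_KEYS,
): Record<string, unknown> {
  const result: Record<string, unknown> = {};
  for (const key of Object.keys(output)) {
    if (!excluded.has(key)) {
      result[key] = output[key];
    }
  }
  return result;
}
```

Copying into a new object rather than deleting keys in place means the tool result passed to the model still carries its screenshot.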

Implementation Quality:
The implementation is well-structured with proper TypeScript typing, error handling (screenshot failures don't fail the action), and consistent toModelOutput implementations across all tools. The message compression correctly preserves text results while removing image data from older actions.

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The changes are well-architected with proper error handling, type safety, and consistent patterns across all tools. The screenshot capture is wrapped in try-catch blocks to prevent action failures, and the message compression logic correctly handles both new and existing vision data. The removal of Google-specific delays in favor of a universal approach is a good simplification.
  • No files require special attention

Important Files Changed

Filename | Overview
packages/core/lib/v3/agent/utils/screenshotHandler.ts | New utility function centralizing screenshot capture logic with error handling
packages/core/lib/v3/agent/tools/click.ts | Removed Google-specific delay; added automatic screenshot capture and toModelOutput implementation
packages/core/lib/v3/agent/tools/type.ts | Removed Google-specific delay; added screenshot capture after typing and toModelOutput implementation
packages/core/lib/v3/agent/utils/messageProcessing.ts | Extended compression to handle vision action tools alongside screenshot tool
packages/core/lib/v3/agent/utils/actionMapping.ts | Added logic to strip screenshotBase64 from action outputs to prevent data bloat
packages/core/lib/v3/agent/tools/wait.ts | Added conditional screenshot capture for hybrid mode and toModelOutput implementation

Sequence Diagram

sequenceDiagram
    participant Agent as AI Agent
    participant Tool as Vision Tool (click/type/etc)
    participant Handler as screenshotHandler
    participant Page as Browser Page
    participant Compress as messageProcessing

    Agent->>Tool: execute action (e.g., click)
    Tool->>Page: perform action (click/type/scroll)
    Tool->>Handler: waitAndCaptureScreenshot(page, delay)
    Handler->>Page: waitForTimeout(500ms)
    Handler->>Page: screenshot()
    Page-->>Handler: buffer
    Handler-->>Tool: base64 string (or undefined on error)
    Tool-->>Agent: {success, ...result, screenshotBase64}
    
    Note over Tool,Agent: toModelOutput formats response
    Tool->>Agent: {text: result, media: screenshot}
    
    Agent->>Compress: processMessages(messages)
    Note over Compress: Keeps only 2 most recent<br/>vision results (screenshots<br/>+ vision actions)
    Compress->>Compress: identify vision tools
    Compress->>Compress: strip media from old results
    Compress-->>Agent: compressed messages

@cubic-dev-ai cubic-dev-ai bot left a comment

5 issues found across 12 files

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="packages/core/lib/v3/agent/tools/type.ts">

<violation number="1" location="packages/core/lib/v3/agent/tools/type.ts:92">
P2: The `toModelOutput` condition incorrectly uses `screenshotBase64` presence to determine output format. Since `waitAndCaptureScreenshot` can return `undefined` on screenshot failure (without throwing), a successful typing operation may lose `describe` and `text` fields in the model output. Consider using `result.success` as the discriminant instead.</violation>
</file>

<file name="packages/core/lib/v3/agent/tools/click.ts">

<violation number="1" location="packages/core/lib/v3/agent/tools/click.ts:86">
P2: The `toModelOutput` condition checks `result.screenshotBase64` to determine success vs error output, but `waitAndCaptureScreenshot` can return `undefined` when the screenshot fails even though the click succeeded. This would cause a successful click with a failed screenshot to output `{success: true, error: undefined}`, losing the `describe` and `coordinates` data. Consider checking `result.success` instead.</violation>
</file>

<file name="packages/core/lib/v3/agent/tools/fillFormVision.ts">

<violation number="1" location="packages/core/lib/v3/agent/tools/fillFormVision.ts:134">
P2: Using `screenshotBase64` presence to determine success/error output is incorrect. When the form fill succeeds but screenshot capture fails, this returns `{ success: true, error: undefined }` instead of `{ success: true, fieldsCount: N }`. Consider basing the text content on `result.success` instead.</violation>
</file>

<file name="packages/core/lib/v3/agent/tools/dragAndDrop.ts">

<violation number="1" location="packages/core/lib/v3/agent/tools/dragAndDrop.ts:104">
P2: Using `result.screenshotBase64` as the condition may cause the `describe` field to be lost from output when a successful drag action fails to capture a screenshot. According to the `waitAndCaptureScreenshot` utility, it can return `undefined` on failure instead of throwing. Consider checking `result.success` instead to ensure proper field selection.</violation>
</file>

<file name="packages/core/lib/v3/agent/utils/messageProcessing.ts">

<violation number="1" location="packages/core/lib/v3/agent/utils/messageProcessing.ts:90">
P2: If a message contains both screenshot and vision action parts, only the screenshot gets compressed due to the if-else branch. Both compression functions should be called since they only process their respective part types and are safe to call on any message.</violation>
</file>
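
The messageProcessing.ts finding can be illustrated with a small sketch: when each compression function only touches its own part type, both can be applied to the same message unconditionally instead of through an if-else. All names and shapes here (`AgentMessage`, `ToolResultPart`, `compressScreenshotParts`, `compressVisionActionParts`, `compressOldMessage`) are hypothetical and are not taken from the PR's code.

```typescript
// Hypothetical, simplified shapes for illustration only.
interface ToolResultPart {
  toolName: string;
  text: string;
  image?: string;
}

interface AgentMessage {
  parts: ToolResultPart[];
}

const VISION_ACTION_TOOLS = new Set([
  "click", "type", "dragAndDrop", "fillFormVision", "scroll", "wait",
]);

// Copy a part without its image, keeping the text result.
function withoutImage(part: ToolResultPart): ToolResultPart {
  return { toolName: part.toolName, text: part.text };
}

// Only touches parts produced by the screenshot tool.
function compressScreenshotParts(msg: AgentMessage): AgentMessage {
  return { parts: msg.parts.map((p) => (p.toolName === "screenshot" ? withoutImage(p) : p)) };
}

// Only touches parts produced by vision action tools.
function compressVisionActionParts(msg: AgentMessage): AgentMessage {
  return { parts: msg.parts.map((p) => (VISION_ACTION_TOOLS.has(p.toolName) ? withoutImage(p) : p)) };
}

// Apply both passes; a message holding both part types is fully compressed.
function compressOldMessage(msg: AgentMessage): AgentMessage {
  return compressVisionActionParts(compressScreenshotParts(msg));
}
```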

Comment on lines +17 to +31
export async function waitAndCaptureScreenshot(
  page: Page,
  delayMs: number = DEFAULT_DELAY_MS,
): Promise<string | undefined> {
  if (delayMs > 0) {
    await page.waitForTimeout(delayMs);
  }

  try {
    const buffer = await page.screenshot({ fullPage: false });
    return buffer.toString("base64");
  } catch {
    return undefined;
  }
}
Member

do we need this helper? I'd just inline it and make page.waitForTimeout a no-op when delay is 0

Collaborator Author

I originally inlined it, but it started bothering me after using it in 6 places with a try/catch for each, so I decided to make it a helper
