Screenshot after actions #1483

tkattkat · 2025-12-29T21:47:34Z

Why

Currently, we only provide screenshots to the agent when it explicitly calls the screenshot tool.

This is wasteful on both tokens and latency as it requires the LLM to make an additional tool call to get a new screenshot in situations where we can reliably assume it will need one — like after clicking, typing, scrolling, or waiting for content to load.

What Changed

Screenshot Feedback After Vision Actions

- click, type, dragAndDrop, fillFormVision, and scroll tools now automatically return a screenshot alongside the tool result
- wait tool returns a screenshot in hybrid mode only (since it's the mode that relies on visual feedback)
- Added screenshotHandler, which abstracts the delay and screenshot handling into a shared util function
- Updated the action handler to parse base64 data from stored actions

Standardized Post Action Delay

- Removed Google-specific delay logic
- Added a universal 500ms delay before capturing screenshots, giving the page time to settle after actions (some tools override this with their own delay, typically 0, because they do not need to wait after acting)

Message Compression Updates

- All vision action tool screenshots now share the same "2 most recent" compression pool as the screenshot tool
- Older screenshots are stripped from message history while preserving text results
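
As a rough illustration of the shared "2 most recent" pool, here is a minimal sketch. The `ToolResult` shape, the `VISION_TOOLS` set, and `keepRecentVisionImages` are hypothetical names invented for this example; the real logic lives in messageProcessing.ts and operates on richer message objects.

```typescript
// Hypothetical, simplified tool result; not the PR's actual message type.
type ToolResult = { toolName: string; text: string; imageBase64?: string };

// Assumed set of tools that share one compression pool.
const VISION_TOOLS = new Set([
  "screenshot", "click", "type", "dragAndDrop", "fillFormVision", "scroll", "wait",
]);

// Walk from newest to oldest, keep the image on the `keep` most recent
// vision results, and drop the image (but not the text) on older ones.
function keepRecentVisionImages(results: ToolResult[], keep = 2): ToolResult[] {
  let kept = 0;
  const out = results.slice();
  for (let i = out.length - 1; i >= 0; i--) {
    const r = out[i];
    if (!VISION_TOOLS.has(r.toolName) || r.imageBase64 === undefined) continue;
    if (kept < keep) {
      kept++;
      continue;
    }
    out[i] = { toolName: r.toolName, text: r.text };
  }
  return out;
}
```

Because every vision tool counts against the same budget, a click followed by two type calls leaves only the two type screenshots in history, while the click's text result stays.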

Test Plan

- Tested locally with hybrid mode agent
- Verified screenshots are returned after each vision action
- Confirmed message compression correctly keeps only the 2 most recent screenshots across all vision tools

Summary by cubic

Automatically returns a screenshot after vision actions so the agent gets instant visual feedback without extra screenshot tool calls, reducing tokens and latency.

New Features
- click, type, dragAndDrop, scroll, and fillFormVision now return a screenshot; wait returns one in hybrid mode.
- Message compression keeps only the 2 most recent vision items (screenshots or vision-action outputs); older images are dropped.
- Tools emit media via toModelOutput so the model receives image + text together.
Refactors
- Added waitAndCaptureScreenshot with a default 500ms post-action delay; removed Google-specific delays.
- Excluded screenshotBase64 from recorded actions; updated action/message mapping to strip large fields.
- Introduced typed tool results and wired wait(v3, mode) in createAgentTools for hybrid-only screenshot behavior.
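
To make the "image + text together" idea concrete, here is a minimal sketch of a toModelOutput-style mapper for a click-like tool. The `ClickResult` fields other than `success` and `screenshotBase64`, the `ContentPart` shape, and the `image/png` media type are assumptions for illustration, not the PR's actual types. The sketch branches on `success` rather than on screenshot presence, following the review feedback on this PR.

```typescript
// Hypothetical typed result for a click-like tool.
type ClickResult = {
  success: boolean;
  describe?: string;
  coordinates?: [number, number];
  error?: string;
  screenshotBase64?: string;
};

// Assumed content-part shape; the real model output format may differ.
type ContentPart =
  | { type: "text"; text: string }
  | { type: "media"; data: string; mediaType: string };

function toModelOutput(result: ClickResult): ContentPart[] {
  // Choose the text payload from `success`, so a failed screenshot
  // never hides the details of a successful action.
  const text = result.success
    ? JSON.stringify({
        success: true,
        describe: result.describe,
        coordinates: result.coordinates,
      })
    : JSON.stringify({ success: false, error: result.error });

  const parts: ContentPart[] = [{ type: "text", text }];

  // Attach the image only when a screenshot was actually captured.
  if (result.screenshotBase64 !== undefined) {
    parts.push({
      type: "media",
      data: result.screenshotBase64,
      mediaType: "image/png", // assumed format
    });
  }
  return parts;
}
```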

Written for commit 93faeb8. Summary will update automatically on new commits.

changeset-bot · 2025-12-29T21:47:37Z

🦋 Changeset detected

Latest commit: 93faeb8

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 3 packages

Name	Type
@browserbasehq/stagehand	Patch
@browserbasehq/stagehand-evals	Patch
@browserbasehq/stagehand-server	Patch

greptile-apps · 2025-12-29T21:49:57Z

Greptile Summary

This PR implements automatic screenshot feedback after vision-based agent actions to reduce latency and token usage by eliminating the need for explicit screenshot tool calls.

Key Changes:

- Vision action tools (click, type, dragAndDrop, fillFormVision, scroll) now automatically return screenshots alongside tool results
- wait tool returns screenshots only in hybrid mode where visual feedback is needed
- Created centralized screenshotHandler utility with error handling and configurable delays
- Standardized 500ms post-action delay (with overrides for specific tools) to allow pages to settle
- Extended message compression to treat vision action screenshots the same as explicit screenshot tool calls, keeping only the 2 most recent
- Added stripExcludedKeys in action mapping to prevent screenshotBase64 from bloating the actions array
- Removed Google-specific delay logic in favor of universal approach
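
A minimal sketch of what a key-stripping step like stripExcludedKeys might look like. Only the function name and the `screenshotBase64` key come from the summary above; the signature, the `EXCLUDED_KEYS` set, and the flat (non-recursive) behavior are assumptions.

```typescript
// Assumed list of keys that are too large to keep on recorded actions.
const EXCLUDED_KEYS = new Set(["screenshotBase64"]);

// Return a shallow copy of a tool output without the excluded keys.
// The input object is left untouched.
function stripExcludedKeys(
  output: Record<string, unknown>,
): Record<string, unknown> {
  const cleaned: Record<string, unknown> = {};
  for (const key of Object.keys(output)) {
    if (!EXCLUDED_KEYS.has(key)) {
      cleaned[key] = output[key];
    }
  }
  return cleaned;
}
```

Stripping at the mapping layer keeps the model-facing result intact (the screenshot still reaches the model) while the stored actions array stays small.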

Implementation Quality:
The implementation is well-structured with proper TypeScript typing, error handling (screenshot failures don't fail the action), and consistent toModelOutput implementations across all tools. The message compression correctly preserves text results while removing image data from older actions.

Confidence Score: 5/5

This PR is safe to merge with minimal risk
The changes are well-architected with proper error handling, type safety, and consistent patterns across all tools. The screenshot capture is wrapped in try-catch blocks to prevent action failures, and the message compression logic correctly handles both new and existing vision data. The removal of Google-specific delays in favor of a universal approach is a good simplification.
No files require special attention

Important Files Changed

Filename	Overview
packages/core/lib/v3/agent/utils/screenshotHandler.ts	New utility function centralizing screenshot capture logic with error handling
packages/core/lib/v3/agent/tools/click.ts	Removed Google-specific delay; added automatic screenshot capture and toModelOutput implementation
packages/core/lib/v3/agent/tools/type.ts	Removed Google-specific delay; added screenshot capture after typing and toModelOutput implementation
packages/core/lib/v3/agent/utils/messageProcessing.ts	Extended compression to handle vision action tools alongside screenshot tool
packages/core/lib/v3/agent/utils/actionMapping.ts	Added logic to strip screenshotBase64 from action outputs to prevent data bloat
packages/core/lib/v3/agent/tools/wait.ts	Added conditional screenshot capture for hybrid mode and toModelOutput implementation

Sequence Diagram

sequenceDiagram
    participant Agent as AI Agent
    participant Tool as Vision Tool (click/type/etc)
    participant Handler as screenshotHandler
    participant Page as Browser Page
    participant Compress as messageProcessing

    Agent->>Tool: execute action (e.g., click)
    Tool->>Page: perform action (click/type/scroll)
    Tool->>Handler: waitAndCaptureScreenshot(page, delay)
    Handler->>Page: waitForTimeout(500ms)
    Handler->>Page: screenshot()
    Page-->>Handler: buffer
    Handler-->>Tool: base64 string (or undefined on error)
    Tool-->>Agent: {success, ...result, screenshotBase64}
    
    Note over Tool,Agent: toModelOutput formats response
    Tool->>Agent: {text: result, media: screenshot}
    
    Agent->>Compress: processMessages(messages)
    Note over Compress: Keeps only 2 most recent<br/>vision results (screenshots<br/>+ vision actions)
    Compress->>Compress: identify vision tools
    Compress->>Compress: strip media from old results
    Compress-->>Agent: compressed messages

cubic-dev-ai

5 issues found across 12 files

Prompt for AI agents (all issues)


Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="packages/core/lib/v3/agent/tools/type.ts">

<violation number="1" location="packages/core/lib/v3/agent/tools/type.ts:92">
P2: The `toModelOutput` condition incorrectly uses `screenshotBase64` presence to determine output format. Since `waitAndCaptureScreenshot` can return `undefined` on screenshot failure (without throwing), a successful typing operation may lose `describe` and `text` fields in the model output. Consider using `result.success` as the discriminant instead.</violation>
</file>

<file name="packages/core/lib/v3/agent/tools/click.ts">

<violation number="1" location="packages/core/lib/v3/agent/tools/click.ts:86">
P2: The `toModelOutput` condition checks `result.screenshotBase64` to determine success vs error output, but `waitAndCaptureScreenshot` can return `undefined` when the screenshot fails even though the click succeeded. This would cause a successful click with a failed screenshot to output `{success: true, error: undefined}`, losing the `describe` and `coordinates` data. Consider checking `result.success` instead.</violation>
</file>

<file name="packages/core/lib/v3/agent/tools/fillFormVision.ts">

<violation number="1" location="packages/core/lib/v3/agent/tools/fillFormVision.ts:134">
P2: Using `screenshotBase64` presence to determine success/error output is incorrect. When the form fill succeeds but screenshot capture fails, this returns `{ success: true, error: undefined }` instead of `{ success: true, fieldsCount: N }`. Consider basing the text content on `result.success` instead.</violation>
</file>

<file name="packages/core/lib/v3/agent/tools/dragAndDrop.ts">

<violation number="1" location="packages/core/lib/v3/agent/tools/dragAndDrop.ts:104">
P2: Using `result.screenshotBase64` as the condition may cause the `describe` field to be lost from output when a successful drag action fails to capture a screenshot. According to the `waitAndCaptureScreenshot` utility, it can return `undefined` on failure instead of throwing. Consider checking `result.success` instead to ensure proper field selection.</violation>
</file>

<file name="packages/core/lib/v3/agent/utils/messageProcessing.ts">

<violation number="1" location="packages/core/lib/v3/agent/utils/messageProcessing.ts:90">
P2: If a message contains both screenshot and vision action parts, only the screenshot gets compressed due to the if-else branch. Both compression functions should be called since they only process their respective part types and are safe to call on any message.</violation>
</file>
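
The messageProcessing.ts violation above can be illustrated with a small sketch. The `Part` and `Message` shapes and all function names here are hypothetical; the point is only that both passes run unconditionally, since each one touches just its own part kind.

```typescript
// Hypothetical message parts; real agent messages carry more structure.
type Part = {
  kind: "screenshot" | "visionAction" | "text";
  text: string;
  imageBase64?: string;
};
type Message = { parts: Part[] };

// Drop image data from parts of one kind, leaving every other part as-is.
function stripImages(message: Message, kind: Part["kind"]): Message {
  return {
    parts: message.parts.map((part) =>
      part.kind === kind ? { kind: part.kind, text: part.text } : part,
    ),
  };
}

const compressScreenshots = (m: Message): Message => stripImages(m, "screenshot");
const compressVisionActions = (m: Message): Message => stripImages(m, "visionAction");

// Run both passes in sequence rather than picking one in an if/else,
// so a message containing both kinds of part is fully compressed.
function compressOldMessage(message: Message): Message {
  return compressVisionActions(compressScreenshots(message));
}
```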

pirate · 2025-12-29T22:52:56Z

packages/core/lib/v3/agent/utils/screenshotHandler.ts

+export async function waitAndCaptureScreenshot(
+  page: Page,
+  delayMs: number = DEFAULT_DELAY_MS,
+): Promise<string | undefined> {
+  if (delayMs > 0) {
+    await page.waitForTimeout(delayMs);
+  }
+
+  try {
+    const buffer = await page.screenshot({ fullPage: false });
+    return buffer.toString("base64");
+  } catch {
+    return undefined;
+  }
+}


do we need this helper? I'd just inline it and make page.waitForTimeout a no-op when delay is 0

I originally inlined it, but it started bothering me after using it in 6 places with a try/catch for each, so I decided to make it a helper
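
To show the contract the helper gives its callers (wait, capture, and return undefined instead of throwing), here is a self-contained sketch that runs against a stub page. `PageLike` and `makeStubPage` are hypothetical stand-ins for the real Page type, and the 500 value for `DEFAULT_DELAY_MS` is taken from the PR description rather than from the diff itself.

```typescript
// Minimal page surface assumed by this sketch; the real Page has more.
interface PageLike {
  waitForTimeout(ms: number): Promise<void>;
  screenshot(options: {
    fullPage: boolean;
  }): Promise<{ toString(encoding: string): string }>;
}

const DEFAULT_DELAY_MS = 500;

async function waitAndCaptureScreenshot(
  page: PageLike,
  delayMs: number = DEFAULT_DELAY_MS,
): Promise<string | undefined> {
  if (delayMs > 0) {
    await page.waitForTimeout(delayMs);
  }
  try {
    const buffer = await page.screenshot({ fullPage: false });
    return buffer.toString("base64");
  } catch {
    // A failed capture must not fail the action that preceded it.
    return undefined;
  }
}

// Stub page that records requested waits and can simulate capture failure.
function makeStubPage(failScreenshot: boolean): PageLike & { waits: number[] } {
  const waits: number[] = [];
  return {
    waits,
    async waitForTimeout(ms: number): Promise<void> {
      waits.push(ms); // record instead of sleeping
    },
    async screenshot() {
      if (failScreenshot) throw new Error("capture failed");
      return { toString: () => "aGVsbG8=" }; // base64 of "hello"
    },
  };
}
```

A tool that does not need to wait can pass 0 as `delayMs`, which skips the wait entirely; callers only need to handle a possibly undefined screenshot.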

tkattkat added 9 commits December 29, 2025 10:27

screenshot after actions

2f8428f

add screenshot after wait

50cd0f0

Merge remote-tracking branch 'origin/main' into screenshot-after-actions

6000f67

update to use waitforTimeout

fcfb9d6

merge main - preserve screenshot and caching logic

e6b2874

rm

76faa88

revert example changes

a6424d6

rm

044af80

remove base64 from actions, abstract screenshot handling

ac17613

tkattkat added 2 commits December 29, 2025 13:49

remove silly comments

a07f708

changeset

9555ad2

cubic-dev-ai bot reviewed Dec 29, 2025

tkattkat added 2 commits December 29, 2025 14:05

address cubic review comments

69bc422

format

93faeb8

pirate reviewed Dec 29, 2025

pirate approved these changes Dec 29, 2025

tkattkat merged commit 16d72fb into main Dec 29, 2025
35 of 36 checks passed

This was referenced Dec 29, 2025

Version Packages #1479

Open

Version Packages CloudEngineHub/stagehand#1

Open

Version Packages ahmedtwine/stagehand#1

Open
