-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Screenshot after actions #1483
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Screenshot after actions #1483
Conversation
🦋 Changeset detectedLatest commit: 93faeb8 The changes in this PR will be included in the next version bump. This PR includes changesets to release 3 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
Greptile SummaryThis PR implements automatic screenshot feedback after vision-based agent actions to reduce latency and token usage by eliminating the need for explicit Key Changes:
Implementation Quality: Confidence Score: 5/5
Important Files Changed
Sequence DiagramsequenceDiagram
participant Agent as AI Agent
participant Tool as Vision Tool (click/type/etc)
participant Handler as screenshotHandler
participant Page as Browser Page
participant Compress as messageProcessing
Agent->>Tool: execute action (e.g., click)
Tool->>Page: perform action (click/type/scroll)
Tool->>Handler: waitAndCaptureScreenshot(page, delay)
Handler->>Page: waitForTimeout(500ms)
Handler->>Page: screenshot()
Page-->>Handler: buffer
Handler-->>Tool: base64 string (or undefined on error)
Tool-->>Agent: {success, ...result, screenshotBase64}
Note over Tool,Agent: toModelOutput formats response
Tool->>Agent: {text: result, media: screenshot}
Agent->>Compress: processMessages(messages)
Note over Compress: Keeps only 2 most recent<br/>vision results (screenshots<br/>+ vision actions)
Compress->>Compress: identify vision tools
Compress->>Compress: strip media from old results
Compress-->>Agent: compressed messages
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
5 issues found across 12 files
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="packages/core/lib/v3/agent/tools/type.ts">
<violation number="1" location="packages/core/lib/v3/agent/tools/type.ts:92">
P2: The `toModelOutput` condition incorrectly uses `screenshotBase64` presence to determine output format. Since `waitAndCaptureScreenshot` can return `undefined` on screenshot failure (without throwing), a successful typing operation may lose `describe` and `text` fields in the model output. Consider using `result.success` as the discriminant instead.</violation>
</file>
<file name="packages/core/lib/v3/agent/tools/click.ts">
<violation number="1" location="packages/core/lib/v3/agent/tools/click.ts:86">
P2: The `toModelOutput` condition checks `result.screenshotBase64` to determine success vs error output, but `waitAndCaptureScreenshot` can return `undefined` when the screenshot fails even though the click succeeded. This would cause a successful click with a failed screenshot to output `{success: true, error: undefined}`, losing the `describe` and `coordinates` data. Consider checking `result.success` instead.</violation>
</file>
<file name="packages/core/lib/v3/agent/tools/fillFormVision.ts">
<violation number="1" location="packages/core/lib/v3/agent/tools/fillFormVision.ts:134">
P2: Using `screenshotBase64` presence to determine success/error output is incorrect. When the form fill succeeds but screenshot capture fails, this returns `{ success: true, error: undefined }` instead of `{ success: true, fieldsCount: N }`. Consider basing the text content on `result.success` instead.</violation>
</file>
<file name="packages/core/lib/v3/agent/tools/dragAndDrop.ts">
<violation number="1" location="packages/core/lib/v3/agent/tools/dragAndDrop.ts:104">
P2: Using `result.screenshotBase64` as the condition may cause the `describe` field to be lost from output when a successful drag action fails to capture a screenshot. According to the `waitAndCaptureScreenshot` utility, it can return `undefined` on failure instead of throwing. Consider checking `result.success` instead to ensure proper field selection.</violation>
</file>
<file name="packages/core/lib/v3/agent/utils/messageProcessing.ts">
<violation number="1" location="packages/core/lib/v3/agent/utils/messageProcessing.ts:90">
P2: If a message contains both screenshot and vision action parts, only the screenshot gets compressed due to the if-else branch. Both compression functions should be called since they only process their respective part types and are safe to call on any message.</violation>
</file>
Reply to cubic to teach it or ask questions. Tag @cubic-dev-ai to re-run a review.
| export async function waitAndCaptureScreenshot( | ||
| page: Page, | ||
| delayMs: number = DEFAULT_DELAY_MS, | ||
| ): Promise<string | undefined> { | ||
| if (delayMs > 0) { | ||
| await page.waitForTimeout(delayMs); | ||
| } | ||
|
|
||
| try { | ||
| const buffer = await page.screenshot({ fullPage: false }); | ||
| return buffer.toString("base64"); | ||
| } catch { | ||
| return undefined; | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need this helper? I'd just inline it and make page.waitForTimeout a no-op when delay is 0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I originally inlined it, but it started bothering me after using it in 6 places with a try catch for each so decided to make it a helper
Why
Currently, we only provide screenshots to the agent when it explicitly calls the
screenshottool.This is wasteful on both tokens and latency as it requires the LLM to make an additional tool call to get a new screenshot in situations where we can reliably assume it will need one — like after clicking, typing, scrolling, or waiting for content to load.
What Changed
Screenshot Feedback After Vision Actions
click,type,dragAndDrop,fillFormVision, andscrolltools now automatically return a screenshot alongside the tool resultwaittool returns a screenshot in hybrid mode only (since it's the mode that relies on visual feedback)Standardized Post Action Delay
Message Compression Updates
screenshottoolTest Plan
Summary by cubic
Automatically returns a screenshot after vision actions so the agent gets instant visual feedback without extra screenshot tool calls, reducing tokens and latency.
New Features
Refactors
Written for commit 93faeb8. Summary will update automatically on new commits.