update flaky evals #1507

tkattkat · 2026-01-06T21:50:16Z

why

currently there is a url check condition, which often causes apple_tv and nbaTrades to fail.
the youtube eval also always fails because the page does not load

what changed

update the evals that relied on url checks so they no longer use them,
remove the youtube eval

test plan

ran the evals

Summary by cubic

Reduce flaky agent evals by simplifying success checks and using screenshot-based evaluation with agent reasoning. Removed the broken YouTube task and improved reliability across Apple TV, AllRecipes, Google Maps, GitHub React version, and the arXiv GPT-4 report.

Bug Fixes
- all_recipes, apple_tv, nba_trades: success now relies only on agent evaluation; removed URL checks.
- iframe_form_multiple: updated instructions and switched to screenshot-based evaluation using agent reasoning; single evaluator YES.
- github_react_version: switched to screenshot-based evaluation using agent reasoning; single evaluator YES.
- google_maps and google_maps_2: switched to screenshot-based evaluation using agent reasoning; simplified success to a single evaluator YES.
- arxiv_gpt_report: added screenshot-based evaluation with agent reasoning; verifies the correct date answer ('03-27-2023'); single evaluator YES.
- Removed YouTube eval and its config entry due to persistent page load failures.
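
As a rough illustration of why dropping the URL check reduces flakiness, here is a minimal sketch. The `EvalOutcome` shape, both helper functions, and the example URLs are assumptions made for illustration; they are not taken from the diff.

```typescript
// Hypothetical shapes for illustration; not the actual types in packages/evals.
type Evaluation = "YES" | "NO";

interface EvalOutcome {
  evaluation: Evaluation; // the evaluator's verdict
  finalUrl: string; // the URL the agent ended on
}

// Old style (assumed): evaluator verdict AND an exact URL match.
function successWithUrlCheck(outcome: EvalOutcome, expectedUrl: string): boolean {
  return outcome.evaluation === "YES" && outcome.finalUrl === expectedUrl;
}

// New style: the evaluator's verdict alone decides.
function successEvaluatorOnly(outcome: EvalOutcome): boolean {
  return outcome.evaluation === "YES";
}

// A run that completed the task but landed on a URL with an extra query param.
const run: EvalOutcome = {
  evaluation: "YES",
  finalUrl: "https://example.com/show/123?ref=home",
};

const oldResult = successWithUrlCheck(run, "https://example.com/show/123");
const newResult = successEvaluatorOnly(run);
console.log(oldResult, newResult); // prints "false true"
```

The exact-match comparison fails on any harmless URL variation (query params, redirects, locale prefixes), which is the kind of failure the description attributes to the url check.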

Written for commit bd1e96d. Summary will update on new commits.

changeset-bot · 2026-01-06T21:50:20Z

⚠️ No Changeset found

Latest commit: bd1e96d

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


cubic-dev-ai

1 issue found across 6 files

Prompt for AI agents (all issues)


Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="packages/evals/tasks/agent/iframe_form_multiple.ts">

<violation number="1" location="packages/evals/tasks/agent/iframe_form_multiple.ts:19">
P2: Instruction updated to use separate first/last name fields and radio button selection, but validation questions were not updated to match. The evaluator still checks for a single 'form name input' rather than separate first/last name fields, and there's no validation for the radio button selection. This inconsistency could cause flaky eval results.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

packages/evals/tasks/agent/iframe_form_multiple.ts

greptile-apps · 2026-01-06T21:53:11Z

Greptile Summary

Simplifies eval success criteria by removing URL-based checks that were causing flaky test failures. The all_recipes, apple_tv, and nba_trades evals now rely solely on the V3 evaluator's YES/NO response instead of combining it with exact URL matching. The iframe_form_multiple eval was refactored to use ScreenshotCollector with agent reasoning for more reliable validation. The youtube eval was removed entirely due to persistent page loading issues.

Key changes:

- Removed URL checks from all_recipes, apple_tv, and nba_trades - success now determined by evaluator alone
- Updated iframe_form_multiple to collect screenshots during execution and pass them to the evaluator with agent reasoning
- Deleted youtube eval and its config entry
- Improved error handling in iframe_form_multiple to properly stringify error messages
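
For the error-handling point, a minimal sketch of what "properly stringify error messages" could look like. The helper name `errorToString` is hypothetical and the actual change in iframe_form_multiple may differ.

```typescript
// Hypothetical helper; the real eval may format errors differently.
function errorToString(error: unknown): string {
  if (error instanceof Error) return error.message;
  if (typeof error === "string") return error;
  try {
    // JSON.stringify can return undefined at runtime (e.g. for undefined input).
    const json: string | undefined = JSON.stringify(error);
    return json ?? String(error);
  } catch {
    // JSON.stringify throws on circular structures; fall back to String().
    return String(error);
  }
}

console.log(errorToString(new Error("page did not load"))); // prints "page did not load"
console.log(errorToString({ code: 42 })); // prints {"code":42}
```

Without a helper like this, interpolating a non-Error value into a log message tends to yield `[object Object]`, which hides the cause of a failed run.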

Confidence Score: 5/5

- Safe to merge - addresses flaky test issues with straightforward logic simplification
- Changes are focused and address the stated problem of flaky evals. The removal of URL checks reduces brittleness while maintaining meaningful validation through the evaluator. The iframe_form_multiple refactor uses established patterns from the codebase. No breaking changes or risky logic introduced.
- No files require special attention

Important Files Changed

| Filename | Overview |
| --- | --- |
| packages/evals/tasks/agent/all_recipes.ts | Removed URL check, now only relies on evaluator YES/NO to determine success |
| packages/evals/tasks/agent/apple_tv.ts | Removed URL check, now only relies on evaluator YES/NO to determine success |
| packages/evals/tasks/agent/nba_trades.ts | Removed URL check, now only relies on evaluator YES/NO to determine success |
| packages/evals/tasks/agent/iframe_form_multiple.ts | Completely refactored to use ScreenshotCollector with agent reasoning, updated instructions and validation approach |

Sequence Diagram

sequenceDiagram
    participant Test as Eval Test
    participant Agent as Stagehand Agent
    participant Page as Web Page
    participant Eval as V3 Evaluator
    participant SC as ScreenshotCollector

    Test->>Page: Navigate to target URL
    
    alt iframe_form_multiple (Screenshot-based)
        Test->>SC: Initialize & start()
        SC->>SC: Capture screenshots every 3s
        Test->>Agent: Execute task instruction
        Agent->>Page: Interact with page
        SC->>SC: Collect screenshots during journey
        Agent-->>Test: Return result with reasoning
        Test->>SC: stop()
        SC-->>Test: Return collected screenshots
        Test->>Eval: ask() with screenshots + agent reasoning
        Eval-->>Test: YES/NO evaluation
    else all_recipes/apple_tv/nba_trades (Evaluator-only)
        Test->>Agent: Execute task instruction
        Agent->>Page: Interact with page
        Agent-->>Test: Return result
        Test->>Eval: ask() with question
        Eval-->>Test: YES/NO evaluation
    end
    
    Test->>Test: Determine success (evaluation === "YES")
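
The screenshot-based branch of the diagram can be sketched roughly as follows. Every interface here (`Collector`, `Agent`, `Evaluator`) and the stub objects are hypothetical stand-ins so the example runs without a browser; the real Stagehand agent, V3 evaluator, and ScreenshotCollector have their own signatures.

```typescript
// Hypothetical interfaces for illustration only.
type Verdict = "YES" | "NO";

interface Collector {
  start(): void;
  stop(): string[]; // returns collected screenshots (opaque labels here)
}

interface Agent {
  execute(instruction: string): Promise<{ message: string }>;
}

interface Evaluator {
  ask(input: {
    question: string;
    screenshots: string[];
    agentReasoning: string;
  }): Promise<{ evaluation: Verdict }>;
}

// Mirrors the diagram: start collector, run agent, stop collector, ask evaluator.
// Error handling is left out to keep the happy path readable.
async function runScreenshotEval(
  agent: Agent,
  collector: Collector,
  evaluator: Evaluator,
  instruction: string,
  question: string,
): Promise<{ success: boolean; screenshotCount: number }> {
  collector.start();
  const result = await agent.execute(instruction);
  const screenshots = collector.stop();
  const { evaluation } = await evaluator.ask({
    question,
    screenshots,
    agentReasoning: result.message,
  });
  return { success: evaluation === "YES", screenshotCount: screenshots.length };
}

// Stubs standing in for the browser-backed pieces.
const shots: string[] = [];
const stubCollector: Collector = {
  start: () => {
    shots.push("shot-start");
  },
  stop: () => {
    shots.push("shot-end");
    return shots;
  },
};
const stubAgent: Agent = {
  execute: async () => ({ message: "Filled the form and selected the option." }),
};
const stubEvaluator: Evaluator = {
  ask: async ({ screenshots, agentReasoning }) => {
    const evaluation: Verdict =
      screenshots.length > 0 && agentReasoning.length > 0 ? "YES" : "NO";
    return { evaluation };
  },
};

const flowResult = runScreenshotEval(
  stubAgent,
  stubCollector,
  stubEvaluator,
  "Fill the form inside the iframe",
  "Was the form filled as instructed?",
);
flowResult.then((r) => console.log(r.success, r.screenshotCount));
```

Passing the agent's own reasoning alongside the screenshots gives the evaluator context that a final-state check alone would miss, which appears to be the motivation for this branch.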

greptile-apps

Additional Comments (1)

packages/evals/tasks/agent/iframe_form_multiple.ts, line 28 (link)

logic: the validation checks for 'John Smith' as a single field value, but the instruction now asks to fill 'John' and 'Smith' in separate first name and last name fields - this will cause the eval to fail
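
One way to keep the instruction and the validation question from drifting apart is to derive both from the same field list. This is only a sketch: the `ExpectedField` type, both builder functions, and the "contact preference" radio label are assumptions for illustration; only the 'John' and 'Smith' values come from the comment above.

```typescript
// Hypothetical field spec shared by the instruction and the evaluator question.
interface ExpectedField {
  label: string;
  kind: "text" | "radio";
  value: string;
}

const expectedFields: ExpectedField[] = [
  { label: "first name", kind: "text", value: "John" },
  { label: "last name", kind: "text", value: "Smith" },
  // The radio label and option below are made up for the example.
  { label: "contact preference", kind: "radio", value: "Email" },
];

// Builds the agent instruction from the field spec.
function buildInstruction(fields: ExpectedField[]): string {
  const steps = fields.map((f) =>
    f.kind === "radio"
      ? `select the "${f.value}" option for ${f.label}`
      : `fill the ${f.label} field with "${f.value}"`,
  );
  return `In the form, ${steps.join(", ")}.`;
}

// Builds the evaluator question from the same spec, so every field the
// agent is told to fill is also a field the evaluator is asked to verify.
function buildQuestion(fields: ExpectedField[]): string {
  const checks = fields.map((f) =>
    f.kind === "radio"
      ? `the ${f.label} radio button "${f.value}" is selected`
      : `the ${f.label} field contains "${f.value}"`,
  );
  return `Do the screenshots show that ${checks.join(", and ")}?`;
}

console.log(buildInstruction(expectedFields));
console.log(buildQuestion(expectedFields));
```

With a shared spec, splitting 'John Smith' into separate first and last name fields changes both strings at once, so the evaluator cannot be left checking a single combined name field.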

6 files reviewed, 3 comments

packages/evals/tasks/agent/nba_trades.ts

packages/evals/tasks/agent/iframe_form_multiple.ts

cubic-dev-ai

No issues found across 6 files

cubic-dev-ai

1 issue found across 1 file (changes from recent commits).

Prompt for AI agents (all issues)


Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="packages/evals/tasks/agent/github_react_version.ts">

<violation number="1" location="packages/evals/tasks/agent/github_react_version.ts:21">
P2: If `agent.execute()` throws an error, `screenshotCollector.stop()` is never called, leaving the interval running. Consider wrapping the agent execution and screenshot collection in a try-finally block to ensure cleanup.</violation>
</file>
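
A minimal sketch of the suggested try-finally cleanup. `IntervalCollector` and `executeWithCleanup` are hypothetical stand-ins for the real ScreenshotCollector and task code; the 3000 ms interval is only an example value.

```typescript
// Hypothetical collector that polls on an interval.
class IntervalCollector {
  private timer: ReturnType<typeof setInterval> | undefined;
  private shots: string[] = [];
  private intervalMs: number;

  constructor(intervalMs: number) {
    this.intervalMs = intervalMs;
  }

  start(): void {
    this.timer = setInterval(() => {
      this.shots.push(`shot-${this.shots.length}`);
    }, this.intervalMs);
  }

  // Idempotent: safe to call more than once.
  stop(): string[] {
    if (this.timer !== undefined) {
      clearInterval(this.timer);
      this.timer = undefined;
    }
    return this.shots;
  }

  get running(): boolean {
    return this.timer !== undefined;
  }
}

// Runs `execute` and guarantees the collector's interval is cleared,
// whether `execute` resolves or throws.
async function executeWithCleanup<T>(
  collector: IntervalCollector,
  execute: () => Promise<T>,
): Promise<T> {
  collector.start();
  try {
    return await execute();
  } finally {
    collector.stop();
  }
}

// Simulate agent.execute() throwing.
const collector = new IntervalCollector(3000);
const failing = executeWithCleanup(collector, async () => {
  throw new Error("agent.execute failed");
}).catch((error: unknown) => (error instanceof Error ? error.message : String(error)));

failing.then((message) => console.log(message, collector.running)); // prints "agent.execute failed false"
```

Because the `finally` block runs on both paths, the interval never outlives the task, so a failed run cannot keep the process alive or leak screenshots into the next eval.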

packages/evals/tasks/agent/github_react_version.ts

update flaky evals

8c11e83

cubic-dev-ai bot reviewed Jan 6, 2026

packages/evals/tasks/agent/iframe_form_multiple.ts (resolved)

greptile-apps bot reviewed Jan 6, 2026

packages/evals/tasks/agent/nba_trades.ts (outdated, resolved)

packages/evals/tasks/agent/iframe_form_multiple.ts (outdated, resolved)

update

315291d

tkattkat marked this pull request as draft January 6, 2026 22:01

adjust iframe eval to use screenshot collector / agents reasoning

635dced

tkattkat marked this pull request as ready for review January 6, 2026 22:07

cubic-dev-ai bot reviewed Jan 6, 2026

update github task to use screenshot collector

65ddff2

cubic-dev-ai bot reviewed Jan 6, 2026

packages/evals/tasks/agent/github_react_version.ts (resolved)

tkattkat added 2 commits January 6, 2026 14:46

switch to using screenshot collector

7c000b4

format + update

bd1e96d

seanmcguire12 approved these changes Jan 7, 2026

tkattkat merged commit 01216a6 into main Jan 7, 2026
19 checks passed
