@tkattkat tkattkat commented Jan 6, 2026

why

currently there is a url check condition, which often causes apple_tv and nbaTrades to fail.
the youtube eval also always fails because the page does not load.

what changed

  • update evals that relied on url checks so success no longer depends on them
  • remove youtube eval

test plan

ran the evals


Summary by cubic

Reduce flaky agent evals by simplifying success checks and using screenshot-based evaluation with agent reasoning. Removed the broken YouTube task and improved reliability across Apple TV, AllRecipes, Google Maps, GitHub React version, and the arXiv GPT-4 report.

  • Bug Fixes
    • all_recipes, apple_tv, nba_trades: success now relies only on agent evaluation; removed URL checks.
    • iframe_form_multiple: updated instructions and switched to screenshot-based evaluation using agent reasoning; single evaluator YES.
    • github_react_version: switched to screenshot-based evaluation using agent reasoning; single evaluator YES.
    • google_maps and google_maps_2: switched to screenshot-based evaluation using agent reasoning; simplified success to a single evaluator YES.
    • arxiv_gpt_report: added screenshot-based evaluation with agent reasoning; verifies the correct date answer ('03-27-2023'); single evaluator YES.
    • Removed YouTube eval and its config entry due to persistent page load failures.

Written for commit bd1e96d. Summary will update on new commits.

changeset-bot bot commented Jan 6, 2026

⚠️ No Changeset found

Latest commit: bd1e96d

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

@cubic-dev-ai cubic-dev-ai bot left a comment

1 issue found across 6 files

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="packages/evals/tasks/agent/iframe_form_multiple.ts">

<violation number="1" location="packages/evals/tasks/agent/iframe_form_multiple.ts:19">
P2: Instruction updated to use separate first/last name fields and radio button selection, but validation questions were not updated to match. The evaluator still checks for a single 'form name input' rather than separate first/last name fields, and there's no validation for the radio button selection. This inconsistency could cause flaky eval results.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

greptile-apps bot commented Jan 6, 2026

Greptile Summary

Simplifies eval success criteria by removing URL-based checks that were causing flaky test failures. The all_recipes, apple_tv, and nba_trades evals now rely solely on the V3 evaluator's YES/NO response instead of combining it with exact URL matching. The iframe_form_multiple eval was refactored to use ScreenshotCollector with agent reasoning for more reliable validation. The youtube eval was removed entirely due to persistent page loading issues.

Key changes:

  • Removed URL checks from all_recipes, apple_tv, and nba_trades - success now determined by evaluator alone
  • Updated iframe_form_multiple to collect screenshots during execution and pass them to the evaluator with agent reasoning
  • Deleted youtube eval and its config entry
  • Improved error handling in iframe_form_multiple to properly stringify error messages
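The difference between the old and new success criteria can be sketched as follows. This is a minimal illustration, not the code in `packages/evals`: the `EvalOutcome` shape, both function names, and the example URLs are hypothetical and exist only to show why an exact URL match makes an otherwise passing run fail.

```typescript
// Hypothetical shapes for illustration; not the actual types in packages/evals.
type Evaluation = "YES" | "NO";

interface EvalOutcome {
  evaluation: Evaluation; // the evaluator's verdict
  finalUrl: string; // the URL the agent ended on
}

// Old-style check: evaluator verdict AND an exact URL match.
function successWithUrlCheck(outcome: EvalOutcome, expectedUrl: string): boolean {
  return outcome.evaluation === "YES" && outcome.finalUrl === expectedUrl;
}

// New-style check: the evaluator verdict alone decides success.
function successEvaluatorOnly(outcome: EvalOutcome): boolean {
  return outcome.evaluation === "YES";
}

// The agent reached the right page, but the site appended a query parameter.
const outcome: EvalOutcome = {
  evaluation: "YES",
  finalUrl: "https://example.com/show?ref=home",
};

console.log(successWithUrlCheck(outcome, "https://example.com/show")); // false
console.log(successEvaluatorOnly(outcome)); // true
```

Under these assumptions, any harmless variation in the final URL (tracking parameters, redirects, locale prefixes) flips the old check to a failure even though the evaluator answered YES.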

Confidence Score: 5/5

  • Safe to merge - addresses flaky test issues with straightforward logic simplification
  • Changes are focused and address the stated problem of flaky evals. The removal of URL checks reduces brittleness while maintaining meaningful validation through the evaluator. The iframe_form_multiple refactor uses established patterns from the codebase. No breaking changes or risky logic introduced.
  • No files require special attention

Important Files Changed

| Filename | Overview |
| --- | --- |
| packages/evals/tasks/agent/all_recipes.ts | Removed URL check, now only relies on evaluator YES/NO to determine success |
| packages/evals/tasks/agent/apple_tv.ts | Removed URL check, now only relies on evaluator YES/NO to determine success |
| packages/evals/tasks/agent/nba_trades.ts | Removed URL check, now only relies on evaluator YES/NO to determine success |
| packages/evals/tasks/agent/iframe_form_multiple.ts | Completely refactored to use ScreenshotCollector with agent reasoning, updated instructions and validation approach |

Sequence Diagram

sequenceDiagram
    participant Test as Eval Test
    participant Agent as Stagehand Agent
    participant Page as Web Page
    participant Eval as V3 Evaluator
    participant SC as ScreenshotCollector

    Test->>Page: Navigate to target URL
    
    alt iframe_form_multiple (Screenshot-based)
        Test->>SC: Initialize & start()
        SC->>SC: Capture screenshots every 3s
        Test->>Agent: Execute task instruction
        Agent->>Page: Interact with page
        SC->>SC: Collect screenshots during journey
        Agent-->>Test: Return result with reasoning
        Test->>SC: stop()
        SC-->>Test: Return collected screenshots
        Test->>Eval: ask() with screenshots + agent reasoning
        Eval-->>Test: YES/NO evaluation
    else all_recipes/apple_tv/nba_trades (Evaluator-only)
        Test->>Agent: Execute task instruction
        Agent->>Page: Interact with page
        Agent-->>Test: Return result
        Test->>Eval: ask() with question
        Eval-->>Test: YES/NO evaluation
    end
    
    Test->>Test: Determine success (evaluation === "YES")

@greptile-apps greptile-apps bot left a comment

Additional Comments (1)

  1. packages/evals/tasks/agent/iframe_form_multiple.ts, line 28 (link)

    logic: the validation checks for 'John Smith' as a single field value, but the instruction now asks to fill 'John' and 'Smith' in separate first name and last name fields - this will cause the eval to fail
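One way to keep the instruction and the validation from drifting apart is to derive both from the same expected values. The sketch below is an assumption-laden illustration, not the eval's actual code: `ExpectedForm`, `buildInstruction`, and `buildQuestions` are hypothetical names, and the wording of the instruction and questions is invented for the example.

```typescript
// Hypothetical expected values; names and wording are illustrative only.
interface ExpectedForm {
  firstName: string;
  lastName: string;
}

// Build the agent instruction from the expected values.
function buildInstruction(expected: ExpectedForm): string {
  return (
    `Fill the first name field with "${expected.firstName}" ` +
    `and the last name field with "${expected.lastName}".`
  );
}

// Build the evaluator questions from the same values, one per field,
// so a change to the instruction cannot leave a stale combined-name check.
function buildQuestions(expected: ExpectedForm): string[] {
  return [
    `Does the first name field contain "${expected.firstName}"?`,
    `Does the last name field contain "${expected.lastName}"?`,
  ];
}

const expected: ExpectedForm = { firstName: "John", lastName: "Smith" };
console.log(buildInstruction(expected));
console.log(buildQuestions(expected));
```

With this structure, splitting the name into two fields changes the instruction and the questions in one place.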

6 files reviewed, 3 comments

@tkattkat tkattkat marked this pull request as draft January 6, 2026 22:01
@tkattkat tkattkat marked this pull request as ready for review January 6, 2026 22:07

@cubic-dev-ai cubic-dev-ai bot left a comment

No issues found across 6 files

@cubic-dev-ai cubic-dev-ai bot left a comment

1 issue found across 1 file (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="packages/evals/tasks/agent/github_react_version.ts">

<violation number="1" location="packages/evals/tasks/agent/github_react_version.ts:21">
P2: If `agent.execute()` throws an error, `screenshotCollector.stop()` is never called, leaving the interval running. Consider wrapping the agent execution and screenshot collection in a try-finally block to ensure cleanup.</violation>
</file>
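The try-finally cleanup suggested above might look roughly like this. `FakeCollector` is a hypothetical stand-in that mimics an interval-based collector; the real `ScreenshotCollector` API may differ, and `runWithCleanup` is a name invented for this sketch.

```typescript
// Hypothetical stand-in for an interval-based screenshot collector.
// The real ScreenshotCollector may expose a different API.
class FakeCollector {
  private timer: ReturnType<typeof setInterval> | null = null;
  running = false;

  start(): void {
    this.running = true;
    // A real collector would capture a screenshot on each tick.
    this.timer = setInterval(() => {}, 3000);
  }

  stop(): string[] {
    if (this.timer !== null) {
      clearInterval(this.timer);
      this.timer = null;
    }
    this.running = false;
    return []; // a real collector would return the captured screenshots
  }
}

// Run a task while collecting; stop() is called whether the task
// resolves or throws, so the interval never outlives the eval.
async function runWithCleanup<T>(
  collector: FakeCollector,
  task: () => Promise<T>,
): Promise<T> {
  collector.start();
  try {
    return await task();
  } finally {
    collector.stop();
  }
}

// Usage: the task throws, yet the collector is stopped before the error surfaces.
const collector = new FakeCollector();
runWithCleanup(collector, async () => {
  throw new Error("agent failed");
}).catch(() => console.log("collector running:", collector.running)); // false
```

If the screenshots are needed for evaluation after a successful run, the value returned by `stop()` would have to be captured inside the `finally` block or the helper restructured; this sketch only demonstrates the cleanup guarantee.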

@tkattkat tkattkat merged commit 01216a6 into main Jan 7, 2026
19 checks passed

3 participants