update flaky evals #1507
1 issue found across 6 files
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="packages/evals/tasks/agent/iframe_form_multiple.ts">
<violation number="1" location="packages/evals/tasks/agent/iframe_form_multiple.ts:19">
P2: Instruction updated to use separate first/last name fields and radio button selection, but validation questions were not updated to match. The evaluator still checks for a single 'form name input' rather than separate first/last name fields, and there's no validation for the radio button selection. This inconsistency could cause flaky eval results.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
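One way to avoid this kind of drift is to derive both the agent instruction and the evaluator questions from a single description of the form. The sketch below is illustrative only: `FormSpec`, `buildInstruction`, `buildQuestions`, and the radio option label are hypothetical names and values, not taken from `iframe_form_multiple.ts`.

```typescript
// Hypothetical shape describing what the agent should enter in the form.
interface FormSpec {
  firstName: string;
  lastName: string;
  radioChoice: string; // label of the radio option to select (assumed)
}

// Build the instruction given to the agent from the spec.
function buildInstruction(spec: FormSpec): string {
  return (
    `Fill the first name field with "${spec.firstName}", ` +
    `the last name field with "${spec.lastName}", ` +
    `and select the "${spec.radioChoice}" radio button.`
  );
}

// Build one validation question per field, from the same spec,
// so the checks cannot fall out of step with the instruction.
function buildQuestions(spec: FormSpec): string[] {
  return [
    `Does the first name input contain "${spec.firstName}"?`,
    `Does the last name input contain "${spec.lastName}"?`,
    `Is the "${spec.radioChoice}" radio button selected?`,
  ];
}

// Example values: the names come from the review comments,
// the radio label is a placeholder.
const exampleSpec: FormSpec = {
  firstName: "John",
  lastName: "Smith",
  radioChoice: "Option A",
};
const exampleInstruction = buildInstruction(exampleSpec);
const exampleQuestions = buildQuestions(exampleSpec);
```

With this arrangement, changing the form fields in one place updates both the instruction and the checks.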
Greptile Summary
Simplifies eval success criteria by removing URL-based checks that were causing flaky test failures.
Key changes:
Confidence Score: 5/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant Test as Eval Test
participant Agent as Stagehand Agent
participant Page as Web Page
participant Eval as V3 Evaluator
participant SC as ScreenshotCollector
Test->>Page: Navigate to target URL
alt iframe_form_multiple (Screenshot-based)
Test->>SC: Initialize & start()
SC->>SC: Capture screenshots every 3s
Test->>Agent: Execute task instruction
Agent->>Page: Interact with page
SC->>SC: Collect screenshots during journey
Agent-->>Test: Return result with reasoning
Test->>SC: stop()
SC-->>Test: Return collected screenshots
Test->>Eval: ask() with screenshots + agent reasoning
Eval-->>Test: YES/NO evaluation
else all_recipes/apple_tv/nba_trades (Evaluator-only)
Test->>Agent: Execute task instruction
Agent->>Page: Interact with page
Agent-->>Test: Return result
Test->>Eval: ask() with question
Eval-->>Test: YES/NO evaluation
end
Test->>Test: Determine success (evaluation === "YES")
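The final two steps of the diagram (ask the evaluator, then map YES/NO to success) can be sketched as follows. This is a minimal sketch under assumptions: `EvaluatorLike`, `AskInput`, and `judge` are hypothetical names, and the real V3 Evaluator's `ask()` signature may differ.

```typescript
type Verdict = "YES" | "NO";

// Assumed input shape: a question plus optional evidence from the run.
interface AskInput {
  question: string;
  screenshots?: string[];
  agentReasoning?: string;
}

// Assumed evaluator interface; stands in for the V3 Evaluator in the diagram.
interface EvaluatorLike {
  ask(input: AskInput): Promise<{ evaluation: Verdict; reasoning: string }>;
}

// Ask the evaluator and reduce its verdict to a boolean success flag,
// mirroring "Determine success (evaluation === 'YES')".
async function judge(
  evaluator: EvaluatorLike,
  input: AskInput,
): Promise<{ success: boolean; reasoning: string }> {
  const { evaluation, reasoning } = await evaluator.ask(input);
  return { success: evaluation === "YES", reasoning };
}
```

The screenshot-based branch would pass `screenshots` and `agentReasoning`; the evaluator-only branch would pass just the `question`.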
Additional Comments (1)
packages/evals/tasks/agent/iframe_form_multiple.ts, line 28
logic: The validation checks for 'John Smith' as a single field value, but the instruction now asks the agent to fill 'John' and 'Smith' into separate first name and last name fields. This will cause the eval to fail.
6 files reviewed, 3 comments
No issues found across 6 files
1 issue found across 1 file (changes from recent commits).
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="packages/evals/tasks/agent/github_react_version.ts">
<violation number="1" location="packages/evals/tasks/agent/github_react_version.ts:21">
P2: If `agent.execute()` throws an error, `screenshotCollector.stop()` is never called, leaving the interval running. Consider wrapping the agent execution and screenshot collection in a try-finally block to ensure cleanup.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
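The try-finally cleanup suggested above can be sketched as follows. `IntervalCollector` and `withCollector` are hypothetical stand-ins written for illustration; they are not the repository's `ScreenshotCollector`, whose actual API may differ.

```typescript
// Hypothetical collector that captures on a timer, similar in spirit
// to the ScreenshotCollector described in the review.
class IntervalCollector {
  private timer: ReturnType<typeof setInterval> | null = null;
  private shots: string[] = [];
  private capture: () => string;
  private intervalMs: number;

  constructor(capture: () => string, intervalMs = 3000) {
    this.capture = capture;
    this.intervalMs = intervalMs;
  }

  start(): void {
    if (this.timer !== null) return; // already running
    this.timer = setInterval(() => this.shots.push(this.capture()), this.intervalMs);
  }

  // Safe to call more than once; clears the interval if it is running.
  stop(): string[] {
    if (this.timer !== null) {
      clearInterval(this.timer);
      this.timer = null;
    }
    return this.shots;
  }

  get running(): boolean {
    return this.timer !== null;
  }
}

// Run a task while collecting, and always stop the collector,
// whether the task resolves or throws.
async function withCollector<T>(
  collector: IntervalCollector,
  task: () => Promise<T>,
): Promise<T> {
  collector.start();
  try {
    return await task();
  } finally {
    collector.stop();
  }
}
```

In the eval, `task` would wrap the `agent.execute()` call, so an exception from the agent still clears the interval before it propagates.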
why
Currently there is a URL check condition, which often causes apple_tv and nbaTrades to fail.
The YouTube eval also always fails because the page does not load.
what changed
test plan
ran the evals
Summary by cubic
Reduce flaky agent evals by simplifying success checks and using screenshot-based evaluation with agent reasoning. Removed the broken YouTube task and improved reliability across Apple TV, AllRecipes, Google Maps, GitHub React version, and the arXiv GPT-4 report.
Written for commit bd1e96d. Summary will update on new commits.