Add more agent evals to evals cli #1422

tkattkat · 2025-12-15T23:58:28Z

why

We need more evals for the agent.

what changed

- Added 19 new evals, composed primarily of "hard"-level tasks from public datasets such as onlineMind2Web
- Updated evals to use agent rather than v3Agent; the v3Agent import was incorrect and caused tasks to fail

test plan

ran evals

Summary by cubic

Added 19 new hard-level agent evals and fixed the agent import to use the correct agent, improving coverage and stability of browser tasks.

New Features
- Added evals for diverse sites (Amazon cart, KFC order, Redfin rentals, Flipkart filters, WebMD tools, Trustpilot, Uniqlo, Alibaba, NVIDIA drivers, OED search, Radiotimes, TheGamer, Trailhead, etc.).
- Integrated ScreenshotCollector in new evals to capture journeys for better automated evaluation.
- Updated evals.config.json to register all new tasks under the agent category.
Bug Fixes
- Replaced v3Agent with agent across existing evals to prevent task failures.
- Standardized agent.execute usage and evaluation flow to improve reliability.
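
As a rough illustration of the config registration mentioned above, here is a minimal sketch of adding task names under a category. The entry shape (`name`, `categories`), the `agent/...` naming, and the `registerTasks` helper are assumptions made for illustration; they are not taken from the PR's evals.config.json.

```typescript
// Hypothetical shape of one entry in evals.config.json (field names are assumptions).
interface EvalTaskEntry {
  name: string;
  categories: string[];
}

// Returns a new task list with `names` registered under `category`,
// skipping any name that is already present.
function registerTasks(
  existing: EvalTaskEntry[],
  names: string[],
  category: string,
): EvalTaskEntry[] {
  const known = new Set(existing.map((t) => t.name));
  const added = names
    .filter((name) => !known.has(name))
    .map((name) => ({ name, categories: [category] }));
  return [...existing, ...added];
}

// Illustrative names only, derived from file names listed in this PR.
const tasks = registerTasks(
  [{ name: "agent/kayak", categories: ["agent"] }],
  ["agent/amazon_shoes_cart", "agent/kayak"],
  "agent",
);
console.log(tasks.map((t) => t.name));
```

Skipping names that already exist keeps the registration idempotent, so re-running it does not duplicate entries.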

Written for commit b947d97. Summary will update automatically on new commits.

changeset-bot · 2025-12-15T23:58:32Z

⚠️ No Changeset found

Latest commit: b947d97

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

greptile-apps · 2025-12-16T00:01:58Z

Greptile Overview

Greptile Summary

This PR adds 19 new agent evaluation tasks and fixes a critical import bug across all existing agent evals.

- Bug Fix: Changed all agent evals to use agent instead of v3Agent - the types show v3Agent is optional while agent is required, which was causing eval failures
- New Evals: Added 19 diverse "hard" level tasks from public datasets (onlineMind2Web), covering e-commerce (Amazon, Alibaba, Flipkart, Uniqlo), travel (hotels.com, Redfin, Kayak), food ordering (KFC, Instacart), and specialized sites (NVIDIA drivers, OED, WebMD)
- Consistent Pattern: New evals use a screenshot-based evaluation approach with ScreenshotCollector for better agent journey tracking
- Config Updated: All new evals properly registered in evals.config.json under the "agent" category
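
To make the optional-versus-required point concrete, here is a minimal sketch of how reading an optional field can fail at runtime while a required one cannot. `AgentLike` and `EvalContext` are hypothetical stand-ins, not the repository's real types, and the exact way the original failures surfaced is not stated in this PR.

```typescript
// Hypothetical stand-ins for the eval harness types (names and shapes are assumptions).
interface AgentLike {
  execute(opts: { instruction: string; maxSteps: number }): Promise<{ success: boolean }>;
}

interface EvalContext {
  agent: AgentLike; // required: always present
  v3Agent?: AgentLike; // optional: may be undefined at runtime
}

async function runWithV3Agent(ctx: EvalContext): Promise<boolean> {
  // When v3Agent is undefined, optional chaining skips the call entirely,
  // so the task is reported as failed without ever running.
  const result = await ctx.v3Agent?.execute({ instruction: "noop", maxSteps: 1 });
  return result?.success ?? false;
}

async function runWithAgent(ctx: EvalContext): Promise<boolean> {
  // agent is required, so the call always happens.
  const result = await ctx.agent.execute({ instruction: "noop", maxSteps: 1 });
  return result.success;
}

// A context where only the required field is populated.
const ctx: EvalContext = {
  agent: { execute: async () => ({ success: true }) },
};
```

With this context, `runWithV3Agent(ctx)` resolves to `false` and `runWithAgent(ctx)` resolves to `true`, which is the kind of silent failure that switching to the required field avoids.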

Confidence Score: 5/5

- This PR is safe to merge - it adds test infrastructure and fixes a bug, with no changes to production code
- High confidence because the changes are isolated to eval/test files only, the import fix is clearly correct based on the type definitions, and all new evals follow established patterns
- No files require special attention

Important Files Changed

File Analysis

| Filename | Score | Overview |
| --- | --- | --- |
| packages/evals/evals.config.json | 5/5 | Added 19 new agent eval entries to the config, all correctly categorized under "agent" category |
| packages/evals/tasks/agent/alibaba_supplier_search.ts | 5/5 | New agent eval for searching suppliers on Alibaba, uses screenshot-based evaluation pattern |
| packages/evals/tasks/agent/amazon_shoes_cart.ts | 5/5 | New agent eval for adding shoes to Amazon cart, follows consistent screenshot evaluation pattern |
| packages/evals/tasks/agent/redfin_apartment_rental.ts | 5/5 | New agent eval for finding apartment rentals on Redfin, uses dynamic date calculation |
| packages/evals/tasks/agent/trailhead_superbadge.ts | 5/5 | New agent eval for finding Salesforce Trailhead tasks, includes captureOnNavigation option |
| packages/evals/tasks/agent/all_recipes.ts | 5/5 | Fixed import from v3Agent to agent for correct agent reference |
| packages/evals/tasks/agent/iframe_form.ts | 5/5 | Fixed import from v3Agent to agent in multiple places |
| packages/evals/tasks/agent/kayak.ts | 5/5 | Fixed import from v3Agent to agent in multiple places |
| packages/evals/tasks/agent/kith.ts | 5/5 | Fixed import from v3Agent to agent in multiple places |
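
The "dynamic date calculation" noted for the Redfin eval is not shown in this conversation. As a sketch of the general idea, an eval can derive its dates from the current day so the instruction does not go stale. The helper names, the 30-day offset, and the instruction wording below are all assumptions.

```typescript
// Returns the date `daysAhead` days after `from`, formatted as YYYY-MM-DD (UTC).
function futureDate(from: Date, daysAhead: number): string {
  const d = new Date(from.getTime());
  // setUTCDate rolls over month and year boundaries automatically.
  d.setUTCDate(d.getUTCDate() + daysAhead);
  return d.toISOString().slice(0, 10);
}

// Hypothetical instruction builder; the wording is illustrative, not taken from the PR.
function buildInstruction(today: Date): string {
  const moveIn = futureDate(today, 30);
  return `Find an apartment for rent with a move-in date of ${moveIn}.`;
}

console.log(buildInstruction(new Date()));
```

Computing the date at run time avoids hard-coding a date that would eventually fall in the past and make the task impossible to complete.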

Sequence Diagram

sequenceDiagram
    participant Test as Eval Runner
    participant Agent as Agent Instance
    participant SC as ScreenshotCollector
    participant Page as Browser Page
    participant Eval as V3Evaluator

    Test->>Page: goto(target URL)
    Test->>SC: new ScreenshotCollector(page)
    Test->>SC: start()
    SC->>Page: capture screenshots (interval)
    Test->>Agent: execute(instruction, maxSteps)
    Agent->>Page: perform actions
    Agent-->>Test: agentResult
    Test->>SC: stop()
    SC-->>Test: screenshots[]
    Test->>Eval: ask(question, screenshots, agentReasoning)
    Eval-->>Test: evaluation, reasoning
    Test-->>Test: return success/failure
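
The sequence above can be sketched in code roughly as follows. Every interface here is a hypothetical stand-in that mirrors a participant in the diagram; none of them is the real ScreenshotCollector, V3Evaluator, or agent API, and screenshots are modeled as plain strings for simplicity.

```typescript
// Hypothetical stand-ins for the diagram's participants (not the real classes).
interface Page {
  goto(url: string): Promise<void>;
}
interface Collector {
  start(): void;
  stop(): string[]; // returns the captured screenshots
}
interface Agent {
  execute(opts: { instruction: string; maxSteps: number }): Promise<{ message: string }>;
}
interface Evaluator {
  ask(opts: {
    question: string;
    screenshots: string[];
    agentReasoning: string;
  }): Promise<{ evaluation: "YES" | "NO"; reasoning: string }>;
}
interface EvalDeps {
  page: Page;
  agent: Agent;
  evaluator: Evaluator;
  makeCollector: (page: Page) => Collector;
}

async function runScreenshotEval(
  deps: EvalDeps,
  url: string,
  instruction: string,
  maxSteps: number,
): Promise<{ success: boolean; reasoning: string }> {
  await deps.page.goto(url);
  const collector = deps.makeCollector(deps.page);
  collector.start();

  let screenshots: string[] = [];
  let agentReasoning = "";
  try {
    const agentResult = await deps.agent.execute({ instruction, maxSteps });
    agentReasoning = agentResult.message;
  } finally {
    // Stop capturing even if the agent throws.
    screenshots = collector.stop();
  }

  const { evaluation, reasoning } = await deps.evaluator.ask({
    question: `Did the agent complete this task: ${instruction}?`,
    screenshots,
    agentReasoning,
  });
  return { success: evaluation === "YES", reasoning };
}

// Usage with in-memory fakes, so the flow can be exercised without a browser.
const fakeDeps: EvalDeps = {
  page: { goto: async () => {} },
  agent: { execute: async () => ({ message: "added item to cart" }) },
  evaluator: {
    ask: async ({ screenshots }) => ({
      evaluation: screenshots.length > 0 ? "YES" : "NO",
      reasoning: "stub",
    }),
  },
  makeCollector: () => ({ start: () => {}, stop: () => ["shot-1"] }),
};
```

Passing the collaborators in as dependencies is a choice made for this sketch; it keeps the flow testable with fakes and says nothing about how the actual evals are wired.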

cubic-dev-ai

No issues found across 44 files

greptile-apps

44 files reviewed, no comments

tkattkat added 3 commits December 15, 2025 14:47

add more agent evals

f87faa8

revert accidental changes + update evals config

3e82b30

update imports

60a0389

tkattkat changed the title ~~Add more agent evals~~ Add more agent evals to evals cli Dec 15, 2025

format

b947d97

cubic-dev-ai bot reviewed Dec 16, 2025

greptile-apps bot reviewed Dec 16, 2025

pirate approved these changes Dec 16, 2025

tkattkat merged commit 33fba24 into main Dec 16, 2025
18 checks passed
