@tkattkat tkattkat commented Dec 15, 2025

why

we need more evals for agent

what changed

  • Added 19 new evals, composed primarily of "hard" level tasks from public datasets such as onlineMind2Web
  • Updated evals to use agent rather than v3Agent; the incorrect import was causing tasks to fail

test plan

ran evals


Summary by cubic

Added 18 new hard-level agent evals and fixed the agent import to use the correct agent, improving coverage and stability of browser tasks.

  • New Features

    • Added evals for diverse sites (Amazon cart, KFC order, Redfin rentals, Flipkart filters, WebMD tools, Trustpilot, Uniqlo, Alibaba, NVIDIA drivers, OED search, Radiotimes, TheGamer, Trailhead, etc.).
    • Integrated ScreenshotCollector in new evals to capture journeys for better automated evaluation.
    • Updated evals.config.json to register all new tasks under the agent category.
  • Bug Fixes

    • Replaced v3Agent with agent across existing evals to prevent task failures.
    • Standardized agent.execute usage and evaluation flow to improve reliability.

Written for commit b947d97. Summary will update automatically on new commits.

changeset-bot bot commented Dec 15, 2025

⚠️ No Changeset found

Latest commit: b947d97

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


@tkattkat tkattkat changed the title from "Add more agent evals" to "Add more agent evals to evals cli" Dec 15, 2025

greptile-apps bot commented Dec 16, 2025

Greptile Overview

Greptile Summary

This PR adds 19 new agent evaluation tasks and fixes a critical import bug across all existing agent evals.

  • Bug Fix: Changed all agent evals to use agent instead of v3Agent - the types show v3Agent is optional while agent is required, which was causing eval failures
  • New Evals: Added 19 diverse "hard" level tasks from public datasets (onlineMind2Web), covering e-commerce (Amazon, Alibaba, Flipkart, Uniqlo), travel (hotels.com, Redfin, Kayak), food ordering (KFC, Instacart), and specialized sites (NVIDIA drivers, OED, WebMD)
  • Consistent Pattern: New evals use a screenshot-based evaluation approach with ScreenshotCollector for better agent journey tracking
  • Config Updated: All new evals properly registered in evals.config.json under the "agent" category
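
To illustrate the import fix described above, here is a minimal sketch of why reading an optional field can fail while a required one cannot. The `EvalParams` and `AgentLike` shapes are assumptions made for illustration, based only on the statement that `v3Agent` is optional and `agent` is required; the real type definitions in the repository may differ.

```typescript
// Hypothetical stand-in for whatever agent object the eval runner provides.
interface AgentLike {
  execute(instruction: string): Promise<{ success: boolean }>;
}

// Hypothetical parameter shape: `agent` is required, `v3Agent` is optional.
interface EvalParams {
  agent: AgentLike;
  v3Agent?: AgentLike;
}

// Stub agent used only in this illustration.
const stubAgent: AgentLike = {
  execute: async () => ({ success: true }),
};

// Before the fix: destructuring the optional field yields undefined
// whenever the runner does not supply it.
function pickAgentBefore({ v3Agent }: EvalParams): AgentLike | undefined {
  return v3Agent;
}

// After the fix: destructuring the required field always yields an agent.
function pickAgentAfter({ agent }: EvalParams): AgentLike {
  return agent;
}

// The runner in this sketch only supplies the required field.
const params: EvalParams = { agent: stubAgent };

console.log(pickAgentBefore(params) === undefined); // optional field is missing
console.log(pickAgentAfter(params) === stubAgent); // required field is present
```

Under these assumptions, any eval that called `execute` on the value returned by `pickAgentBefore` would fail at runtime, which is consistent with the task failures the PR reports.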

Confidence Score: 5/5

  • This PR is safe to merge - it adds test infrastructure and fixes a bug, with no changes to production code
  • High confidence because the changes are isolated to eval/test files only, the import fix is clearly correct based on the type definitions, and all new evals follow established patterns
  • No files require special attention

Important Files Changed

File Analysis

| Filename | Score | Overview |
| --- | --- | --- |
| packages/evals/evals.config.json | 5/5 | Added 19 new agent eval entries to the config, all correctly categorized under "agent" category |
| packages/evals/tasks/agent/alibaba_supplier_search.ts | 5/5 | New agent eval for searching suppliers on Alibaba, uses screenshot-based evaluation pattern |
| packages/evals/tasks/agent/amazon_shoes_cart.ts | 5/5 | New agent eval for adding shoes to Amazon cart, follows consistent screenshot evaluation pattern |
| packages/evals/tasks/agent/redfin_apartment_rental.ts | 5/5 | New agent eval for finding apartment rentals on Redfin, uses dynamic date calculation |
| packages/evals/tasks/agent/trailhead_superbadge.ts | 5/5 | New agent eval for finding Salesforce Trailhead tasks, includes captureOnNavigation option |
| packages/evals/tasks/agent/all_recipes.ts | 5/5 | Fixed import from v3Agent to agent for correct agent reference |
| packages/evals/tasks/agent/iframe_form.ts | 5/5 | Fixed import from v3Agent to agent in multiple places |
| packages/evals/tasks/agent/kayak.ts | 5/5 | Fixed import from v3Agent to agent in multiple places |
| packages/evals/tasks/agent/kith.ts | 5/5 | Fixed import from v3Agent to agent in multiple places |
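
The "dynamic date calculation" noted for the Redfin eval is not shown in this conversation. As a rough sketch of the general idea, an eval can compute a future date at run time so that the instruction does not go stale. The helper name `futureDate`, the 30-day offset and the instruction wording are all hypothetical and not taken from the PR.

```typescript
// Hypothetical helper: returns the date `daysAhead` days after `from`,
// formatted as YYYY-MM-DD in UTC.
function futureDate(daysAhead: number, from: Date = new Date()): string {
  const d = new Date(from.getTime()); // copy so the caller's Date is not mutated
  d.setUTCDate(d.getUTCDate() + daysAhead); // overflow rolls into the next month
  return d.toISOString().slice(0, 10);
}

// Illustrative instruction text only; the real eval's wording is not shown in the PR.
const instruction = `Find an apartment for rent with a move-in date of ${futureDate(30)}.`;

console.log(instruction);
```

Computing the date relative to the current day keeps the task valid whenever the eval is run, which a hard-coded date would not.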

Sequence Diagram

sequenceDiagram
    participant Test as Eval Runner
    participant Agent as Agent Instance
    participant SC as ScreenshotCollector
    participant Page as Browser Page
    participant Eval as V3Evaluator

    Test->>Page: goto(target URL)
    Test->>SC: new ScreenshotCollector(page)
    Test->>SC: start()
    SC->>Page: capture screenshots (interval)
    Test->>Agent: execute(instruction, maxSteps)
    Agent->>Page: perform actions
    Agent-->>Test: agentResult
    Test->>SC: stop()
    SC-->>Test: screenshots[]
    Test->>Eval: ask(question, screenshots, agentReasoning)
    Eval-->>Test: evaluation, reasoning
    Test-->>Test: return success/failure
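
The flow in the sequence diagram can be sketched roughly as follows. Every interface here is a hypothetical stand-in that mirrors a participant in the diagram; the actual signatures of ScreenshotCollector, the agent and V3Evaluator are not shown in this conversation and may differ. Treating "YES" as the passing verdict is also an assumption.

```typescript
// Hypothetical stand-ins for the diagram's participants.
interface Page {
  goto(url: string): Promise<void>;
}
interface Collector {
  start(): void;
  stop(): Promise<Uint8Array[]>; // returns the captured screenshots
}
interface Agent {
  execute(opts: { instruction: string; maxSteps: number }): Promise<{ message?: string }>;
}
interface Verdict {
  evaluation: string;
  reasoning: string;
}
interface Evaluator {
  ask(opts: {
    question: string;
    screenshots: Uint8Array[];
    agentReasoning?: string;
  }): Promise<Verdict>;
}
interface EvalOutcome {
  success: boolean;
  reasoning: string;
}

// Assumption: the evaluator signals a pass with the string "YES".
function toOutcome(verdict: Verdict): EvalOutcome {
  return { success: verdict.evaluation === "YES", reasoning: verdict.reasoning };
}

async function runScreenshotEval(
  deps: {
    page: Page;
    agent: Agent;
    evaluator: Evaluator;
    makeCollector: (page: Page) => Collector; // stands in for `new ScreenshotCollector(page)`
  },
  task: { url: string; instruction: string; question: string; maxSteps: number },
): Promise<EvalOutcome> {
  await deps.page.goto(task.url);

  const collector = deps.makeCollector(deps.page);
  collector.start();

  let screenshots: Uint8Array[] = [];
  let agentReasoning: string | undefined;
  try {
    const agentResult = await deps.agent.execute({
      instruction: task.instruction,
      maxSteps: task.maxSteps,
    });
    agentReasoning = agentResult.message;
  } finally {
    // Stop capturing even if the agent throws, so the collector does not keep running.
    screenshots = await collector.stop();
  }

  const verdict = await deps.evaluator.ask({
    question: task.question,
    screenshots,
    agentReasoning,
  });
  return toOutcome(verdict);
}
```

Passing the collaborators in as `deps` is a choice made for this sketch so the flow can be exercised with fakes; the evals in the PR presumably construct these objects directly.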


@cubic-dev-ai cubic-dev-ai bot left a comment


No issues found across 44 files


@greptile-apps greptile-apps bot left a comment


44 files reviewed, no comments


@tkattkat tkattkat merged commit 33fba24 into main Dec 16, 2025
18 checks passed