Update the agent evals cli #1364

tkattkat · 2025-12-04T23:57:00Z

why

After the transition to v3, the model handling for agent evals was not updated to account for the new model formats.

what changed

added an isCua flag and two separate model maps so that models can be run with or without cua
adjusted model handling to properly parse cua models
added a tag to distinguish whether a run uses cua

test plan

tested evals for both cua and non-cua models

Summary by cubic

Updated the agent evals CLI to support and correctly run both CUA and non-CUA agent models in v3. Fixes agent model parsing and enables mixed eval runs.

New Features
- Split agent models into standard and CUA lists; added getAgentModelEntries with a cua flag.
- Passed isCUA through EvalInput to initV3 and tasks; a safe internal model is selected for handlers when CUA is enabled.
- Improved provider lookup and error messages for CUA models using short names; testcases now tag models as "cua" or "agent".
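
A minimal sketch of how the split lists and the `cua` flag might fit together. The names `AGENT_MODELS`, `AGENT_MODELS_CUA`, `AgentModelEntry`, and `getAgentModelEntries` appear in the review comments on this PR; the list contents and the function body below are assumptions for illustration, not the code in `packages/evals/taskConfig.ts`.

```typescript
// Illustrative only: list contents are placeholders, not the repo's real defaults.
type AgentModelEntry = { modelName: string; cua: boolean };

const AGENT_MODELS: string[] = ["anthropic/claude-sonnet-4-20250514"];
const AGENT_MODELS_CUA: string[] = [
  "anthropic/claude-sonnet-4-20250514",
  "example-provider/example-cua-model", // hypothetical model name
];

// Tag every model with the mode it should run in, so callers never
// have to infer CUA support from the model name itself.
function getAgentModelEntries(): AgentModelEntry[] {
  const entries: AgentModelEntry[] = [];
  for (const modelName of AGENT_MODELS) {
    entries.push({ modelName, cua: false });
  }
  for (const modelName of AGENT_MODELS_CUA) {
    entries.push({ modelName, cua: true });
  }
  return entries;
}
```

Keeping the flag on the entry rather than deriving it from the name is what lets the same model be evaluated in both modes.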

Written for commit 13b906c. Summary will update automatically on new commits.

changeset-bot · 2025-12-04T23:57:04Z

🦋 Changeset detected

Latest commit: 13b906c

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package

Name	Type
@browserbasehq/stagehand-evals	Patch


greptile-apps · 2025-12-04T23:59:32Z

Greptile Overview

Greptile Summary

This PR fixes model handling in the agent evals CLI to properly support provider-prefixed model names (e.g., anthropic/claude-sonnet-4-20250514) in the v3 transition.

Key Changes:

- Separated agent models into CUA and non-CUA categories with an explicit isCUA flag instead of auto-detection
- Added model name parsing in initV3.ts to extract short names from provider-prefixed formats before looking them up in modelToAgentProviderMap
- Created AgentModelEntry type to track which models should run with cua: true
- Updated test case generation to properly pass the isCUA flag through to initialization

Technical Details:
The old code attempted to check if a model was CUA using modelName in modelToAgentProviderMap, which failed for provider-prefixed names. The new code explicitly passes the isCUA flag from the configuration and extracts the short model name (e.g., claude-sonnet-4-20250514) before the lookup, matching the pattern already used in AgentProvider.getAgentProvider().
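
The lookup described above can be sketched as follows. This is an assumption-laden illustration: the map contents, the helper names `toShortModelName` and `lookupCuaProvider`, and the choice to split on the first `/` are all hypothetical, and the real logic in `initV3.ts` and `AgentProvider.getAgentProvider()` may differ.

```typescript
type AgentProviderType = "openai" | "anthropic" | "google";

// Hypothetical stand-in for modelToAgentProviderMap, keyed by short model name.
const modelToAgentProviderMap: Record<string, AgentProviderType> = {
  "claude-sonnet-4-20250514": "anthropic",
};

// "anthropic/claude-sonnet-4-20250514" -> "claude-sonnet-4-20250514"
// Names without a provider prefix are returned unchanged.
function toShortModelName(modelName: string): string {
  const slash = modelName.indexOf("/");
  return slash === -1 ? modelName : modelName.slice(slash + 1);
}

// Resolve the provider for a CUA model, reporting both the full and
// the short name when the lookup fails.
function lookupCuaProvider(modelName: string): AgentProviderType {
  const shortName = toShortModelName(modelName);
  const provider = modelToAgentProviderMap[shortName];
  if (!provider) {
    throw new Error(
      `Unknown CUA model "${modelName}" (looked up as "${shortName}")`,
    );
  }
  return provider;
}
```

With a map keyed by short names, a check such as `"anthropic/claude-sonnet-4-20250514" in modelToAgentProviderMap` evaluates to false, which is the failure mode the paragraph above describes.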

Confidence Score: 5/5

- This PR is safe to merge with minimal risk
- The changes are well-structured and fix a clear bug in model name handling. The logic correctly mirrors the existing pattern in AgentProvider.getAgentProvider() for parsing provider-prefixed model names. The separation of CUA and non-CUA models is explicit and clear. The PR has been tested with both CUA and non-CUA evals according to the test plan.
- No files require special attention

Important Files Changed

File Analysis

Filename	Score	Overview
packages/evals/taskConfig.ts	5/5	Split agent models into CUA and non-CUA lists, created unified `AGENT_MODEL_ENTRIES` structure with `cua` flags, exported `getAgentModelEntries` function
packages/evals/index.eval.ts	5/5	Updated test case generation to use `AgentModelEntry` objects with `cua` flags, passed `isCUA` to `initV3`, added CUA-specific tags to test cases
packages/evals/initV3.ts	5/5	Changed from auto-detecting CUA models via `modelToAgentProviderMap` to explicit `isCUA` parameter, added model name parsing to extract short name from provider-prefixed models (e.g., `anthropic/claude-sonnet-4-20250514` → `claude-sonnet-4-20250514`)

Sequence Diagram

sequenceDiagram
    participant CLI as Eval CLI
    participant Config as taskConfig.ts
    participant IndexEval as index.eval.ts
    participant InitV3 as initV3.ts
    participant AgentProvider as AgentProvider.ts
    
    CLI->>Config: getAgentModelEntries()
    Config-->>CLI: [{modelName, cua: true/false}]
    
    CLI->>IndexEval: Generate test cases
    Note over IndexEval: For agent categories
    IndexEval->>Config: getAgentModelEntries()
    Config-->>IndexEval: AGENT_MODEL_ENTRIES
    
    loop Each model entry
        IndexEval->>IndexEval: Create testcase with isCUA flag
        Note over IndexEval: input: {name, modelName, isCUA}
    end
    
    IndexEval->>InitV3: initV3({modelName, isCUA, ...})
    
    alt isCUA is true
        InitV3->>InitV3: Extract short model name
        Note over InitV3: "anthropic/model" → "model"
        InitV3->>AgentProvider: Lookup in modelToAgentProviderMap
        AgentProvider-->>InitV3: Provider type (openai/anthropic/google)
        InitV3->>InitV3: Create agent with cua: true
    else isCUA is false
        InitV3->>InitV3: Create agent without CUA
    end
    
    InitV3-->>IndexEval: V3InitResult with agent
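
The test case generation loop in the diagram can be sketched roughly as below. The input shape `{name, modelName, isCUA}` and the "cua" / "agent" tags come from the summaries in this PR; the `Testcase` type, the `buildAgentTestcases` helper, and the exact tag list are hypothetical and not taken from `index.eval.ts`.

```typescript
type AgentModelEntry = { modelName: string; cua: boolean };
type EvalInput = { name: string; modelName: string; isCUA: boolean };
type Testcase = { name: string; input: EvalInput; tags: string[] };

// Build one test case per (task, model entry) pair, carrying the
// CUA flag in the input and tagging the run as "cua" or "agent".
function buildAgentTestcases(
  taskNames: string[],
  entries: AgentModelEntry[],
): Testcase[] {
  const testcases: Testcase[] = [];
  for (const name of taskNames) {
    for (const entry of entries) {
      testcases.push({
        name,
        input: { name, modelName: entry.modelName, isCUA: entry.cua },
        tags: [entry.modelName, entry.cua ? "cua" : "agent", name],
      });
    }
  }
  return testcases;
}
```

Because the tag is derived from the entry rather than the model name, two runs of the same model in different modes remain distinguishable in the results.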

greptile-apps

4 files reviewed, no comments


cubic-dev-ai

1 issue found across 5 files

Prompt for AI agents (all 1 issues)


Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="packages/evals/taskConfig.ts">

<violation number="1" location="packages/evals/taskConfig.ts:118">
P2: The model `anthropic/claude-sonnet-4-20250514` appears in both `AGENT_MODELS` and `AGENT_MODELS_CUA`, causing `DEFAULT_AGENT_MODELS` to contain duplicate entries. If this is intentional (to test the same model with both CUA modes), consider adding a comment. Otherwise, remove it from one of the arrays to avoid duplicate test runs.</violation>
</file>
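
A small sketch of the duplication this comment describes, and one possible way to handle it. The array contents are reduced to the single model named in the comment, and `uniqueModelNames` and `entryKeys` are hypothetical names; whether the repo deduplicates or keeps both runs intentionally is not stated here.

```typescript
const AGENT_MODELS = ["anthropic/claude-sonnet-4-20250514"];
const AGENT_MODELS_CUA = ["anthropic/claude-sonnet-4-20250514"];

// Naive concatenation repeats the shared model name.
const combined: string[] = AGENT_MODELS.concat(AGENT_MODELS_CUA);

// Option A: a name-only default list with duplicates removed.
const uniqueModelNames: string[] = Array.from(new Set(combined));

// Option B: keep both runs, but key them by (modelName, cua) so the
// two entries are distinct rather than accidental duplicates.
const entryKeys: string[] = [
  ...AGENT_MODELS.map((m) => `${m}|agent`),
  ...AGENT_MODELS_CUA.map((m) => `${m}|cua`),
];
```

Option A suits a flat list of names; Option B fits the case the comment allows for, where the same model is deliberately evaluated in both modes.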



@miguelg719

This PR was opened by the [Changesets release](https://github.com/changesets/action) GitHub action. When you're ready to do a release, you can merge this and the packages will be published to npm automatically. If you're not ready to do a release yet, that's fine, whenever you add more changesets to main, this PR will be updated.

# Releases

## @browserbasehq/[email protected]

### Patch Changes

- [#1388](#1388) [`605ed6b`](605ed6b) Thanks [@miguelg719](https://github.com/miguelg719)! - Fix multiple click event dispatches on CDP and Anthropic CUA handling (double clicks)
- [#1400](#1400) [`34e7e5b`](34e7e5b) Thanks [@seanmcguire12](https://github.com/seanmcguire12)! - don't write base64 encoded screenshots to disk when caching agent actions
- [#1345](#1345) [`943d2d7`](943d2d7) Thanks [@tkattkat](https://github.com/tkattkat)! - Add support for aborting / stopping an agent run & continuing an agent run using messages from prior runs
- [#1334](#1334) [`0e95cd2`](0e95cd2) Thanks [@tkattkat](https://github.com/tkattkat)! - Add support for google vertex provider
- [#1410](#1410) [`d4237e4`](d4237e4) Thanks [@seanmcguire12](https://github.com/seanmcguire12)! - fix: include extract in stagehand.history()
- [#1315](#1315) [`86975e7`](86975e7) Thanks [@tkattkat](https://github.com/tkattkat)! - Add streaming support to agent through stream:true in the agent config
- [#1304](#1304) [`d5e119b`](d5e119b) Thanks [@miguelg719](https://github.com/miguelg719)! - Add support for Microsoft's Fara-7B
- [#1346](#1346) [`4e051b2`](4e051b2) Thanks [@seanmcguire12](https://github.com/seanmcguire12)! - fix: don't attach to targets twice
- [#1327](#1327) [`6b5a3c9`](6b5a3c9) Thanks [@miguelg719](https://github.com/miguelg719)! - Informed error parsing from api
- [#1335](#1335) [`bb85ad9`](bb85ad9) Thanks [@seanmcguire12](https://github.com/seanmcguire12)! - add support for page.addInitScript()
- [#1331](#1331) [`88d28cc`](88d28cc) Thanks [@seanmcguire12](https://github.com/seanmcguire12)! - fix: page.evaluate() now works with scripts injected via context.addInitScript()
- [#1316](#1316) [`45bcef0`](45bcef0) Thanks [@tkattkat](https://github.com/tkattkat)! - Add support for callbacks in stagehand agent
- [#1374](#1374) [`6aa9d45`](6aa9d45) Thanks [@miguelg719](https://github.com/miguelg719)! - Fix key action mapping in Anthropic CUA
- [#1330](#1330) [`d382084`](d382084) Thanks [@seanmcguire12](https://github.com/seanmcguire12)! - fix: make act, extract, and observe respect user defined timeout param
- [#1336](#1336) [`1df08cc`](1df08cc) Thanks [@tkattkat](https://github.com/tkattkat)! - Patch agent on api
- [#1358](#1358) [`2b56600`](2b56600) Thanks [@tkattkat](https://github.com/tkattkat)! - Add support for 4.5 opus in cua agent

## @browserbasehq/[email protected]

### Patch Changes

- [#1364](#1364) [`ca0630e`](ca0630e) Thanks [@tkattkat](https://github.com/tkattkat)! - Update model handling in agent evals cli
- Updated dependencies \[[`605ed6b`](605ed6b), [`34e7e5b`](34e7e5b), [`943d2d7`](943d2d7), [`0e95cd2`](0e95cd2), [`d4237e4`](d4237e4), [`86975e7`](86975e7), [`d5e119b`](d5e119b), [`4e051b2`](4e051b2), [`6b5a3c9`](6b5a3c9), [`bb85ad9`](bb85ad9), [`88d28cc`](88d28cc), [`45bcef0`](45bcef0), [`6aa9d45`](6aa9d45), [`d382084`](d382084), [`1df08cc`](1df08cc), [`2b56600`](2b56600)]:
  - @browserbasehq/[email protected]

## @browserbasehq/[email protected]

### Patch Changes

- Updated dependencies \[[`605ed6b`](605ed6b), [`34e7e5b`](34e7e5b), [`943d2d7`](943d2d7), [`0e95cd2`](0e95cd2), [`d4237e4`](d4237e4), [`86975e7`](86975e7), [`d5e119b`](d5e119b), [`4e051b2`](4e051b2), [`6b5a3c9`](6b5a3c9), [`bb85ad9`](bb85ad9), [`88d28cc`](88d28cc), [`45bcef0`](45bcef0), [`6aa9d45`](6aa9d45), [`d382084`](d382084), [`1df08cc`](1df08cc), [`2b56600`](2b56600)]:
  - @browserbasehq/[email protected]

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>


tkattkat added 2 commits December 4, 2025 15:52

update the agent evals cli

6bef592

changeset

13b906c

greptile-apps bot reviewed Dec 4, 2025


cubic-dev-ai bot reviewed Dec 5, 2025


miguelg719 approved these changes Dec 5, 2025


miguelg719 added the observe label (These changes pertain to the observe function) Dec 5, 2025

tkattkat merged commit ca0630e into main Dec 5, 2025
36 checks passed
