@tkattkat tkattkat commented Dec 4, 2025

why

After the transition to v3, the model handling for agent evals was not updated to account for new model formats

what changed

  • added an isCUA flag and two separate model maps, so that models can be run either with or without CUA
  • adjusted model handling to properly parse CUA models
  • added a tag to distinguish whether a run is using CUA or not

test plan

  • tested evals for both CUA and non-CUA models

Summary by cubic

Updated the agent evals CLI to support and correctly run both CUA and non-CUA agent models in v3. Fixes agent model parsing and enables mixed eval runs.

  • New Features
    • Split agent models into standard and CUA lists; added getAgentModelEntries with a cua flag.
    • Passed isCUA through EvalInput to initV3 and tasks; selects a safe internal model for handlers when CUA.
    • Improved provider lookup and error messages for CUA models using short names; testcases now tag models as "cua" or "agent".
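
To make the split concrete, here is a minimal sketch of how two model lists could be merged into entries carrying a cua flag. The identifiers AGENT_MODELS, AGENT_MODELS_CUA, AGENT_MODEL_ENTRIES, AgentModelEntry and getAgentModelEntries are named in this PR, but the list contents and the function body below are assumptions for illustration, not the code in taskConfig.ts.

```typescript
// Sketch only: list contents are placeholders, not the PR's actual lists.
type AgentModelEntry = { modelName: string; cua: boolean };

// Models that run through the standard (non-CUA) agent path.
const AGENT_MODELS: string[] = ["anthropic/claude-sonnet-4-20250514"];

// Models that run with cua: true.
const AGENT_MODELS_CUA: string[] = [
  "anthropic/claude-sonnet-4-20250514",
  "example-provider/example-cua-model", // hypothetical model name
];

// One unified structure: every model name is paired with an explicit flag,
// so nothing downstream has to guess whether a model is a CUA model.
const AGENT_MODEL_ENTRIES: AgentModelEntry[] = [
  ...AGENT_MODELS.map((modelName) => ({ modelName, cua: false })),
  ...AGENT_MODELS_CUA.map((modelName) => ({ modelName, cua: true })),
];

function getAgentModelEntries(): AgentModelEntry[] {
  return AGENT_MODEL_ENTRIES;
}
```

With this shape, the same model name can appear once per mode, which is what allows mixed CUA and non-CUA runs in a single eval invocation.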

Written for commit 13b906c. Summary will update automatically on new commits.


changeset-bot bot commented Dec 4, 2025

🦋 Changeset detected

Latest commit: 13b906c

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
| Name | Type |
|---|---|
| @browserbasehq/stagehand-evals | Patch |



greptile-apps bot commented Dec 4, 2025

Greptile Overview

Greptile Summary

This PR fixes model handling in the agent evals CLI to properly support provider-prefixed model names (e.g., anthropic/claude-sonnet-4-20250514) in the v3 transition.

Key Changes:

  • Separated agent models into CUA and non-CUA categories with explicit isCUA flag instead of auto-detection
  • Added model name parsing in initV3.ts to extract short names from provider-prefixed formats before looking them up in modelToAgentProviderMap
  • Created AgentModelEntry type to track which models should run with cua: true
  • Updated test case generation to properly pass the isCUA flag through to initialization
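
As a rough illustration of the last point, the sketch below builds one test case per task and model entry and carries the flag through as isCUA. The input shape {name, modelName, isCUA} and the "cua" / "agent" tags come from this PR's summaries; the Testcase type, the helper name buildAgentTestcases and the exact tag layout are assumptions.

```typescript
type AgentModelEntry = { modelName: string; cua: boolean };

// Hypothetical test case shape; only the input fields are taken from the PR.
interface Testcase {
  input: { name: string; modelName: string; isCUA: boolean };
  name: string;
  tags: string[];
}

// Hypothetical helper: one test case per (task, model entry) pair.
function buildAgentTestcases(
  taskNames: string[],
  entries: AgentModelEntry[],
): Testcase[] {
  const testcases: Testcase[] = [];
  for (const taskName of taskNames) {
    for (const entry of entries) {
      testcases.push({
        input: { name: taskName, modelName: entry.modelName, isCUA: entry.cua },
        name: taskName,
        // The mode tag lets CUA and non-CUA runs of the same model be told apart.
        tags: [entry.modelName, entry.cua ? "cua" : "agent", taskName],
      });
    }
  }
  return testcases;
}
```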

Technical Details:
The old code attempted to check if a model was CUA using modelName in modelToAgentProviderMap, which failed for provider-prefixed names. The new code explicitly passes the isCUA flag from the configuration and extracts the short model name (e.g., claude-sonnet-4-20250514) before the lookup, matching the pattern already used in AgentProvider.getAgentProvider().
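
The failure mode and the fix can be sketched as follows. The map name modelToAgentProviderMap is from the PR; its contents here are a one-entry illustrative subset, and the helper names toShortModelName and lookupCuaProvider are hypothetical rather than the functions in initV3.ts.

```typescript
type AgentProviderType = "openai" | "anthropic" | "google";

// Illustrative subset; the real map lives in the Stagehand agent code.
const modelToAgentProviderMap: Record<string, AgentProviderType> = {
  "claude-sonnet-4-20250514": "anthropic",
};

// Strip a leading "provider/" prefix if present; leave bare names untouched.
function toShortModelName(modelName: string): string {
  const slash = modelName.indexOf("/");
  return slash === -1 ? modelName : modelName.slice(slash + 1);
}

// Hypothetical helper: resolve the provider for a CUA model by its short name.
function lookupCuaProvider(modelName: string): AgentProviderType {
  const shortName = toShortModelName(modelName);
  if (!Object.prototype.hasOwnProperty.call(modelToAgentProviderMap, shortName)) {
    throw new Error(
      `No CUA provider registered for "${shortName}" (from "${modelName}")`,
    );
  }
  return modelToAgentProviderMap[shortName];
}

// The old-style check misses the prefixed name because the map is keyed by short name.
const prefixed = "anthropic/claude-sonnet-4-20250514";
const oldCheck = prefixed in modelToAgentProviderMap; // false
const newProvider = lookupCuaProvider(prefixed); // "anthropic"
```

Because the lookup no longer doubles as CUA detection, a prefixed name that misses the map produces an explicit error instead of silently falling back to the non-CUA path.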

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The changes are well-structured and fix a clear bug in model name handling. The logic correctly mirrors the existing pattern in AgentProvider.getAgentProvider() for parsing provider-prefixed model names. The separation of CUA and non-CUA models is explicit and clear. The PR has been tested with both CUA and non-CUA evals according to the test plan.
  • No files require special attention

Important Files Changed

File Analysis

| Filename | Score | Overview |
|---|---|---|
| packages/evals/taskConfig.ts | 5/5 | Split agent models into CUA and non-CUA lists, created unified AGENT_MODEL_ENTRIES structure with cua flags, exported getAgentModelEntries function |
| packages/evals/index.eval.ts | 5/5 | Updated test case generation to use AgentModelEntry objects with cua flags, passed isCUA to initV3, added CUA-specific tags to test cases |
| packages/evals/initV3.ts | 5/5 | Changed from auto-detecting CUA models via modelToAgentProviderMap to explicit isCUA parameter, added model name parsing to extract short name from provider-prefixed models (e.g., anthropic/claude-sonnet-4-20250514 → claude-sonnet-4-20250514) |

Sequence Diagram

sequenceDiagram
    participant CLI as Eval CLI
    participant Config as taskConfig.ts
    participant IndexEval as index.eval.ts
    participant InitV3 as initV3.ts
    participant AgentProvider as AgentProvider.ts
    
    CLI->>Config: getAgentModelEntries()
    Config-->>CLI: [{modelName, cua: true/false}]
    
    CLI->>IndexEval: Generate test cases
    Note over IndexEval: For agent categories
    IndexEval->>Config: getAgentModelEntries()
    Config-->>IndexEval: AGENT_MODEL_ENTRIES
    
    loop Each model entry
        IndexEval->>IndexEval: Create testcase with isCUA flag
        Note over IndexEval: input: {name, modelName, isCUA}
    end
    
    IndexEval->>InitV3: initV3({modelName, isCUA, ...})
    
    alt isCUA is true
        InitV3->>InitV3: Extract short model name
        Note over InitV3: "anthropic/model" → "model"
        InitV3->>AgentProvider: Lookup in modelToAgentProviderMap
        AgentProvider-->>InitV3: Provider type (openai/anthropic/google)
        InitV3->>InitV3: Create agent with cua: true
    else isCUA is false
        InitV3->>InitV3: Create agent without CUA
    end
    
    InitV3-->>IndexEval: V3InitResult with agent
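
The alt branch in the diagram can be approximated with the sketch below, which only computes a plain description of what would be created and calls no Stagehand APIs. Everything in it is an assumption made for illustration: the InitPlan shape, the planAgentInit helper, the INTERNAL_HANDLER_MODEL placeholder, and the choice to reuse the eval model for handlers in the non-CUA case. The PR summary says only that a safe internal model is selected for handlers when CUA is enabled.

```typescript
interface EvalInput {
  name: string;
  modelName: string;
  isCUA?: boolean;
}

// Hypothetical description of what initialization would set up.
interface InitPlan {
  handlerModel: string;
  agent: { cua: boolean; model: string };
}

// Placeholder, not a real model name.
const INTERNAL_HANDLER_MODEL = "example-provider/example-handler-model";

// Hypothetical helper mirroring the diagram's "isCUA is true / false" branch.
function planAgentInit(input: EvalInput): InitPlan {
  const isCUA = input.isCUA === true;
  return {
    // Assumption: a CUA model is not used for the ordinary handlers,
    // so an internal model is substituted only in the CUA branch.
    handlerModel: isCUA ? INTERNAL_HANDLER_MODEL : input.modelName,
    agent: { cua: isCUA, model: input.modelName },
  };
}
```

The point of the explicit flag is that the branch depends only on the test case input, not on whether a map lookup happens to succeed.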
@greptile-apps greptile-apps bot left a comment

4 files reviewed, no comments

@cubic-dev-ai cubic-dev-ai bot left a comment

1 issue found across 5 files

Prompt for AI agents (all 1 issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="packages/evals/taskConfig.ts">

<violation number="1" location="packages/evals/taskConfig.ts:118">
P2: The model `anthropic/claude-sonnet-4-20250514` appears in both `AGENT_MODELS` and `AGENT_MODELS_CUA`, causing `DEFAULT_AGENT_MODELS` to contain duplicate entries. If this is intentional (to test the same model with both CUA modes), consider adding a comment. Otherwise, remove it from one of the arrays to avoid duplicate test runs.</violation>
</file>
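
A small sketch of the duplication the reviewer describes, assuming DEFAULT_AGENT_MODELS is built by concatenating the two lists (the PR text implies this but does not show it). The list contents and the variable names other than AGENT_MODELS and AGENT_MODELS_CUA are illustrative.

```typescript
// Illustrative contents: the one model the review says appears in both lists.
const AGENT_MODELS: string[] = ["anthropic/claude-sonnet-4-20250514"];
const AGENT_MODELS_CUA: string[] = ["anthropic/claude-sonnet-4-20250514"];

// Plain concatenation repeats the shared name.
const concatenated: string[] = [...AGENT_MODELS, ...AGENT_MODELS_CUA];

// If only the names matter, a Set removes the repeat.
const uniqueNames: string[] = Array.from(new Set(concatenated));

// If the same model is meant to run in both modes, key on name plus mode,
// so the two entries stay distinct on purpose.
const keyedEntries: string[] = [
  ...AGENT_MODELS.map((m) => `${m}|agent`),
  ...AGENT_MODELS_CUA.map((m) => `${m}|cua`),
];
```

Which option fits depends on whether running the model in both modes is intentional, which is exactly the question the review comment raises.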


@miguelg719 added the "observe" label (these changes pertain to the observe function) Dec 5, 2025
@tkattkat tkattkat merged commit ca0630e into main Dec 5, 2025
36 checks passed
miguelg719 pushed a commit that referenced this pull request Dec 13, 2025
This PR was opened by the [Changesets
release](https://github.com/changesets/action) GitHub action. When
you're ready to do a release, you can merge this and the packages will
be published to npm automatically. If you're not ready to do a release
yet, that's fine, whenever you add more changesets to main, this PR will
be updated.


# Releases
## @browserbasehq/[email protected]

### Patch Changes

- [#1388](#1388)
[`605ed6b`](605ed6b)
Thanks [@miguelg719](https://github.com/miguelg719)! - Fix multiple
click event dispatches on CDP and Anthropic CUA handling (double clicks)

- [#1400](#1400)
[`34e7e5b`](34e7e5b)
Thanks [@seanmcguire12](https://github.com/seanmcguire12)! - don't write
base64 encoded screenshots to disk when caching agent actions

- [#1345](#1345)
[`943d2d7`](943d2d7)
Thanks [@tkattkat](https://github.com/tkattkat)! - Add support for
aborting / stopping an agent run & continuing an agent run using
messages from prior runs

- [#1334](#1334)
[`0e95cd2`](0e95cd2)
Thanks [@tkattkat](https://github.com/tkattkat)! - Add support for
google vertex provider

- [#1410](#1410)
[`d4237e4`](d4237e4)
Thanks [@seanmcguire12](https://github.com/seanmcguire12)! - fix:
include extract in stagehand.history()

- [#1315](#1315)
[`86975e7`](86975e7)
Thanks [@tkattkat](https://github.com/tkattkat)! - Add streaming support
to agent through stream:true in the agent config

- [#1304](#1304)
[`d5e119b`](d5e119b)
Thanks [@miguelg719](https://github.com/miguelg719)! - Add support for
Microsoft's Fara-7B

- [#1346](#1346)
[`4e051b2`](4e051b2)
Thanks [@seanmcguire12](https://github.com/seanmcguire12)! - fix: don't
attach to targets twice

- [#1327](#1327)
[`6b5a3c9`](6b5a3c9)
Thanks [@miguelg719](https://github.com/miguelg719)! - Informed error
parsing from api

- [#1335](#1335)
[`bb85ad9`](bb85ad9)
Thanks [@seanmcguire12](https://github.com/seanmcguire12)! - add support
for page.addInitScript()

- [#1331](#1331)
[`88d28cc`](88d28cc)
Thanks [@seanmcguire12](https://github.com/seanmcguire12)! - fix:
page.evaluate() now works with scripts injected via
context.addInitScript()

- [#1316](#1316)
[`45bcef0`](45bcef0)
Thanks [@tkattkat](https://github.com/tkattkat)! - Add support for
callbacks in stagehand agent

- [#1374](#1374)
[`6aa9d45`](6aa9d45)
Thanks [@miguelg719](https://github.com/miguelg719)! - Fix key action
mapping in Anthropic CUA

- [#1330](#1330)
[`d382084`](d382084)
Thanks [@seanmcguire12](https://github.com/seanmcguire12)! - fix: make
act, extract, and observe respect user defined timeout param

- [#1336](#1336)
[`1df08cc`](1df08cc)
Thanks [@tkattkat](https://github.com/tkattkat)! - Patch agent on api

- [#1358](#1358)
[`2b56600`](2b56600)
Thanks [@tkattkat](https://github.com/tkattkat)! - Add support for 4.5
opus in cua agent

## @browserbasehq/[email protected]

### Patch Changes

- [#1364](#1364)
[`ca0630e`](ca0630e)
Thanks [@tkattkat](https://github.com/tkattkat)! - Update model handling
in agent evals cli

- Updated dependencies
\[[`605ed6b`](605ed6b),
[`34e7e5b`](34e7e5b),
[`943d2d7`](943d2d7),
[`0e95cd2`](0e95cd2),
[`d4237e4`](d4237e4),
[`86975e7`](86975e7),
[`d5e119b`](d5e119b),
[`4e051b2`](4e051b2),
[`6b5a3c9`](6b5a3c9),
[`bb85ad9`](bb85ad9),
[`88d28cc`](88d28cc),
[`45bcef0`](45bcef0),
[`6aa9d45`](6aa9d45),
[`d382084`](d382084),
[`1df08cc`](1df08cc),
[`2b56600`](2b56600)]:
    -   @browserbasehq/[email protected]

## @browserbasehq/[email protected]

### Patch Changes

- Updated dependencies
\[[`605ed6b`](605ed6b),
[`34e7e5b`](34e7e5b),
[`943d2d7`](943d2d7),
[`0e95cd2`](0e95cd2),
[`d4237e4`](d4237e4),
[`86975e7`](86975e7),
[`d5e119b`](d5e119b),
[`4e051b2`](4e051b2),
[`6b5a3c9`](6b5a3c9),
[`bb85ad9`](bb85ad9),
[`88d28cc`](88d28cc),
[`45bcef0`](45bcef0),
[`6aa9d45`](6aa9d45),
[`d382084`](d382084),
[`1df08cc`](1df08cc),
[`2b56600`](2b56600)]:
    -   @browserbasehq/[email protected]

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
michaelfp930-WB added a commit to michaelfp930-WB/stagehand that referenced this pull request Jan 12, 2026