Update the agent evals cli #1364
🦋 Changeset detected
Latest commit: 13b906c
The changes in this PR will be included in the next version bump. This PR includes changesets to release 1 package.
Greptile Overview
Greptile Summary
This PR fixes model handling in the agent evals CLI to properly support provider-prefixed model names.
Key Changes:
Technical Details:
Confidence Score: 5/5
Important Files Changed
File Analysis
Sequence Diagram
sequenceDiagram
participant CLI as Eval CLI
participant Config as taskConfig.ts
participant IndexEval as index.eval.ts
participant InitV3 as initV3.ts
participant AgentProvider as AgentProvider.ts
CLI->>Config: getAgentModelEntries()
Config-->>CLI: [{modelName, cua: true/false}]
CLI->>IndexEval: Generate test cases
Note over IndexEval: For agent categories
IndexEval->>Config: getAgentModelEntries()
Config-->>IndexEval: AGENT_MODEL_ENTRIES
loop Each model entry
IndexEval->>IndexEval: Create testcase with isCUA flag
Note over IndexEval: input: {name, modelName, isCUA}
end
IndexEval->>InitV3: initV3({modelName, isCUA, ...})
alt isCUA is true
InitV3->>InitV3: Extract short model name
Note over InitV3: "anthropic/model" → "model"
InitV3->>AgentProvider: Lookup in modelToAgentProviderMap
AgentProvider-->>InitV3: Provider type (openai/anthropic/google)
InitV3->>InitV3: Create agent with cua: true
else isCUA is false
InitV3->>InitV3: Create agent without CUA
end
InitV3-->>IndexEval: V3InitResult with agent
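The CUA branch of the diagram above can be sketched roughly as follows. This is a minimal illustration, not the code from `initV3.ts` or `AgentProvider.ts`: the helper names `shortModelName` and `resolveCuaProvider`, the contents of the map, and the choice to split on the first `/` are all assumptions made for the example.

```typescript
// Hypothetical sketch of the "extract short model name, then look up provider" step.
// Only the names modelToAgentProviderMap, modelName, and cua come from the diagram;
// everything else is an assumption for illustration.
type AgentProviderType = "openai" | "anthropic" | "google";

interface AgentModelEntry {
  modelName: string; // may be provider-prefixed, e.g. "anthropic/model"
  cua: boolean;
}

// Placeholder contents; the real map lives in AgentProvider.ts.
const modelToAgentProviderMap: Record<string, AgentProviderType> = {
  "example-cua-model": "anthropic",
};

// "anthropic/model" -> "model"; names without a prefix pass through unchanged.
// Assumption: everything before the first "/" is the provider prefix.
function shortModelName(modelName: string): string {
  const slash = modelName.indexOf("/");
  return slash === -1 ? modelName : modelName.slice(slash + 1);
}

// Returns the provider type for CUA entries, or undefined for non-CUA entries,
// which skip the map lookup entirely.
function resolveCuaProvider(entry: AgentModelEntry): AgentProviderType | undefined {
  if (!entry.cua) return undefined;
  const shortName = shortModelName(entry.modelName);
  const provider = modelToAgentProviderMap[shortName];
  if (!provider) {
    throw new Error(`No CUA provider registered for model "${shortName}"`);
  }
  return provider;
}
```

Keeping the prefix stripping in one small helper means the lookup table only needs the short names, which is the behavior the note in the diagram describes.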
4 files reviewed, no comments
1 issue found across 5 files
Prompt for AI agents (all 1 issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="packages/evals/taskConfig.ts">
<violation number="1" location="packages/evals/taskConfig.ts:118">
P2: The model `anthropic/claude-sonnet-4-20250514` appears in both `AGENT_MODELS` and `AGENT_MODELS_CUA`, causing `DEFAULT_AGENT_MODELS` to contain duplicate entries. If this is intentional (to test the same model with both CUA modes), consider adding a comment. Otherwise, remove it from one of the arrays to avoid duplicate test runs.</violation>
</file>
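One way to make the overlap flagged above explicit is to key each entry by both model name and CUA mode, so the same model can appear once per mode but never twice in the same mode. The sketch below is an assumption-laden illustration, not the contents of `taskConfig.ts`: the function `buildAgentModelEntries`, the list contents other than `anthropic/claude-sonnet-4-20250514`, and the key format are hypothetical.

```typescript
// Hypothetical sketch: build {modelName, cua} entries from two lists and drop
// exact duplicates (same model AND same mode). Array contents are placeholders
// except for the model named in the review comment.
interface AgentModelEntry {
  modelName: string;
  cua: boolean;
}

const AGENT_MODELS: string[] = [
  "anthropic/claude-sonnet-4-20250514",
  "example/non-cua-model", // placeholder, not a real model name
];

const AGENT_MODELS_CUA: string[] = ["anthropic/claude-sonnet-4-20250514"];

function buildAgentModelEntries(nonCua: string[], cua: string[]): AgentModelEntry[] {
  const candidates: AgentModelEntry[] = [
    ...nonCua.map((modelName) => ({ modelName, cua: false })),
    ...cua.map((modelName) => ({ modelName, cua: true })),
  ];

  const seen = new Set<string>();
  const entries: AgentModelEntry[] = [];
  for (const entry of candidates) {
    // The key includes the mode, so a model listed in both arrays is kept
    // once as non-CUA and once as CUA.
    const key = `${entry.modelName}|cua=${entry.cua}`;
    if (seen.has(key)) continue;
    seen.add(key);
    entries.push(entry);
  }
  return entries;
}

const AGENT_MODEL_ENTRIES = buildAgentModelEntries(AGENT_MODELS, AGENT_MODELS_CUA);
```

With entries shaped this way, running the same model in both modes is a deliberate, visible choice rather than an accidental duplicate in a flat list of names.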
This PR was opened by the [Changesets release](https://github.com/changesets/action) GitHub action. When you're ready to do a release, you can merge this and the packages will be published to npm automatically. If you're not ready to do a release yet, that's fine, whenever you add more changesets to main, this PR will be updated.
# Releases
## @browserbasehq/[email protected]
### Patch Changes
- [#1388](#1388) [`605ed6b`](605ed6b) Thanks [@miguelg719](https://github.com/miguelg719)! - Fix multiple click event dispatches on CDP and Anthropic CUA handling (double clicks)
- [#1400](#1400) [`34e7e5b`](34e7e5b) Thanks [@seanmcguire12](https://github.com/seanmcguire12)! - don't write base64 encoded screenshots to disk when caching agent actions
- [#1345](#1345) [`943d2d7`](943d2d7) Thanks [@tkattkat](https://github.com/tkattkat)! - Add support for aborting / stopping an agent run & continuing an agent run using messages from prior runs
- [#1334](#1334) [`0e95cd2`](0e95cd2) Thanks [@tkattkat](https://github.com/tkattkat)! - Add support for google vertex provider
- [#1410](#1410) [`d4237e4`](d4237e4) Thanks [@seanmcguire12](https://github.com/seanmcguire12)! - fix: include extract in stagehand.history()
- [#1315](#1315) [`86975e7`](86975e7) Thanks [@tkattkat](https://github.com/tkattkat)! - Add streaming support to agent through stream:true in the agent config
- [#1304](#1304) [`d5e119b`](d5e119b) Thanks [@miguelg719](https://github.com/miguelg719)! - Add support for Microsoft's Fara-7B
- [#1346](#1346) [`4e051b2`](4e051b2) Thanks [@seanmcguire12](https://github.com/seanmcguire12)! - fix: don't attach to targets twice
- [#1327](#1327) [`6b5a3c9`](6b5a3c9) Thanks [@miguelg719](https://github.com/miguelg719)! - Informed error parsing from api
- [#1335](#1335) [`bb85ad9`](bb85ad9) Thanks [@seanmcguire12](https://github.com/seanmcguire12)! - add support for page.addInitScript()
- [#1331](#1331) [`88d28cc`](88d28cc) Thanks [@seanmcguire12](https://github.com/seanmcguire12)! - fix: page.evaluate() now works with scripts injected via context.addInitScript()
- [#1316](#1316) [`45bcef0`](45bcef0) Thanks [@tkattkat](https://github.com/tkattkat)! - Add support for callbacks in stagehand agent
- [#1374](#1374) [`6aa9d45`](6aa9d45) Thanks [@miguelg719](https://github.com/miguelg719)! - Fix key action mapping in Anthropic CUA
- [#1330](#1330) [`d382084`](d382084) Thanks [@seanmcguire12](https://github.com/seanmcguire12)! - fix: make act, extract, and observe respect user defined timeout param
- [#1336](#1336) [`1df08cc`](1df08cc) Thanks [@tkattkat](https://github.com/tkattkat)! - Patch agent on api
- [#1358](#1358) [`2b56600`](2b56600) Thanks [@tkattkat](https://github.com/tkattkat)! - Add support for 4.5 opus in cua agent
## @browserbasehq/[email protected]
### Patch Changes
- [#1364](#1364) [`ca0630e`](ca0630e) Thanks [@tkattkat](https://github.com/tkattkat)! - Update model handling in agent evals cli
- Updated dependencies \[[`605ed6b`](605ed6b), [`34e7e5b`](34e7e5b), [`943d2d7`](943d2d7), [`0e95cd2`](0e95cd2), [`d4237e4`](d4237e4), [`86975e7`](86975e7), [`d5e119b`](d5e119b), [`4e051b2`](4e051b2), [`6b5a3c9`](6b5a3c9), [`bb85ad9`](bb85ad9), [`88d28cc`](88d28cc), [`45bcef0`](45bcef0), [`6aa9d45`](6aa9d45), [`d382084`](d382084), [`1df08cc`](1df08cc), [`2b56600`](2b56600)]:
  - @browserbasehq/[email protected]
## @browserbasehq/[email protected]
### Patch Changes
- Updated dependencies \[[`605ed6b`](605ed6b), [`34e7e5b`](34e7e5b), [`943d2d7`](943d2d7), [`0e95cd2`](0e95cd2), [`d4237e4`](d4237e4), [`86975e7`](86975e7), [`d5e119b`](d5e119b), [`4e051b2`](4e051b2), [`6b5a3c9`](6b5a3c9), [`bb85ad9`](bb85ad9), [`88d28cc`](88d28cc), [`45bcef0`](45bcef0), [`6aa9d45`](6aa9d45), [`d382084`](d382084), [`1df08cc`](1df08cc), [`2b56600`](2b56600)]:
  - @browserbasehq/[email protected]

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
why
After the transition to v3, the model handling for agent evals was not updated to account for the new model formats.
what changed
test plan
Summary by cubic
Updated the agent evals CLI to support and correctly run both CUA and non-CUA agent models in v3. Fixes agent model parsing and enables mixed eval runs.
Written for commit 13b906c. Summary will update automatically on new commits.