Commit ca0630e
Update the agent evals cli (#1364)
# why
After the transition to v3, the model handling for agent evals was not
updated to account for the new model formats.
# what changed
- added an isCua flag and two separate model maps, so that a model can
be run either with CUA or without it
- adjusted model handling to parse CUA models properly
- added a tag to distinguish whether a run uses CUA or not
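The split described above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the list names, the model identifiers, the `AgentModelEntry` type, and the `tagsFor` helper are all hypothetical, and the real signature of `getAgentModelEntries` is not shown on this page.

```typescript
// Hypothetical sketch of "two model maps plus a cua flag".
// Model identifiers are placeholders, not real model names.
type AgentModelEntry = { modelName: string; cua: boolean };

// A model may appear in both lists if it can run in either mode.
const STANDARD_AGENT_MODELS: readonly string[] = [
  "provider-a/standard-model",
  "provider-b/dual-mode-model",
];
const CUA_AGENT_MODELS: readonly string[] = [
  "provider-b/dual-mode-model",
  "provider-b/computer-use-model",
];

// Flatten both lists into entries that carry the mode explicitly.
function getAgentModelEntries(): AgentModelEntry[] {
  return [
    ...STANDARD_AGENT_MODELS.map((modelName) => ({ modelName, cua: false })),
    ...CUA_AGENT_MODELS.map((modelName) => ({ modelName, cua: true })),
  ];
}

// Tag each testcase so CUA and non-CUA runs can be told apart in results.
function tagsFor(entry: AgentModelEntry): string[] {
  return [entry.modelName, entry.cua ? "cua" : "agent"];
}

const entries = getAgentModelEntries();
for (const entry of entries) {
  console.log(tagsFor(entry).join(" | "));
}
```

Carrying the flag on each entry, rather than inferring the mode from the model name, is what lets the same model be evaluated once per mode in a single run.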
# test plan
- tested evals for both CUA and non-CUA models
<!-- This is an auto-generated description by cubic. -->
---
## Summary by cubic
Updated the agent evals CLI to support and correctly run both CUA and
non-CUA agent models in v3. Fixes agent model parsing and enables mixed
eval runs.
- **New Features**
  - Split agent models into standard and CUA lists; added
    getAgentModelEntries with a cua flag.
  - Passed isCUA through EvalInput to initV3 and tasks; selects a safe
    internal model for handlers when CUA.
  - Improved provider lookup and error messages for CUA models using short
    names; testcases now tag models as "cua" or "agent".
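As a rough illustration of the last two points, the sketch below shows one way an `isCUA` flag on the eval input could drive both the handler model choice and the provider lookup. Everything here is an assumption made for the example: the shape of `EvalInput`, the fallback model, the short-name table, and the helper names `selectHandlerModel` and `resolveProvider` do not come from the commit, and what counts as a "safe internal model" in the real code is not shown on this page.

```typescript
// Hypothetical shape; the real EvalInput has other fields not shown here.
interface EvalInput {
  name: string;
  modelName: string;
  isCUA?: boolean;
}

// Assumed fallback used by non-agent handlers when the eval model is CUA-only.
const DEFAULT_INTERNAL_MODEL = "provider-a/standard-model";

// Assumed lookup table for CUA models referenced by short name (no provider prefix).
const CUA_SHORT_NAME_PROVIDERS = new Map<string, string>([
  ["computer-use-model", "provider-b"],
]);

// Pick the model handlers should use: the eval model itself, or a
// standard fallback when the run is a CUA run.
function selectHandlerModel(input: EvalInput): string {
  return input.isCUA ? DEFAULT_INTERNAL_MODEL : input.modelName;
}

// Resolve the provider from "provider/model", or from the short-name
// table for CUA models, and fail with a descriptive message otherwise.
function resolveProvider(modelName: string, isCUA: boolean): string {
  const slash = modelName.indexOf("/");
  if (slash > 0) {
    return modelName.slice(0, slash);
  }
  if (isCUA) {
    const provider = CUA_SHORT_NAME_PROVIDERS.get(modelName);
    if (provider !== undefined) {
      return provider;
    }
    const known = [...CUA_SHORT_NAME_PROVIDERS.keys()].join(", ");
    throw new Error(`Unknown CUA model "${modelName}". Known short names: ${known}`);
  }
  throw new Error(`Model "${modelName}" has no provider prefix (expected "provider/model")`);
}

const cuaInput: EvalInput = { name: "example_task", modelName: "computer-use-model", isCUA: true };
console.log(selectHandlerModel(cuaInput), resolveProvider(cuaInput.modelName, true));
```

Listing the known short names in the error message is one way to make a mistyped CUA model name easy to diagnose from the CLI output.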
<sup>Written for commit 13b906c.
Summary will update automatically on new commits.</sup>
<!-- End of auto-generated description by cubic. -->
1 parent 898e1f4 commit ca0630e
File tree
- .changeset
- packages/evals
- types

5 files changed, +77 -18 lines changed