Commit d870677

feat(web-evals): remember last Roo model selection
1 parent 424bce6 commit d870677

6 files changed, +416 -3 lines changed

.roo/skills/evals-context/SKILL.md

Lines changed: 188 additions & 0 deletions
@@ -0,0 +1,188 @@
+---
+name: evals-context
+description: Provides context about the Roo Code evals system structure in this monorepo. Use when tasks mention "evals", "evaluation", "eval runs", "eval exercises", or working with the evals infrastructure. Helps distinguish between the evals execution system (packages/evals, apps/web-evals) and the public website evals display page (apps/web-roo-code/src/app/evals).
+---
+
+# Evals Codebase Context
+
+## When to Use This Skill
+
+Use this skill when the task involves:
+
+- Modifying or debugging the evals execution infrastructure
+- Adding new eval exercises or languages
+- Working with the evals web interface (apps/web-evals)
+- Modifying the public evals display page on roocode.com
+- Understanding where evals code lives in this monorepo
+
+## When NOT to Use This Skill
+
+Do NOT use this skill when:
+
+- Working on unrelated parts of the codebase (extension, webview-ui, etc.)
+- The task is purely about the VS Code extension's core functionality
+- Working on the main website pages that don't involve evals
+
+## Key Disambiguation: Two "Evals" Locations
+
+This monorepo has **two distinct evals-related locations** that can cause confusion:
+
+| Component | Path | Purpose |
+| --------------------------- | -------------------------------------------------------------- | -------------------------------------------------------------- |
+| **Evals Execution System** | `packages/evals/` | Core eval infrastructure: CLI, DB schema, Docker configs |
+| **Evals Management UI** | `apps/web-evals/` | Next.js app for creating/monitoring eval runs (localhost:3446) |
+| **Website Evals Page** | `apps/web-roo-code/src/app/evals/` | Public roocode.com page displaying eval results |
+| **External Exercises Repo** | [Roo-Code-Evals](https://github.com/RooCodeInc/Roo-Code-Evals) | Actual coding exercises (NOT in this monorepo) |
+
+## Directory Structure Reference
+
+### `packages/evals/` - Core Evals Package
+
+```
+packages/evals/
+├── ARCHITECTURE.md           # Detailed architecture documentation
+├── ADDING-EVALS.md           # Guide for adding new exercises/languages
+├── README.md                 # Setup and running instructions
+├── docker-compose.yml        # Container orchestration
+├── Dockerfile.runner         # Runner container definition
+├── Dockerfile.web            # Web app container
+├── drizzle.config.ts         # Database ORM config
+├── src/
+│   ├── index.ts              # Package exports
+│   ├── cli/                  # CLI commands for running evals
+│   │   ├── runEvals.ts       # Orchestrates complete eval runs
+│   │   ├── runTask.ts        # Executes individual tasks in containers
+│   │   ├── runUnitTest.ts    # Validates task completion via tests
+│   │   └── redis.ts          # Redis pub/sub integration
+│   ├── db/
+│   │   ├── schema.ts         # Database schema (runs, tasks)
+│   │   ├── queries/          # Database query functions
+│   │   └── migrations/       # SQL migrations
+│   └── exercises/
+│       └── index.ts          # Exercise loading utilities
+└── scripts/
+    └── setup.sh              # Local macOS setup script
+```
+
+### `apps/web-evals/` - Evals Management Web App
+
+```
+apps/web-evals/
+├── src/
+│   ├── app/
+│   │   ├── page.tsx          # Home page (runs list)
+│   │   ├── runs/
+│   │   │   ├── new/          # Create new eval run
+│   │   │   └── [id]/         # View specific run status
+│   │   └── api/runs/         # SSE streaming endpoint
+│   ├── actions/              # Server actions
+│   │   ├── runs.ts           # Run CRUD operations
+│   │   ├── tasks.ts          # Task queries
+│   │   ├── exercises.ts      # Exercise listing
+│   │   └── heartbeat.ts      # Controller health checks
+│   ├── hooks/                # React hooks (SSE, models, etc.)
+│   └── lib/                  # Utilities and schemas
+```
+
+### `apps/web-roo-code/src/app/evals/` - Public Website Evals Page
+
+```
+apps/web-roo-code/src/app/evals/
+├── page.tsx                  # Fetches and displays public eval results
+├── evals.tsx                 # Main evals display component
+├── plot.tsx                  # Visualization component
+└── types.ts                  # EvalRun type (extends packages/evals types)
+```
+
+This page **displays** eval results on the public roocode.com website. It imports types from `@roo-code/evals` but does NOT run evals.
+
+## Architecture Overview
+
+The evals system is a distributed evaluation platform that runs AI coding tasks in isolated VS Code environments:
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│ Web App (apps/web-evals) ──────────────────────────────── │
+│ │ │
+│ ▼ │
+│ PostgreSQL ◄────► Controller Container │
+│ │ │ │
+│ ▼ ▼ │
+│ Redis ◄───► Runner Containers (1-25 parallel) │
+└─────────────────────────────────────────────────────────────┘
+```
+
+**Key components:**
+
+- **Controller**: Orchestrates eval runs, spawns runners, manages task queue (p-queue)
+- **Runner**: Isolated Docker container with VS Code + Roo Code extension + language runtimes
+- **Redis**: Pub/sub for real-time events (NOT task queuing)
+- **PostgreSQL**: Stores runs, tasks, metrics
+
+## Common Tasks Quick Reference
+
+### Adding a New Eval Exercise
+
+1. Add exercise to [Roo-Code-Evals](https://github.com/RooCodeInc/Roo-Code-Evals) repo (external)
+2. See [`packages/evals/ADDING-EVALS.md`](packages/evals/ADDING-EVALS.md) for structure
+
+### Modifying Eval CLI Behavior
+
+Edit files in [`packages/evals/src/cli/`](packages/evals/src/cli/):
+
+- [`runEvals.ts`](packages/evals/src/cli/runEvals.ts) - Run orchestration
+- [`runTask.ts`](packages/evals/src/cli/runTask.ts) - Task execution
+- [`runUnitTest.ts`](packages/evals/src/cli/runUnitTest.ts) - Test validation
+
+### Modifying the Evals Web Interface
+
+Edit files in [`apps/web-evals/src/`](apps/web-evals/src/):
+
+- [`app/runs/new/new-run.tsx`](apps/web-evals/src/app/runs/new/new-run.tsx) - New run form
+- [`actions/runs.ts`](apps/web-evals/src/actions/runs.ts) - Run server actions
+
+### Modifying the Public Evals Display Page
+
+Edit files in [`apps/web-roo-code/src/app/evals/`](apps/web-roo-code/src/app/evals/):
+
+- [`evals.tsx`](apps/web-roo-code/src/app/evals/evals.tsx) - Display component
+- [`plot.tsx`](apps/web-roo-code/src/app/evals/plot.tsx) - Charts
+
+### Database Schema Changes
+
+1. Edit [`packages/evals/src/db/schema.ts`](packages/evals/src/db/schema.ts)
+2. Generate migration: `cd packages/evals && pnpm drizzle-kit generate`
+3. Apply migration: `pnpm drizzle-kit migrate`
+
+## Running Evals Locally
+
+```bash
+# From repo root
+pnpm evals
+
+# Opens web UI at http://localhost:3446
+```
+
+**Ports (defaults):**
+
+- PostgreSQL: 5433
+- Redis: 6380
+- Web: 3446
+
+## Testing
+
+```bash
+# packages/evals tests
+cd packages/evals && npx vitest run
+
+# apps/web-evals tests
+cd apps/web-evals && npx vitest run
+```
+
+## Key Types/Exports from `@roo-code/evals`
+
+The package exports are defined in [`packages/evals/src/index.ts`](packages/evals/src/index.ts):
+
+- Database queries: `getRuns`, `getTasks`, `getTaskMetrics`, etc.
+- Schema types: `Run`, `Task`, `TaskMetrics`
+- Used by both `apps/web-evals` and `apps/web-roo-code`

apps/web-evals/src/app/runs/new/new-run.tsx

Lines changed: 56 additions & 3 deletions
@@ -48,6 +48,9 @@ import {
 } from "@/lib/schemas"
 import { cn } from "@/lib/utils"
 
+import { loadRooLastModelSelection, saveRooLastModelSelection } from "@/lib/roo-last-model-selection"
+import { normalizeCreateRunForSubmit } from "@/lib/normalize-create-run"
+
 import { useOpenRouterModels } from "@/hooks/use-open-router-models"
 import { useRooCodeCloudModels } from "@/hooks/use-roo-code-cloud-models"
 
@@ -147,6 +150,7 @@ export function NewRun() {
     })
 
     const {
+        register,
         setValue,
         clearErrors,
         watch,
@@ -155,6 +159,33 @@ export function NewRun() {
 
     const [suite, settings] = watch(["suite", "settings", "concurrency"])
 
+    const selectedModelIds = useMemo(
+        () => modelSelections.map((s) => s.model).filter((m) => m.length > 0),
+        [modelSelections],
+    )
+
+    const applyModelIds = useCallback(
+        (modelIds: string[]) => {
+            const unique = Array.from(new Set(modelIds.map((m) => m.trim()).filter((m) => m.length > 0)))
+
+            if (unique.length === 0) {
+                setModelSelections([{ id: crypto.randomUUID(), model: "", popoverOpen: false }])
+                setValue("model", "")
+                return
+            }
+
+            setModelSelections(unique.map((model) => ({ id: crypto.randomUUID(), model, popoverOpen: false })))
+            setValue("model", unique[0] ?? "")
+        },
+        [setValue],
+    )
+
+    // Ensure the `exercises` field is registered so RHF always includes it in submit values.
+    useEffect(() => {
+        register("exercises")
+    }, [register])
+
+    // Load settings from localStorage on mount
     useEffect(() => {
         const savedConcurrency = localStorage.getItem("evals-concurrency")
 
@@ -215,6 +246,24 @@ export function NewRun() {
         }
     }, [setValue])
 
+    // When switching to Roo provider, restore last-used selection if current selection is empty
+    useEffect(() => {
+        if (provider !== "roo") return
+        if (selectedModelIds.length > 0) return
+
+        const last = loadRooLastModelSelection()
+        if (last.length > 0) {
+            applyModelIds(last)
+        }
+    }, [applyModelIds, provider, selectedModelIds.length])
+
+    // Persist last-used Roo provider model selection
+    useEffect(() => {
+        if (provider !== "roo") return
+        saveRooLastModelSelection(selectedModelIds)
+    }, [provider, selectedModelIds])
+
+    // Extract unique languages from exercises
     const languages = useMemo(() => {
         if (!exercises.data) {
             return []
@@ -337,7 +386,10 @@ export function NewRun() {
     const onSubmit = useCallback(
         async (values: CreateRun) => {
             try {
-                if (provider === "roo" && !values.jobToken?.trim()) {
+                const baseValues = normalizeCreateRunForSubmit(values, selectedExercises, suite)
+
+                // Validate jobToken for Roo Code Cloud provider
+                if (provider === "roo" && !baseValues.jobToken?.trim()) {
                     toast.error("Roo Code Cloud Token is required")
                     return
                 }
@@ -374,8 +426,7 @@ export function NewRun() {
                         await new Promise((resolve) => setTimeout(resolve, 20_000))
                     }
 
-                    const runValues = { ...values }
-                    runValues.executionMethod = executionMethod
+                    const runValues = { ...baseValues }
 
                     if (provider === "openrouter") {
                         runValues.model = selection.model
@@ -424,6 +475,8 @@ export function NewRun() {
             }
         },
         [
+            suite,
+            selectedExercises,
             provider,
             executionMethod,
             modelSelections,
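
The two effects added to `new-run.tsx` above reduce to a small rule: restore a saved selection only when the Roo provider is active and nothing is selected, and persist only while the Roo provider is active. As a rough sketch, that rule can be written as pure functions, which makes it easy to unit test outside React. The names `modelIdsToRestore` and `shouldPersistSelection` are hypothetical and do not appear in the commit.

```typescript
// Hypothetical pure-function model of the restore/persist effects; not part of the commit.

// Mirrors the restore effect: return the saved model IDs only when the Roo
// provider is active, nothing is currently selected, and something was saved.
function modelIdsToRestore(provider: string, selected: string[], saved: string[]): string[] | null {
    if (provider !== "roo") return null
    if (selected.length > 0) return null
    return saved.length > 0 ? saved : null
}

// Mirrors the guard in the persist effect: only the Roo provider's selection is saved.
function shouldPersistSelection(provider: string): boolean {
    return provider === "roo"
}
```

For example, `modelIdsToRestore("roo", [], ["roo/model-a"])` yields the saved list, while any non-empty current selection or any other provider yields `null`.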
Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
+import { normalizeCreateRunForSubmit } from "../normalize-create-run"
+
+describe("normalizeCreateRunForSubmit", () => {
+    it("uses selectedExercises for partial suite", () => {
+        const result = normalizeCreateRunForSubmit(
+            {
+                model: "roo/model-a",
+                description: "",
+                suite: "partial",
+                exercises: [],
+                settings: undefined,
+                concurrency: 1,
+                timeout: 5,
+                iterations: 1,
+                jobToken: "",
+            },
+            ["js/foo", "py/bar"],
+        )
+
+        expect(result.suite).toBe("partial")
+        expect(result.exercises).toEqual(["js/foo", "py/bar"])
+    })
+
+    it("clears exercises for full suite", () => {
+        const result = normalizeCreateRunForSubmit(
+            {
+                model: "roo/model-a",
+                description: "",
+                suite: "full",
+                exercises: ["js/foo"],
+                settings: undefined,
+                concurrency: 1,
+                timeout: 5,
+                iterations: 1,
+                jobToken: "",
+            },
+            ["js/foo"],
+        )
+
+        expect(result.suite).toBe("full")
+        expect(result.exercises).toEqual([])
+    })
+})
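
The implementation of `normalizeCreateRunForSubmit` is not shown on this page, so the following is only a guess at behavior that would satisfy the two tests above and the three-argument call in `new-run.tsx`. The type `CreateRunLike`, the function name `normalizeCreateRunSketch`, and the optional `suiteOverride` parameter are assumptions; the real `CreateRun` type lives in `@/lib/schemas` and may differ.

```typescript
// Assumed shape, copied from the fields used in the tests above (hypothetical type).
type CreateRunLike = {
    model: string
    description: string
    suite: "full" | "partial"
    exercises: string[]
    settings: unknown
    concurrency: number
    timeout: number
    iterations: number
    jobToken: string
}

// Sketch only: a partial suite takes its exercises from the UI selection,
// a full suite always submits an empty exercise list.
function normalizeCreateRunSketch(
    values: CreateRunLike,
    selectedExercises: string[],
    suiteOverride?: "full" | "partial", // assumption: the third argument passed in new-run.tsx
): CreateRunLike {
    const suite = suiteOverride ?? values.suite
    return {
        ...values,
        suite,
        exercises: suite === "partial" ? [...selectedExercises] : [],
    }
}
```

Returning a new object rather than mutating `values` keeps the form state untouched, which fits how `onSubmit` spreads the result into `runValues`.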
Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
+import {
+    loadRooLastModelSelection,
+    ROO_LAST_MODEL_SELECTION_KEY,
+    saveRooLastModelSelection,
+} from "../roo-last-model-selection"
+
+class LocalStorageMock implements Storage {
+    private store = new Map<string, string>()
+
+    get length(): number {
+        return this.store.size
+    }
+
+    clear(): void {
+        this.store.clear()
+    }
+
+    getItem(key: string): string | null {
+        return this.store.get(key) ?? null
+    }
+
+    key(index: number): string | null {
+        return Array.from(this.store.keys())[index] ?? null
+    }
+
+    removeItem(key: string): void {
+        this.store.delete(key)
+    }
+
+    setItem(key: string, value: string): void {
+        this.store.set(key, value)
+    }
+}
+
+beforeEach(() => {
+    Object.defineProperty(globalThis, "localStorage", {
+        value: new LocalStorageMock(),
+        configurable: true,
+    })
+})
+
+describe("roo-last-model-selection", () => {
+    it("saves and loads (deduped + trimmed)", () => {
+        saveRooLastModelSelection([" roo/model-a ", "roo/model-a", "roo/model-b"])
+        expect(loadRooLastModelSelection()).toEqual(["roo/model-a", "roo/model-b"])
+    })
+
+    it("ignores invalid JSON", () => {
+        localStorage.setItem(ROO_LAST_MODEL_SELECTION_KEY, "{this is not json")
+        expect(loadRooLastModelSelection()).toEqual([])
+    })
+
+    it("clears when empty", () => {
+        localStorage.setItem(ROO_LAST_MODEL_SELECTION_KEY, JSON.stringify(["roo/model-a"]))
+        saveRooLastModelSelection([])
+        expect(localStorage.getItem(ROO_LAST_MODEL_SELECTION_KEY)).toBeNull()
+    })
+})
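
The `roo-last-model-selection` module itself is not shown on this page. The tests above pin down three behaviors: IDs are trimmed and deduplicated, invalid JSON loads as an empty list, and saving an empty list removes the key. A minimal sketch that would satisfy them might look like the following. The key value, the `StorageLike` interface, and the explicit `storage` parameter are assumptions made so the sketch runs outside a browser; the functions under test take no storage argument and presumably use the global `localStorage`.

```typescript
// Hypothetical key; the value of ROO_LAST_MODEL_SELECTION_KEY is not shown in the diff.
const SKETCH_KEY = "sketch:roo-last-model-selection"

// Minimal storage surface (assumption) so the sketch does not depend on DOM types.
interface StorageLike {
    getItem(key: string): string | null
    setItem(key: string, value: string): void
    removeItem(key: string): void
}

// Keep only non-empty strings, trimmed, in first-seen order without duplicates.
function cleanModelIds(ids: readonly unknown[]): string[] {
    const trimmed = ids
        .filter((id): id is string => typeof id === "string")
        .map((id) => id.trim())
        .filter((id) => id.length > 0)
    return Array.from(new Set(trimmed))
}

function saveSelectionSketch(storage: StorageLike, ids: string[]): void {
    const cleaned = cleanModelIds(ids)
    if (cleaned.length === 0) {
        // An empty selection clears the stored value instead of writing "[]".
        storage.removeItem(SKETCH_KEY)
        return
    }
    storage.setItem(SKETCH_KEY, JSON.stringify(cleaned))
}

function loadSelectionSketch(storage: StorageLike): string[] {
    const raw = storage.getItem(SKETCH_KEY)
    if (raw === null) return []
    try {
        const parsed: unknown = JSON.parse(raw)
        return Array.isArray(parsed) ? cleanModelIds(parsed) : []
    } catch {
        // Corrupt or hand-edited storage is treated as "nothing saved".
        return []
    }
}
```

Cleaning on both save and load means a value written by an older or different version of the page still loads as a valid list.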
