A local web application with a chat interface for interacting with an LLM and a visualization panel that displays insights based on the responses.
Run conversations — after a few turns of a simulated eval, the right panel shows live mental-model scores and a chart of how they evolve:
Explore conversations — browse saved Spiral-Bench eval runs turn by turn with the mental model JSON per turn:
Chart view — average mental-model scores across 20 turns per scenario category; this run shows a clear upward trend in validation_seeking and user_rightness:
- 💬 Chat interface for LLM conversations
- 📊 Visualization panel with live mental-model scores and per-turn chart
- 🧠 Multiple mental model types: Induct, Support, Structured, Person Perception
- 🔁 Simulated eval runner (Spiral-Bench, 30 scenarios × 20 turns)
- 📂 Explore saved runs: chatlog and chart views
- 🔌 Azure OpenAI (GPT-4o), Google Gemini, and Llama (Vertex AI) support
- Install dependencies:
  npm install
- Start the development server:
  npm run dev
- Open your browser to the URL shown in the terminal (usually http://localhost:5173)
The app supports Azure OpenAI (GPT-4o) and Google Gemini. You can switch the API model in the UI (dropdown “API model”) or in the terminal with --api_gemini or --api_gpt-4o when running npm run run_eval.
- Copy the example environment file (if you have one) or create a .env file in the project root.
- Add your keys to .env:
Azure OpenAI (GPT-4o):
VITE_AZURE_ENDPOINT=your-actual-endpoint-here
VITE_AZURE_API_KEY=your-actual-api-key-here
VITE_AZURE_DEPLOYMENT=gpt-4o
VITE_AZURE_API_VERSION=2024-12-01-preview
Google Gemini (optional; use when API model = Gemini):
- Browser / API key: Get an API key from Google AI Studio and set:
  VITE_GEMINI_API_KEY=your-gemini-api-key-here
- CLI / Node (Vertex AI with service account): Use a Google Cloud service account JSON key and set:
  GOOGLE_APPLICATION_CREDENTIALS=./new_key.json # or another path to your JSON
  VITE_GEMINI_PROJECT_ID=your-gcp-project-id # optional if JSON has project_id
  Optional: VITE_GEMINI_LOCATION=us-central1, VITE_GEMINI_MODEL=gemini-1.5-flash (or e.g. gemini-2.5-pro).
Default provider (optional):
# VITE_API_PROVIDER=gpt-4o
# or
# VITE_API_PROVIDER=gemini
- Replace placeholder values with your actual keys.
- Restart the development server for the changes to take effect.
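The provider choice and the placeholder check described above can be sketched as a pair of small functions. This is only an illustration under stated assumptions: `resolveProvider` and `findUnsetKeys` are hypothetical names, not functions from the project; the precedence of the `--api_*` flag over `VITE_API_PROVIDER` and the `gpt-4o` fallback are assumed; and the Gemini entry covers only the browser/API-key path.

```javascript
// Hypothetical helpers for illustration; they are not taken from the project source.

// Variables each provider needs, as listed in the .env section above.
// For Gemini this covers the browser/API-key path only (assumption).
const REQUIRED_KEYS = {
  "gpt-4o": [
    "VITE_AZURE_ENDPOINT",
    "VITE_AZURE_API_KEY",
    "VITE_AZURE_DEPLOYMENT",
    "VITE_AZURE_API_VERSION",
  ],
  gemini: ["VITE_GEMINI_API_KEY"],
};

// Pick the provider: a --api_<name> flag wins over VITE_API_PROVIDER (assumed
// precedence), and "gpt-4o" is an assumed fallback when neither is set.
function resolveProvider(argv, env) {
  const flag = argv.find((arg) => arg.startsWith("--api_"));
  const provider = flag
    ? flag.slice("--api_".length)
    : env.VITE_API_PROVIDER || "gpt-4o";
  if (!Object.keys(REQUIRED_KEYS).includes(provider)) {
    throw new Error(`Unknown API provider: ${provider}`);
  }
  return provider;
}

// Return the required keys that are missing, empty, or still set to a
// "your-..." placeholder copied from the example values.
function findUnsetKeys(provider, env) {
  return REQUIRED_KEYS[provider].filter((key) => {
    const value = env[key];
    return !value || value.startsWith("your-");
  });
}

// Usage: pass process.env in Node, or import.meta.env in Vite browser code.
const exampleEnv = {
  VITE_API_PROVIDER: "gemini",
  VITE_GEMINI_API_KEY: "your-gemini-api-key-here",
};
const chosen = resolveProvider(["--single_call", "--model_induct"], exampleEnv);
console.log(chosen, findUnsetKeys(chosen, exampleEnv));
```

Taking the environment as a plain object keeps the check usable from both the CLI and the browser, since only the caller knows where the values come from.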
When you run evals (from the UI or via npm run run_eval), results are written under the data/ folder. Every saved JSON now includes api_model (e.g. "gpt-4o" or "gemini") so you know which API produced the run.
| Mode | Command / UI | Save location |
|---|---|---|
| Single-call | --single_call --model_induct (or --model_support_2) | data/single_call/<model>/run_<api_model>_<N>/ → e.g. run_gemini_1, run_gpt-4o_2. One folder per run; inside it, one JSON per scenario (e.g. spiral_tropes/sc01.json). |
| Separate call | --separate_call --convo_<N> --model_induct | data/separate_call/convo_<N>/<model>/run_<api_model>_<N>/ → same structure (e.g. run_gemini_1). |
| Generate convos | --generate_convo | data/separate_call/convo_<N>/ → only user/assistant turns (no mental model); used later by separate_call. |
| Human data | --human_data --model_induct --filename do_not_upload/h01.json | data/do_not_upload/<filename_no_ext>/<filename_no_ext>_<api_model>_<mental_model_type>.json → e.g. h01_gemini_induct.json, h01_gpt-4o_induct.json. One file per (source, api model, mental model); re-running overwrites/resumes that file. |
| Backfill empty | --backfill_empty --model_induct --file <path> | Overwrites the given file, filling in missing mental models and updating meta.api_model. |
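The save locations in the table follow a regular pattern, which the sketch below spells out as string templates. The function names `runFolder` and `humanDataFile` are hypothetical, and the value used for `<model>` (here `induct`) is an assumption about how the folder is named; the real runner may derive it differently.

```javascript
// Hypothetical helpers mirroring the folder layout in the table above.
// They only build path strings; nothing is read from or written to disk.

// Folder for a single_call or separate_call run.
function runFolder({ mode, model, apiModel, runNumber, convo }) {
  const runId = `run_${apiModel}_${runNumber}`;
  if (mode === "single_call") {
    return `data/single_call/${model}/${runId}/`;
  }
  if (mode === "separate_call") {
    return `data/separate_call/convo_${convo}/${model}/${runId}/`;
  }
  throw new Error(`Unsupported mode: ${mode}`);
}

// Output file for a human data run, given the --filename argument.
function humanDataFile(filename, apiModel, mentalModelType) {
  // "do_not_upload/h01.json" -> "h01"
  const base = filename.split("/").pop().replace(/\.json$/, "");
  return `data/do_not_upload/${base}/${base}_${apiModel}_${mentalModelType}.json`;
}

console.log(runFolder({ mode: "single_call", model: "induct", apiModel: "gemini", runNumber: 1 }));
console.log(humanDataFile("do_not_upload/h01.json", "gpt-4o", "induct"));
```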
- Scenario runs (single_call / separate_call): Each file has category, prompt_id, categoryInjection, extraInjection, api_model, turns, and situation_log. Each turn has turnIndex, userMessage, assistantMessage, and mentalModel.
- Human data runs: Top-level meta (includes api_model, source, mentalModelType, turns_recorded_up_to, etc.) and turns (same turn shape as above).
So you can always see which API model was used for a run by checking api_model in the saved JSON.
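Because the two file shapes store `api_model` in different places, a reader of saved runs needs to check both. A minimal sketch, assuming the field layout described above; `getApiModel` is a hypothetical helper and the sample objects contain made-up values for illustration only.

```javascript
// Hypothetical helper: return the API model recorded in a saved run,
// or null when the file predates the api_model field or lacks it.
function getApiModel(run) {
  // Human data runs keep api_model under the top-level meta object.
  if (run.meta && typeof run.meta.api_model === "string") {
    return run.meta.api_model;
  }
  // Scenario runs (single_call / separate_call) keep it at the top level.
  if (typeof run.api_model === "string") {
    return run.api_model;
  }
  return null;
}

// Illustrative objects only; real files carry more fields.
const scenarioRun = {
  category: "spiral_tropes",
  prompt_id: "sc01",
  api_model: "gemini",
  turns: [{ turnIndex: 0, userMessage: "hi", assistantMessage: "hello", mentalModel: {} }],
};
const humanRun = {
  meta: { api_model: "gpt-4o", source: "h01", mentalModelType: "induct" },
  turns: [],
};

console.log(getApiModel(scenarioRun), getApiModel(humanRun));
```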
Single-call runs: If a run fails mid-way (e.g. API error, rate limit, or OAuth error), re-run the same command with --resume_run <runId>. Use the same --api_*, --seed, and --prior (if you used them) as the original run. The script loads existing scenario JSONs from the run folder, skips scenarios that already have 20 turns, and continues from the first incomplete scenario (re-running that scenario from the beginning, then the rest).
Example (run folder run_gemini_3_prior):
npm run run_eval -- --single_call --model_induct --resume_run run_gemini_3_prior --api_gemini --prior
If you used a seed, add it (e.g. --seed 42). The run ID is the folder name under data/single_call/<model>/, e.g. run_gemini_3_prior, run_gpt-4o_2.
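The single-call resume rule (skip scenarios that already have 20 turns, re-run everything else from the beginning) can be sketched as a pure planning step. This is an illustration rather than the script's actual code: `planResume` is a hypothetical name, and the input is assumed to be the ordered scenario IDs plus a map of whatever scenario JSONs were already loaded from the run folder.

```javascript
// Number of turns a finished scenario has, per the description above.
const EXPECTED_TURNS = 20;

// Hypothetical helper: split the ordered scenario list into scenarios to skip
// (already complete) and scenarios to re-run from their first turn.
function planResume(scenarioIds, savedByScenario) {
  const skip = [];
  const rerun = [];
  for (const id of scenarioIds) {
    const saved = savedByScenario[id];
    const turnCount = saved && Array.isArray(saved.turns) ? saved.turns.length : 0;
    if (turnCount >= EXPECTED_TURNS) {
      skip.push(id);
    } else {
      // Partial scenarios are not continued mid-way; they start over.
      rerun.push(id);
    }
  }
  return { skip, rerun };
}

// Usage with made-up data: sc01 finished, sc02 stopped at turn 7, sc03 never started.
const plan = planResume(["sc01", "sc02", "sc03"], {
  sc01: { turns: new Array(20).fill({}) },
  sc02: { turns: new Array(7).fill({}) },
});
console.log(plan);
```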
Human data runs: The CLI writes a checkpoint file after each turn. Re-run the same --human_data --filename ... command; it will detect the checkpoint and resume from the next turn.
├── src/
│ ├── components/
│ │ ├── ChatInterface.jsx # Chat UI
│ │ ├── ExploreConversations.jsx # Browse saved eval runs (chatlog + chart)
│ │ └── VisualizationPanel.jsx # Mental model scores + per-turn chart
│ ├── eval/
│ │ ├── categories.js # Spiral-Bench category injections
│ │ ├── default_prompt.js # Seeker LLM system prompt
│ │ ├── injections.js # Per-scenario extra injections
│ │ ├── mental_model_prompts.js # Prompt builders + response parsers
│ │ └── scenarios.js # Spiral-Bench scenario list
│ ├── services/
│ │ └── api.js # LLM API calls (Azure, Gemini, Llama)
│ ├── App.jsx # Root component + eval orchestration
│ └── main.jsx # Entry point
├── scripts/
│ ├── run_eval.js # CLI eval runner (npm run run_eval)
│ └── generate-spiral-manifest.js # Rebuild public data manifest
├── data/ # Saved eval run JSONs (gitignored)
├── screenshots/ # README screenshots
├── index.html
├── package.json
└── vite.config.js # Dev server + data middleware + build plugin
- Mental model types: Add or modify prompts and parsers in src/eval/mental_model_prompts.js
- Visualization Panel: Edit VisualizationPanel.jsx to add new score series or chart types
- Eval scenarios: Swap in different scenarios via src/eval/scenarios.js and src/eval/categories.js
- API providers: Add new LLM backends in src/services/api.js


