openclaw-eval

Evaluate OpenClaw responses using multi-round conversations from txt files or LoCoMo JSON datasets.

Run

uv sync
uv run eval.py ingest ./locomo10_small.json --output output/trial.txt --tail "[remember what's said, keep existing memory]"
uv run eval.py qa ./locomo10_small.json --output output/answers.txt --count 10
export OPENAI_API_KEY=sk-proj-Bc..
uv run judge.py output/answers.txt.json
uv run judge.py output/answers.txt.json \
      --base-url https://ark.cn-beijing.volces.com/api/v3 \
      --token 024341c1-... \
      --model doubao-seed-2-0-pro-260215

Dataset: LoCoMo10

locomo10.json contains 10 conversation samples with:

Metric	Total
Conversations (samples)	10
Sessions	272
Messages	5,882
Images	910
QA pairs	1,986

Structure

locomo10.json  ->  list of samples
  sample
    ├── sample_id          # e.g. "conv-26"
    ├── conversation
    │   ├── speaker_a / speaker_b
    │   ├── session_N_date_time
    │   └── session_N      # list of messages
    │       └── { speaker, dia_id, text, img_url?, blip_caption?, query? }
    ├── qa                 # list of { question, answer, evidence, category }
    ├── event_summary
    ├── observation
    └── session_summary

Two Modes

`ingest` - Load conversations into openclaw

Sends conversation sessions to openclaw to build up memory/context. All sessions within a sample share one user UUID so context accumulates.

# Ingest specific sample, sessions 1-4
uv run python eval.py ingest ./locomo10.json --sample 0 --sessions 1-4

# Ingest with output log
uv run python eval.py ingest ./locomo10.json --sample 0 --sessions 1-4 --output ingest.txt

# Original txt mode
uv run python eval.py ingest example.txt --output output.txt

Ingest prints the user UUID for each sample:

=== Sample conv-26 ===
    user: f5d94f68-24aa-4bcc-a667-9ab11bf5edab
    4 session(s) to ingest

Each session is bundled into a single user message:

[group chat conversation]

Caroline: Hey Mel! Good to see you!

Melanie: Hey Caroline! What's up?

Caroline: The transgender stories were so inspiring!
https://example.com/image.jpg: a photo of a dog walking past a wall

[]

`qa` - Run QA evaluation

Sends QA questions to openclaw and records responses alongside expected answers. No ingestion - requires --user UUID from a prior ingest run.

# Run all QAs for sample 0 using user from ingest
uv run python eval.py qa ./locomo10.json --sample 0 --user f5d94f68-... --output qa_results.txt

# Run first 10 QAs only
uv run python eval.py qa ./locomo10.json --sample 0 --user f5d94f68-... --count 10 --output qa_results.txt

Output format:

=== [conv-26] Q1 Category 5 ===
[question] What did Caroline realize after her charity race?
[expected] self-care is important
[response] <openclaw's response>
[evidence] D2:3

Typical Workflow

# Step 1: ingest conversations (note the user UUID printed)
uv run python eval.py ingest ./locomo10.json --sample 0 --sessions 1-4

# Step 2: run QA using that user UUID
uv run python eval.py qa ./locomo10.json --sample 0 --user <UUID> --output qa_results.txt

CLI Options

Flag	Default	Description
`mode` (positional)	required	`ingest` or `qa`
`input` (positional)	required	Path to `.txt` or `.json` input file
`--output`	none	Path to output file. Omit to skip writing
`--base-url`	`http://127.0.0.1:18789`	OpenClaw gateway URL
`--token`	`xxx`	Auth token (or `OPENCLAW_GATEWAY_TOKEN` env var)
`--sample`	all	LoCoMo: sample index (0-based)
`--sessions`	all	Ingest: session range, e.g. `1-4` or `3`
`--tail`	`[]`	Ingest: tail message appended per session
`--user`	none	QA: user UUID from a prior ingest run (required)
`--count`	all	QA: number of questions to run

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
data		data
docs		docs
output		output
.gitignore		.gitignore
.python-version		.python-version
README.md		README.md
eval.py		eval.py
judge.py		judge.py
judge_util.py		judge_util.py
locomo10.json		locomo10.json
locomo10_small.json		locomo10_small.json
main.py		main.py
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

openclaw-eval

Run

Dataset: LoCoMo10

Structure

Two Modes

`ingest` - Load conversations into openclaw

`qa` - Run QA evaluation

Typical Workflow

CLI Options

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 1

Languages

Folders and files

Latest commit

History

Repository files navigation

openclaw-eval

Run

Dataset: LoCoMo10

Structure

Two Modes

ingest - Load conversations into openclaw

qa - Run QA evaluation

Typical Workflow

CLI Options

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 1

Languages

`ingest` - Load conversations into openclaw

`qa` - Run QA evaluation

Packages