gbrain-evals

BrainBench — the public benchmark for personal knowledge agent stacks.

Scores four adapter configurations (gbrain, grep-only, vector RAG, gbrain-without-graph) side-by-side on a 240-page fictional-life corpus. Answers the question: "does the knowledge graph layer do useful work, or is gbrain just a thin wrapper over vector-grep-rrf-fusion retrieval?"

Headline on v0.12.1: gbrain P@5 49.1%, R@5 97.9% — beats its own graph-disabled variant by +31.4 points P@5, grep-only by 32 points, vector by 38 points. The graph layer is load-bearing.
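For readers new to the headline metrics, here is a minimal sketch of precision@k and recall@k for a single query. The function name and the choice to divide precision by k (rather than by the number of results returned) are assumptions for illustration, not BrainBench's confirmed scoring code.

```typescript
// Hypothetical helper: precision@k and recall@k for one query.
// `ranked` is an adapter's ranked list of page ids, `relevant` is the gold set.
function precisionRecallAtK(
  ranked: string[],
  relevant: Set<string>,
  k: number,
): { precision: number; recall: number } {
  const topK = ranked.slice(0, k);
  const hits = topK.filter((id) => relevant.has(id)).length;
  return {
    // Assumption: precision is divided by k, not by topK.length.
    precision: k > 0 ? hits / k : 0,
    recall: relevant.size > 0 ? hits / relevant.size : 0,
  };
}

const example = precisionRecallAtK(
  ["p1", "p7", "p3", "p9", "p4"],
  new Set(["p1", "p3"]),
  5,
);
console.log(example); // precision 0.4, recall 1
```

Under this definition a query with only two relevant pages cannot score above 0.4 on P@5 even when both are retrieved, which is one way a high R@5 can sit alongside a much lower P@5.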

Why a separate repo

Benchmark corpora (world-v1 + amara-life-v1 = ~4MB) shouldn't land in every gbrain install. This repo is what you clone when you want to run BrainBench against gbrain, not what you clone to use gbrain as a brain.

gbrain-evals depends on gbrain via the GitHub URL. When you bun install here, gbrain gets pulled in as a library. Evals call into gbrain's core modules (pglite-engine, operations, link-extraction, etc.) via the gbrain/* subpath exports.

5-minute quickstart

# Clone + install (pulls gbrain as a library dep)
git clone https://github.com/garrytan/gbrain-evals.git
cd gbrain-evals
bun install

# Run the full 4-adapter benchmark (N=5, ~15 min, no API keys required)
bun run eval:run

# Fast iteration (N=1)
bun run eval:run:dev

# Per-link-type accuracy report
bun run eval:type-accuracy

# Browse the fictional corpus
bun run eval:world:view

# Full BrainBench v1 scorecard (all Cats, published tier N=10)
bun run eval:brainbench:published       # ~$200 Opus baseline
bun run eval:brainbench                 # N=5 iteration (~$100)
bun run eval:brainbench:smoke           # N=1 smoke (~$22)

BrainBench Cat catalog

Cat	What it tests	Threshold	Status
1+2	Retrieval (relational queries over 240-page rich-prose)	P@5 > 0.39, R@5 > 0.83	shipping
2	Per-link-type accuracy on rich prose	type F1 per category	shipping
3	Identity resolution (aliases, handles, emails)	recall > 0.80	shipping
4	Temporal queries (as-of, point, range, recency)	as-of recall > 0.80	shipping
5	Source attribution / provenance (claim → source classification)	citation_accuracy > 0.90	shipping (programmatic)
6	Auto-link precision under prose (at scale)	link_precision > 0.95	shipping (baseline-only)
7	Performance / latency	p95 < 200ms per query	shipping
8	Skill behavior compliance (brain-first, back-link, citation, tier)	all > 0.90	shipping (programmatic)
9	End-to-end workflows (5 flows × rubric)	80% pass per workflow	shipping (programmatic)
10	Robustness / adversarial (22 hand-crafted cases)	100% pass, no crash	shipping
11	Multi-modal ingest (PDF + audio + HTML)	text > 0.95, WER < 0.15	shipping (opt-in fixtures)
12	MCP operation contract (trust boundary, input validation)	no silent corruption	shipping

Cats 5, 8, 9 are "programmatic" — they need runtime inputs (claim catalog, probe catalog, scenarios + agent state) and are invoked via their runCatN harness API rather than as standalone CLI scripts.
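As an illustration of the Cat 7 threshold (p95 < 200ms per query), the sketch below computes a nearest-rank percentile over per-query latencies. The `percentile` helper and the sample numbers are hypothetical; the actual Cat 7 runner may use a different percentile method.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Made-up latencies in milliseconds, for illustration only.
const latencies = [12, 15, 18, 22, 30, 41, 55, 90, 140, 210];
const p95 = percentile(latencies, 95);
console.log(p95, p95 < 200 ? "pass" : "fail");
```

With only ten samples, p95 under nearest-rank is simply the slowest query, so a single outlier decides the outcome. Larger query sets make the threshold less sensitive to one slow call.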

The fictional corpus: world-v1 + amara-life-v1

world-v1 (committed, 2.0MB): 240 Opus-generated biographical pages. 80 people, 80 companies, 50 meetings, 30 concepts. Each page carries _facts gold metadata that never crosses the adapter boundary (Day 9 sealed-qrels enforcement).
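One way to picture the sealed-qrels rule is a function that strips the gold metadata before pages are handed to an adapter. This is a sketch under assumptions: only the `_facts` field name comes from this README, while `GoldPage`, `slug`, `body`, and `sealPages` are hypothetical.

```typescript
// Hypothetical page shape: only `_facts` is named in the README.
interface GoldPage {
  slug: string;
  body: string;
  _facts?: Record<string, unknown>;
}

// What an adapter is allowed to see: everything except the gold metadata.
type AdapterPage = Omit<GoldPage, "_facts">;

// Return a copy of each page with `_facts` removed; the input is not mutated.
function sealPages(pages: GoldPage[]): AdapterPage[] {
  return pages.map(({ _facts, ...visible }) => visible);
}

const sealedExample = sealPages([
  { slug: "people/example", body: "Example prose.", _facts: { employer: "Acme" } },
]);
console.log(Object.keys(sealedExample[0])); // ["slug", "body"]
```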

amara-life-v1 (committed, 2.1MB): Amara Okafor's messy week in April 2026. 50 emails + 300 Slack messages across 4 channels + 20 calendar events + 8 meeting transcripts + 40 first-person notes + 6 reference docs. Planted perturbations: 10 contradictions, 5 stale facts, 5 paraphrased-injection poison items, 3 implicit preferences.

Regenerate with bun run eval:generate-amara-life (requires ANTHROPIC_API_KEY, ~$4 Opus, ~15 min, deterministic from seed=42).

Repo layout

gbrain-evals/
├── eval/
│   ├── data/
│   │   ├── world-v1/                 240 committed biographical pages
│   │   ├── amara-life-v1/            Amara's fictional life (committed)
│   │   ├── gold/                     Sealed qrels + perturbation gold
│   │   └── multimodal/               PDF/audio/HTML fixtures (on-demand)
│   ├── schemas/                      Portable JSON Schema contracts
│   ├── generators/                   world.ts + amara-life.ts + Opus
│   ├── runner/                       12 Cat runners + adapters + judge
│   │   ├── adapters/                 grep-only, vector, vector-grep-rrf-fusion, claude-sonnet
│   │   ├── loaders/                  PDF + corpus loaders
│   │   ├── queries/                  Tier 5 fuzzy + 5.5 synthetic
│   │   ├── all.ts                    Master runner (p-limit(2) async fanout)
│   │   ├── cat{5,6,8,9,11}-*.ts      v1 Complete runners
│   │   ├── tool-bridge.ts            12 read + 3 dry_run tools
│   │   ├── judge.ts                  Haiku judge, structured evidence contract
│   │   ├── recorder.ts               6-artifact flight-recorder
│   │   └── llm-budget.ts             Shared Anthropic-call semaphore
│   └── cli/                          world-view, query-validate, query-new
├── test/eval/                        Unit tests (314 tests, 1354 expect calls)
└── docs/benchmarks/                  Committed scorecards per release

Three contributor paths

1. Reproduce a published scorecard

git checkout <commit-sha-from-scorecard>
bun run eval:run
# Match within tolerance bands (deterministic adapters byte-match)

2. Submit a new adapter

Implement eval/runner/adapters/<your-adapter>.ts against the Adapter interface (init(pages, config) → BrainState, query(q, state) → RankedDoc[]).
Register it in eval/runner/multi-adapter.ts.
Run bun run eval:run — it scores side-by-side against the 4 references.
Open a PR with your scorecard in docs/benchmarks/YYYY-MM-DD-<stack>.md.
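The steps above can be sketched with a toy adapter. The README names the Adapter interface, BrainState, and RankedDoc but does not show their fields, so every field below (`slug`, `body`, `score`, `name`) is an assumption. The methods are written synchronously to keep the example short; the real interface may return promises.

```typescript
// Assumed shapes, for illustration only.
interface Page { slug: string; body: string }
interface RankedDoc { slug: string; score: number }
interface BrainState { pages: Page[] }

interface Adapter {
  name: string;
  init(pages: Page[], config: Record<string, unknown>): BrainState;
  query(q: string, state: BrainState): RankedDoc[];
}

// Lowercase and split on anything that is not a letter or digit.
const tokenize = (text: string): string[] =>
  text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);

// Toy adapter: ranks pages by how many distinct query terms they contain.
const termOverlapAdapter: Adapter = {
  name: "term-overlap",
  init(pages) {
    return { pages };
  },
  query(q, state) {
    const terms = Array.from(new Set(tokenize(q)));
    return state.pages
      .map((page) => {
        const words = new Set(tokenize(page.body));
        const score = terms.filter((t) => words.has(t)).length;
        return { slug: page.slug, score };
      })
      .filter((doc) => doc.score > 0)
      .sort((a, b) => b.score - a.score);
  },
};
```

A real submission would keep whatever index it builds inside the state returned by `init`, so that `query` stays cheap when it is called repeatedly.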

3. Extend a Cat

Add a new Cat runner at eval/runner/catN-*.ts.
Wire into eval/runner/all.ts CATEGORIES.
Add tests at test/eval/catN.test.ts.
Commit a baseline to docs/benchmarks/.

Design doc + methodology

docs/benchmarks/TEMPLATE-brainbench-v1.md — scorecard format (coming in v1 Complete ship)
BrainBench v1 design doc: ~/.gstack/projects/garrytan-gbrain/garrytan-garrytan-gbrain-evals-design-20260418-081754.md (original)
3-axis metric framework: Retrieval (Cat 1-4), Ingestion (Cat 2, 6, 11), Assistant/personalization (Cat 5, 8, 9)
Anti-gaming: sealed qrels at the adapter boundary, N=3/5/10 tolerance bands, judge-version pinning, randomized query order per seeded run
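To illustrate "randomized query order per seeded run", here is a minimal sketch of a reproducible shuffle. It uses mulberry32, a small public-domain PRNG, with a Fisher-Yates shuffle. Both are stand-ins chosen for the example; the README does not say which generator or shuffle the runner uses.

```typescript
// mulberry32: a small 32-bit PRNG. The same seed always yields the same sequence.
function mulberry32(seed: number): () => number {
  let a = seed | 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Fisher-Yates shuffle driven by the seeded generator; returns a new array.
function seededShuffle<T>(items: T[], seed: number): T[] {
  const out = [...items];
  const rand = mulberry32(seed);
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}

// Hypothetical query ids; two runs with the same seed see the same order.
const queryIds = ["q1", "q2", "q3", "q4", "q5", "q6"];
console.log(seededShuffle(queryIds, 42));
```

Seeding keeps a run reproducible while still preventing an adapter from relying on a fixed query order across differently seeded runs.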

License

MIT. Fixtures (world-v1, amara-life-v1) are fully fictional and redistributable.

Relationship to gbrain

gbrain-evals is a consumer of gbrain. The benchmark imports gbrain's public surface via gbrain/* subpath exports:

gbrain/operations — the 36 operations (tool-bridge exposes 12 read-only + 3 dry_run)
gbrain/pglite-engine — in-memory Postgres for adapter state
gbrain/link-extraction — extractor under test
gbrain/import-file, gbrain/embedding, gbrain/transcription — ingest pipeline
gbrain/search/vector-grep-rrf-fusion — vector-grep-rrf-fusion RAG implementation
gbrain/types, gbrain/config, gbrain/engine — type contracts
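The "rrf" in the search module name suggests reciprocal rank fusion. As background, the sketch below shows the textbook form of that technique: each input ranking contributes 1 / (k + rank) to a document's score. The function name and the constant k = 60 are conventional choices for illustration, not gbrain's confirmed implementation or API.

```typescript
// Reciprocal rank fusion over several ranked lists of document ids.
// Ranks are 1-based, so the item at index 0 contributes 1 / (k + 1).
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// Hypothetical inputs: one keyword ranking and one vector ranking.
const fused = reciprocalRankFusion([
  ["a", "b", "c"],
  ["b", "c", "a"],
]);
console.log(fused); // ["b", "a", "c"]
```

Because only ranks are used, the fusion needs no score normalization between keyword and vector retrievers.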

Any adapter that implements the Adapter interface can be scored — gbrain is one of many reference stacks, not the benchmark's subject.
