
fix(extract+migrate): kill N+1 hang + v0.12.0 migration timeout (v0.12.1)#198

Merged
garrytan merged 5 commits into master from garrytan/p0-wave on Apr 18, 2026
Conversation

@garrytan
Owner

Summary

Two production-blocking bugs Garry hit on his 47K-page brain on April 18.

Performance / Bug Fix:

  • gbrain extract no longer hangs 10+ minutes producing zero output on large brains. Removed the N+1 dedup pre-load that issued one engine.getLinks(slug) (or getTimeline) call per page across engine.listPages({limit: 100000}) — 47K serial round-trips over the Supabase pooler before the first file was read. Both engines already enforced uniqueness at the SQL layer; the in-memory dedup Set was redundant insurance that became the bottleneck.
  • All 4 extract paths (extractLinksFromDir, extractTimelineFromDir, extractLinksFromDB, extractTimelineFromDB) now buffer 100 candidates and flush via new batch engine methods. ~99% fewer DB round-trips per re-extract.
  • v0.12.0 schema migration no longer times out at Supabase Management API's 60s ceiling on duplicate-heavy brains. Migrations v8 (links) + v9 (timeline_entries) pre-create a btree helper index before the DELETE...USING self-join, then drop it. Turns O(n²) dedup into O(n log n). On 80K+ duplicate rows the migration completes in <1s.
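The buffered write path described above can be sketched as follows. This is an illustrative stand-in, not the shipped extract.ts code: the engine interface, candidate shape, and `extractWithBuffer` name are simplified assumptions, but the pattern (buffer 100, flush via one batch call, sum the returned insert counts) mirrors what the PR describes.

```typescript
// Sketch of the buffer-and-flush pattern: one DB round-trip per 100
// candidates instead of one per candidate, and a truthful `created` count.
type LinkBatchInput = { from: string; to: string; type: string; context: string };

interface BatchEngine {
  // Returns the number of rows actually inserted (ON CONFLICT no-ops excluded).
  addLinksBatch(rows: LinkBatchInput[]): Promise<number>;
}

const BATCH_SIZE = 100;

async function extractWithBuffer(
  candidates: Iterable<LinkBatchInput>,
  engine: BatchEngine,
): Promise<number> {
  let created = 0;
  let buffer: LinkBatchInput[] = [];

  const flush = async () => {
    if (buffer.length === 0) return;
    created += await engine.addLinksBatch(buffer); // one round-trip per flush
    buffer = [];
  };

  for (const c of candidates) {
    buffer.push(c);
    if (buffer.length >= BATCH_SIZE) await flush();
  }
  await flush(); // final partial batch
  return created; // only rows the DB actually inserted, not "calls that didn't throw"
}
```

Because `created` sums the engine's returned row counts, a re-run over already-extracted content reports 0 rather than the number of attempted writes.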

New Engine API:

  • addLinksBatch(LinkBatchInput[]) → Promise<number> and addTimelineEntriesBatch(TimelineBatchInput[]) → Promise<number> on both PostgresEngine and PGLiteEngine. Multi-row INSERT...SELECT FROM unnest($1::text[], ...) JOIN pages ON CONFLICT DO NOTHING RETURNING 1 — 4 (links) or 5 (timeline) array-typed bound parameters regardless of batch size, sidestepping Postgres's 65535-parameter cap.
  • Per-row addLink / addTimelineEntry are unchanged. All 10 existing call sites compile and behave identically. Plugin authors can adopt the batch methods at their own pace.
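The array-parameter trick above can be sketched like this: N rows are pivoted into a fixed number of column arrays, so the bound-parameter count never grows with batch size. The table and column names in the SQL string are assumptions for illustration, not the exact shipped statement.

```typescript
// Illustrative sketch: rows -> fixed set of array-typed bound parameters,
// so batch size never approaches Postgres's 65535-parameter cap.
type LinkBatchInput = { from: string; to: string; type: string; context: string };

function toColumnArrays(rows: LinkBatchInput[]): [string[], string[], string[], string[]] {
  return [
    rows.map((r) => r.from),
    rows.map((r) => r.to),
    rows.map((r) => r.type),
    rows.map((r) => r.context),
  ];
}

// The SQL shape: 4 parameters regardless of row count (column/table names assumed).
const BATCH_INSERT_SQL = `
  INSERT INTO links (from_page_id, to_page_id, link_type, context)
  SELECT p_from.id, p_to.id, v.link_type, v.context
  FROM unnest($1::text[], $2::text[], $3::text[], $4::text[])
       AS v(from_slug, to_slug, link_type, context)
  JOIN pages p_from ON p_from.slug = v.from_slug
  JOIN pages p_to   ON p_to.slug   = v.to_slug
  ON CONFLICT DO NOTHING
  RETURNING 1`;
```

The JOINs silently drop rows whose slugs don't exist, and `RETURNING 1` yields one row per actual insert, which is what makes the batch methods' returned count truthful.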

Correctness:

  • created counter now truthful on re-runs (returns count of rows actually inserted via RETURNING 1, not "calls that didn't throw"). Re-run on a fully-extracted brain prints Done: 0 links instead of lying.
  • --dry-run deduplicates duplicate candidates across files (per-run Set only when dryRun=true). A link extracted from 3 files prints exactly once, matching what batch insert would actually create.
  • Whole-batch errors are surfaced in BOTH JSON and human modes — silently dropping a 100-row batch would be worse than the per-row error visibility the old path had.
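The dry-run dedup above can be sketched as a per-run Set keyed on the same tuple the DB's unique constraint covers. Names here (`dryRunPreview`, the `Candidate` shape) are illustrative, not the actual extract.ts symbols.

```typescript
// Sketch of dry-run dedup: a candidate extracted from several files is
// previewed once, matching what the batch insert would actually create.
type Candidate = { from: string; to: string; type: string };

function dryRunPreview(candidates: Candidate[]): Candidate[] {
  const seen = new Set<string>();
  const wouldCreate: Candidate[] = [];
  for (const c of candidates) {
    // Key mirrors the SQL-layer uniqueness, e.g. UNIQUE(from, to, link_type).
    const key = `${c.from}\u0000${c.to}\u0000${c.type}`;
    if (seen.has(key)) continue;
    seen.add(key);
    wouldCreate.push(c);
  }
  return wouldCreate;
}
```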

Operational:

  • v0.12.0 orchestrator's phaseASchema timeout bumped 60s → 600s as belt-and-suspenders for unforeseen slowness.

Test Coverage

COVERAGE: 14/16 paths tested (87.5%)
QUALITY: ★★★ on all new code paths

[+] src/core/{pglite,postgres}-engine.ts — addLinksBatch + addTimelineEntriesBatch
    11 PGLite cases (test/pglite-engine.test.ts) covering: empty batch,
    missing optionals normalize to '', within-batch dedup via ON CONFLICT,
    missing-slug rows dropped by JOIN, half-existing batch returns count
    of new only, batch of 100 fresh rows.
    9 E2E cases (test/e2e/mechanical.test.ts) covering the same invariants
    against real Postgres+pgvector — postgres-js sql() helper has different
    semantics than PGLite's $N placeholders, so direct coverage is essential.

[+] src/commands/extract.ts — 4 refactored functions
    test/extract-fs.test.ts (new) covers FS-source idempotency + truthful
    counter + dry-run dedup across 3 files + perf regression guard <2s.
    Existing test/extract-db.test.ts idempotency tests at :78-88 and :166-175
    continue to pass — no behavior change for the per-row engine path.

[+] src/core/migrate.ts — v8 + v9 helper-btree fix
    Two structural assertions verify the migration SQL literally contains
    CREATE INDEX IF NOT EXISTS ... DROP INDEX IF EXISTS in the right order
    around the DELETE...USING. Catches a regression even at 0-row scale
    where wall-clock can't distinguish O(n²) from O(1).
    Two behavioral tests: 1000-row dedup completes <5s on PGLite for both v8 + v9.

GAPS: 2 paths (acceptable):
- Batch error path fault injection — no test simulates DB connection drop
  mid-flush. Acceptable: error logging is informational, no recovery logic.
- phaseASchema timeout bump — one-line config change, not testable in
  isolation.

Tests: 1297 → 1412 unit pass (+115). 105 → 119 E2E pass (+14).

Pre-Landing Review

Pre-Landing Review: No issues found. (Pass 1 + Pass 2 of ~/.claude/skills/gstack/review/checklist.md)

  • SQL & Data Safety: clean (parameterized queries throughout, no string interpolation)
  • Race Conditions: clean (ON CONFLICT DO NOTHING handles concurrent insert races atomically)
  • LLM Output Trust Boundary: N/A (batch input from markdown parsing, not LLM)
  • Shell Injection / Async-Sync mixing: N/A (TypeScript)
  • Enum Completeness: clean (no new enum values)
  • Column/Field Name Safety: clean (all batch SQL columns match src/schema.sql:51-107)

Specialist Review

5 specialist subagents (testing, maintainability, performance, security, data-migration) ran in parallel against the diff. Real bugs caught:

  • [testing] The original Postgres batch SQL used postgres-js's sql(rows, 'col1', ...) helper inside (VALUES) AS v(...) JOIN syntax. The two don't compose: escapeIdentifiers errored at runtime with "str.replace is not a function" because the helper interpreted column names as identifiers in a context that expected values. The 9 E2E tests caught this; both engines were then rewritten to the cleaner unnest($1::text[], ...) array-parameter pattern, with PGLite matched for consistency. All 73 mechanical E2E tests now pass.
  • [testing + data-migration MULTI-CONFIRMED] The original regression tests asserted wall-clock <5s on a 1000-row dedup, but the specialists verified that the test passes even with the helper index removed (1M comparisons complete in milliseconds, so the threshold never trips). Added structural assertions on the migration SQL text — deterministic, fast, and catches the presence of the actual fix regardless of row count. The behavioral test stays as a sanity check.

Informational items (deferred):

  • DB-source extract still has a read N+1 (getAllSlugs() + per-slug getPage()). Filed as P1 TODO — separate fix needs a getPagesBatch(slugs[]) engine method or streaming cursor.
  • BATCH_SIZE = 100 is conservative (Postgres allows ~16K with 4 cols). Documented with comment; bumping is safe future work.
  • ANALYZE after CREATE INDEX could lock in the planner choice. Not load-bearing on PGLite or any reasonably-sized brain.
  • Lock contention during CREATE INDEX (no CONCURRENTLY). Accepted: short window (<1s on 80K rows), CONCURRENTLY illegal in transactions, schema migrations are upgrade-time.

Plan Completion

10/10 plan items DONE (per /Users/garrytan/.claude/plans/system-instruction-you-are-working-velvet-valley.md):

  1. Engine batch interface + types ✓
  2. Postgres-engine batch methods (rewritten with unnest after E2E caught the bug) ✓
  3. PGLite engine batch methods (matched to unnest pattern) ✓
  4. Migration v8 + v9 helper-btree fix ✓
  5. v0.12.0 phaseASchema timeout bump ✓
  6. extract.ts refactor (all 4 functions, dry-run dedup preserved) ✓
  7. Migration regression tests (structural + behavioral) ✓
  8. Engine batch tests (11 PGLite + 9 E2E postgres) ✓
  9. test/extract-fs.test.ts ✓
  10. CHANGELOG + migration file ✓ (this commit)

TODOS

  • Added: "Batch the DB-source extract read path" (P1) — companion fix for the read N+1 in extractLinksFromDB / extractTimelineFromDB. Needs a getPagesBatch(slugs) engine method.
  • No prior TODOs completed by this PR.

Documentation

Updated CLAUDE.md (commit 2b39e90) to reflect v0.12.1:

  • Key files: Added LinkBatchInput / TimelineBatchInput exports on src/core/engine.ts; documented the new batch methods on both engines with the actual SQL shape.
  • extract.ts description: Documented that the in-memory dedup pre-load is gone, candidates are buffered 100 at a time and flushed via the batch methods, and the created counter is truthful on re-runs.
  • Migration registry: Noted phaseASchema timeout bumped from 60s to 600s.
  • Testing section: Updated counts from 1297 unit / 105 E2E → 1412 unit / 119 E2E.

README.md, CONTRIBUTING.md, TODOS.md unchanged (no references to changed areas; deferred items already captured).

Test plan

  • All unit tests pass (bun test): 1412 pass, 0 fail
  • All E2E tests pass against pgvector container: 119 pass, 0 fail (~85s)
  • Migration regression tests guard the helper-index fix structurally + behaviorally
  • PostgresEngine batch methods directly tested via 9 E2E cases against real Postgres+pgvector
  • PGLite batch methods unit-tested with 11 cases
  • FS extract end-to-end test confirms idempotency + truthful counter + dry-run dedup + perf <2s
  • No regressions in existing extract-db.test.ts (idempotency tests at :78-88 and :166-175 still pass)
  • No regressions in existing pglite-engine.test.ts (78/78 including "upsert on same (from, to, type) updates context")

🤖 Generated with Claude Code

garrytan and others added 5 commits April 19, 2026 00:08
Multi-row INSERT...SELECT FROM unnest() JOIN pages ON CONFLICT DO NOTHING
RETURNING 1. 4 array-typed bound parameters (links) or 5 (timeline)
regardless of batch size, sidesteps Postgres's 65535-parameter cap.

Returns count of rows actually inserted (excluding ON CONFLICT no-ops
and JOIN-dropped rows whose slugs don't exist).

Per-row addLink / addTimelineEntry signatures and SQL behavior unchanged.
All 10 existing call sites compile and behave identically.

Tests: 11 PGLite cases (empty batch, missing optionals, within-batch dedup,
JOIN drops missing slug, half-existing batch, batch of 100) + 9 E2E
postgres-engine cases against real Postgres+pgvector.
…hema timeout

Production bug: v0.12.0 schema migration timed out at Supabase Management API's
60s ceiling on brains with 80K+ duplicate timeline rows. The DELETE...USING
self-join was O(n²) without an index on the dedup columns.

Fix: pre-create idx_links_dedup_helper / idx_timeline_dedup_helper on the
dedup columns BEFORE the DELETE, drop after. Turns O(n²) into O(n log n).
On 80K+ rows the migration completes in <1s instead of timing out.

Also bumps the v0.12.0 orchestrator's phaseASchema timeout 60s -> 600s as
belt-and-suspenders for unforeseen slowness.

Exports MIGRATIONS for structural test assertions.

Tests: 2 structural assertions (helper-index DDL must appear in v8/v9 SQL
in the right order — catches regression even at 0-row scale) + 2 behavioral
regression tests (1000-row dedup completes <5s).
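The migration shape and the structural assertion it describes can be sketched as below. The SQL text, index name, and `helperIndexWrapsDelete` helper are illustrative assumptions, not the shipped v8 migration or its tests.

```typescript
// Sketch (assumed SQL/names): helper btree index created BEFORE the
// DELETE...USING self-join and dropped AFTER, turning the O(n^2) dedup
// scan into an index-assisted O(n log n) pass.
const V8_SQL = `
CREATE INDEX IF NOT EXISTS idx_links_dedup_helper
  ON links (from_page_id, to_page_id, link_type);
DELETE FROM links a USING links b
  WHERE a.id > b.id
    AND a.from_page_id = b.from_page_id
    AND a.to_page_id   = b.to_page_id
    AND a.link_type    = b.link_type;
DROP INDEX IF EXISTS idx_links_dedup_helper;`;

// Structural check: the helper index must wrap the DELETE — this catches a
// regression even at 0-row scale, where wall-clock timing can't.
function helperIndexWrapsDelete(sql: string): boolean {
  const create = sql.indexOf("CREATE INDEX IF NOT EXISTS");
  const del = sql.indexOf("DELETE FROM");
  const drop = sql.indexOf("DROP INDEX IF EXISTS");
  return create !== -1 && del !== -1 && drop !== -1 && create < del && del < drop;
}
```

Asserting on the migration SQL text rather than on timing is what makes the regression guard deterministic regardless of row count.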
Production bug: gbrain extract hung 10+ minutes producing zero output on
47K-page brains. The pre-load loop called engine.getLinks(slug) (or
getTimeline) once per page across engine.listPages({limit: 100000}) — 47K
serial round-trips over the Supabase pooler before the first file was read.

Both engines already enforced uniqueness at the SQL layer
(UNIQUE(from, to, link_type) on links, idx_timeline_dedup on timeline_entries).
The in-memory dedup Set was redundant insurance that became the bottleneck.

Fix: delete the pre-load entirely. Buffer 100 candidates per file walk,
flush via engine.addLinksBatch / engine.addTimelineEntriesBatch. ~99% fewer
DB round-trips per re-extract.

Also fixes counter accuracy: 'created' now counts rows actually inserted
(via batch RETURNING 1 row count). Re-run on a fully-extracted brain
prints 'Done: 0 links' instead of lying.

Dry-run mode keeps a per-run dedup Set so duplicate candidates from N
markdown files print exactly once, not N times.

Batch errors are visible in BOTH json and human modes — silent loss of
100 rows is worse than per-row error visibility.

Tests: extract-fs.test.ts (idempotency + truthful counter + dry-run dedup
+ perf regression guard <2s).
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Reflect what shipped in v0.12.1:
- New engine methods addLinksBatch + addTimelineEntriesBatch (PGLite via
  unnest() + manual $N, postgres-engine via INSERT...SELECT FROM
  unnest($1::text[], ...) JOIN pages ON CONFLICT DO NOTHING).
- extract.ts no longer pre-loads dedup set; candidates are buffered 100
  at a time and flushed via the new batch methods.
- v0.12.0 orchestrator phaseASchema timeout bumped 60s to 600s.
- Test counts 1297 unit / 105 E2E to 1412 unit / 119 E2E.
- New test/extract-fs.test.ts covers the N+1 regression guard.
- BrainEngine method count 37/38 to 40.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@garrytan garrytan merged commit 699db50 into master Apr 18, 2026
4 checks passed
joedanz added a commit to joedanz/pbrain that referenced this pull request Apr 19, 2026
…rytan#198)

Adds addLinksBatch / addTimelineEntriesBatch to the BrainEngine
interface. Both engines (PGLite + Postgres) implement the batch as a
single `INSERT ... SELECT FROM unnest(...) JOIN pages ON CONFLICT
DO NOTHING RETURNING 1` statement — 4-5 array-typed bound parameters
regardless of batch size, sidestepping the 65535-parameter cap and
the postgres-js sql(rows, ...) helper's identifier-escape gotcha.

pbrain extract (file-source) now flushes candidates 100 at a time via
the batch API instead of one write per link/entry. On large brains
this drops wall-clock from "tens of minutes" to "seconds" end-to-end.

Schema changes required for ON CONFLICT (from_page_id, to_page_id,
link_type) to match a real unique index:

- links table UNIQUE constraint widened from 2 columns to 3
  (from_page_id, to_page_id, link_type), matching upstream v0.10.3
  knowledge-graph layer. addLink's ON CONFLICT clause updated in
  both engines. Fresh installs get the new constraint from schema.sql
  / pglite-schema.ts; existing installs pick it up via migration v8.
- src/core/migrate.ts picks up upstream's v5-v10 schema migrations.
  v5-v7 (minion_jobs / inbox / attachments) create empty tables that
  this fork doesn't use; they're harmless and keep the schema
  forward-compatible with a future Minions integration. v8-v10 touch
  the links and timeline_entries tables we already have.

Notes on deferred upstream work:
- The DB-source `pbrain extract --source db` path remains deferred
  with the knowledge-graph layer (depends on src/core/link-extraction.ts
  which we haven't taken). File-source extract is what uses the new
  batch API.
- Upstream's v0.12.0 Knowledge Graph auto-wire orchestrator is NOT
  registered (src/commands/migrations/v0_12_0.ts deleted).

(cherry picked from commit 699db50d6cb3b96c3ba3cea8cd55c78f0f9c3bae)

Co-Authored-By: Garry Tan <[email protected]>