
fix(extract+migrate): kill N+1 hang + v0.12.0 migration timeout (v0.12.1)#198

Merged
garrytan merged 5 commits into master from garrytan/p0-wave on Apr 18, 2026
Conversation

@garrytan
Owner

Summary

Two production-blocking bugs Garry hit on his 47K-page brain on April 18.

Performance / Bug Fix:

  • gbrain extract no longer hangs 10+ minutes producing zero output on large brains. Removed the N+1 dedup pre-load that issued one engine.getLinks(slug) (or getTimeline) call per page across engine.listPages({limit: 100000}) — 47K serial round-trips over the Supabase pooler before the first file was read. Both engines already enforced uniqueness at the SQL layer; the in-memory dedup Set was redundant insurance that became the bottleneck.
  • All 4 extract paths (extractLinksFromDir, extractTimelineFromDir, extractLinksFromDB, extractTimelineFromDB) now buffer 100 candidates and flush via new batch engine methods. ~99% fewer DB round-trips per re-extract.
  • v0.12.0 schema migration no longer times out at Supabase Management API's 60s ceiling on duplicate-heavy brains. Migrations v8 (links) + v9 (timeline_entries) pre-create a btree helper index before the DELETE...USING self-join, then drop it. Turns O(n²) dedup into O(n log n). On 80K+ duplicate rows the migration completes in <1s.
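The buffered write path described above can be sketched as follows. This is an illustrative stand-in, not the shipped extract.ts code: the engine interface, candidate shape, and `extractWithBuffer` name are simplified assumptions, but the pattern (buffer 100, flush via one batch call, sum the returned insert counts) mirrors what the PR describes.

```typescript
// Sketch of the buffer-and-flush pattern: one DB round-trip per 100
// candidates instead of one per candidate, and a truthful `created` count.
type LinkBatchInput = { from: string; to: string; type: string; context: string };

interface BatchEngine {
  // Returns the number of rows actually inserted (ON CONFLICT no-ops excluded).
  addLinksBatch(rows: LinkBatchInput[]): Promise<number>;
}

const BATCH_SIZE = 100;

async function extractWithBuffer(
  candidates: Iterable<LinkBatchInput>,
  engine: BatchEngine,
): Promise<number> {
  let created = 0;
  let buffer: LinkBatchInput[] = [];

  const flush = async () => {
    if (buffer.length === 0) return;
    created += await engine.addLinksBatch(buffer); // one round-trip per flush
    buffer = [];
  };

  for (const c of candidates) {
    buffer.push(c);
    if (buffer.length >= BATCH_SIZE) await flush();
  }
  await flush(); // final partial batch
  return created; // only rows the DB actually inserted, not "calls that didn't throw"
}
```

Because `created` sums the engine's returned row counts, a re-run over already-extracted content reports 0 rather than the number of attempted writes.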

New Engine API:

  • addLinksBatch(LinkBatchInput[]) → Promise<number> and addTimelineEntriesBatch(TimelineBatchInput[]) → Promise<number> on both PostgresEngine and PGLiteEngine. Multi-row INSERT...SELECT FROM unnest($1::text[], ...) JOIN pages ON CONFLICT DO NOTHING RETURNING 1 — 4 (links) or 5 (timeline) array-typed bound parameters regardless of batch size, sidestepping Postgres's 65535-parameter cap.
  • Per-row addLink / addTimelineEntry are unchanged. All 10 existing call sites compile and behave identically. Plugin authors can adopt the batch methods at their own pace.
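The array-parameter trick above can be sketched like this: N rows are pivoted into a fixed number of column arrays, so the bound-parameter count never grows with batch size. The table and column names in the SQL string are assumptions for illustration, not the exact shipped statement.

```typescript
// Illustrative sketch: rows -> fixed set of array-typed bound parameters,
// so batch size never approaches Postgres's 65535-parameter cap.
type LinkBatchInput = { from: string; to: string; type: string; context: string };

function toColumnArrays(rows: LinkBatchInput[]): [string[], string[], string[], string[]] {
  return [
    rows.map((r) => r.from),
    rows.map((r) => r.to),
    rows.map((r) => r.type),
    rows.map((r) => r.context),
  ];
}

// The SQL shape: 4 parameters regardless of row count (column/table names assumed).
const BATCH_INSERT_SQL = `
  INSERT INTO links (from_page_id, to_page_id, link_type, context)
  SELECT p_from.id, p_to.id, v.link_type, v.context
  FROM unnest($1::text[], $2::text[], $3::text[], $4::text[])
       AS v(from_slug, to_slug, link_type, context)
  JOIN pages p_from ON p_from.slug = v.from_slug
  JOIN pages p_to   ON p_to.slug   = v.to_slug
  ON CONFLICT DO NOTHING
  RETURNING 1`;
```

The JOINs silently drop rows whose slugs don't exist, and `RETURNING 1` yields one row per actual insert, which is what makes the batch methods' returned count truthful.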

Correctness:

  • created counter now truthful on re-runs (returns count of rows actually inserted via RETURNING 1, not "calls that didn't throw"). Re-run on a fully-extracted brain prints Done: 0 links instead of lying.
  • --dry-run deduplicates duplicate candidates across files (per-run Set only when dryRun=true). A link extracted from 3 files prints exactly once, matching what batch insert would actually create.
  • Whole-batch errors are surfaced in BOTH JSON and human modes — silently dropping a 100-row batch would be worse than the per-row error visibility the old path had.
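The dry-run dedup above can be sketched as a per-run Set keyed on the same tuple the DB's unique constraint covers. Names here (`dryRunPreview`, the `Candidate` shape) are illustrative, not the actual extract.ts symbols.

```typescript
// Sketch of dry-run dedup: a candidate extracted from several files is
// previewed once, matching what the batch insert would actually create.
type Candidate = { from: string; to: string; type: string };

function dryRunPreview(candidates: Candidate[]): Candidate[] {
  const seen = new Set<string>();
  const wouldCreate: Candidate[] = [];
  for (const c of candidates) {
    // Key mirrors the SQL-layer uniqueness, e.g. UNIQUE(from, to, link_type).
    const key = `${c.from}\u0000${c.to}\u0000${c.type}`;
    if (seen.has(key)) continue;
    seen.add(key);
    wouldCreate.push(c);
  }
  return wouldCreate;
}
```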

Operational:

  • v0.12.0 orchestrator's phaseASchema timeout bumped 60s → 600s as belt-and-suspenders for unforeseen slowness.

Test Coverage

COVERAGE: 14/16 paths tested (87.5%)
QUALITY: ★★★ on all new code paths

[+] src/core/{pglite,postgres}-engine.ts — addLinksBatch + addTimelineEntriesBatch
    11 PGLite cases (test/pglite-engine.test.ts) covering: empty batch,
    missing optionals normalize to '', within-batch dedup via ON CONFLICT,
    missing-slug rows dropped by JOIN, half-existing batch returns count
    of new only, batch of 100 fresh rows.
    9 E2E cases (test/e2e/mechanical.test.ts) covering the same invariants
    against real Postgres+pgvector — postgres-js sql() helper has different
    semantics than PGLite's $N placeholders, so direct coverage is essential.

[+] src/commands/extract.ts — 4 refactored functions
    test/extract-fs.test.ts (new) covers FS-source idempotency + truthful
    counter + dry-run dedup across 3 files + perf regression guard <2s.
    Existing test/extract-db.test.ts idempotency tests at :78-88 and :166-175
    continue to pass — no behavior change for the per-row engine path.

[+] src/core/migrate.ts — v8 + v9 helper-btree fix
    Two structural assertions verify the migration SQL literally contains
    CREATE INDEX IF NOT EXISTS ... DROP INDEX IF EXISTS in the right order
    around the DELETE...USING. Catches a regression even at 0-row scale
    where wall-clock can't distinguish O(n²) from O(1).
    Two behavioral tests: 1000-row dedup completes <5s on PGLite for both v8 + v9.

GAPS: 2 paths (acceptable):
- Batch error path fault injection — no test simulates DB connection drop
  mid-flush. Acceptable: error logging is informational, no recovery logic.
- phaseASchema timeout bump — one-line config change, not testable in
  isolation.

Tests: 1297 → 1412 unit pass (+115). 105 → 119 E2E pass (+14).

Pre-Landing Review

Pre-Landing Review: No issues found. (Pass 1 + Pass 2 of ~/.claude/skills/gstack/review/checklist.md)

  • SQL & Data Safety: clean (parameterized queries throughout, no string interpolation)
  • Race Conditions: clean (ON CONFLICT DO NOTHING handles concurrent insert races atomically)
  • LLM Output Trust Boundary: N/A (batch input from markdown parsing, not LLM)
  • Shell Injection / Async-Sync mixing: N/A (TypeScript)
  • Enum Completeness: clean (no new enum values)
  • Column/Field Name Safety: clean (all batch SQL columns match src/schema.sql:51-107)

Specialist Review

5 specialist subagents (testing, maintainability, performance, security, data-migration) ran in parallel against the diff. Real bugs caught:

  • [testing] The original Postgres batch SQL used postgres-js's sql(rows, 'col1', ...) helper inside (VALUES) AS v(...) JOIN syntax. The two don't compose: escapeIdentifiers errored at runtime with "str.replace is not a function" because the helper interpreted column names as identifiers in a context that expected values. The 9 E2E tests caught this; both engines were then rewritten to the cleaner unnest($1::text[], ...) array-parameter pattern, with PGLite matched for consistency. All 73 mechanical E2E tests now pass.
  • [testing + data-migration MULTI-CONFIRMED] The original regression tests asserted wall-clock <5s on a 1000-row dedup, but the specialists verified that the test passes even with the helper index removed (1M comparisons complete in milliseconds, so the threshold never trips). Added structural assertions on the migration SQL text — deterministic, fast, and catches the presence of the actual fix regardless of row count. The behavioral test stays as a sanity check.

Informational items (deferred):

  • DB-source extract still has a read N+1 (getAllSlugs() + per-slug getPage()). Filed as P1 TODO — separate fix needs a getPagesBatch(slugs[]) engine method or streaming cursor.
  • BATCH_SIZE = 100 is conservative (Postgres allows ~16K with 4 cols). Documented with comment; bumping is safe future work.
  • ANALYZE after CREATE INDEX could lock in the planner choice. Not load-bearing on PGLite or any reasonably-sized brain.
  • Lock contention during CREATE INDEX (no CONCURRENTLY). Accepted: short window (<1s on 80K rows), CONCURRENTLY illegal in transactions, schema migrations are upgrade-time.

Plan Completion

10/10 plan items DONE (per /Users/garrytan/.claude/plans/system-instruction-you-are-working-velvet-valley.md):

  1. Engine batch interface + types ✓
  2. Postgres-engine batch methods (rewritten with unnest after E2E caught the bug) ✓
  3. PGLite engine batch methods (matched to unnest pattern) ✓
  4. Migration v8 + v9 helper-btree fix ✓
  5. v0.12.0 phaseASchema timeout bump ✓
  6. extract.ts refactor (all 4 functions, dry-run dedup preserved) ✓
  7. Migration regression tests (structural + behavioral) ✓
  8. Engine batch tests (11 PGLite + 9 E2E postgres) ✓
  9. test/extract-fs.test.ts ✓
  10. CHANGELOG + migration file ✓ (this commit)

TODOS

  • Added: "Batch the DB-source extract read path" (P1) — companion fix for the read N+1 in extractLinksFromDB / extractTimelineFromDB. Needs a getPagesBatch(slugs) engine method.
  • No prior TODOs completed by this PR.

Documentation

Updated CLAUDE.md (commit 2b39e90) to reflect v0.12.1:

  • Key files: Added LinkBatchInput / TimelineBatchInput exports on src/core/engine.ts; documented the new batch methods on both engines with the actual SQL shape.
  • extract.ts description: Documented that the in-memory dedup pre-load is gone, candidates are buffered 100 at a time and flushed via the batch methods, and the created counter is truthful on re-runs.
  • Migration registry: Noted phaseASchema timeout bumped from 60s to 600s.
  • Testing section: Updated counts from 1297 unit / 105 E2E → 1412 unit / 119 E2E.

README.md, CONTRIBUTING.md, TODOS.md unchanged (no references to changed areas; deferred items already captured).

Test plan

  • All unit tests pass (bun test): 1412 pass, 0 fail
  • All E2E tests pass against pgvector container: 119 pass, 0 fail (~85s)
  • Migration regression tests guard the helper-index fix structurally + behaviorally
  • PostgresEngine batch methods directly tested via 9 E2E cases against real Postgres+pgvector
  • PGLite batch methods unit-tested with 11 cases
  • FS extract end-to-end test confirms idempotency + truthful counter + dry-run dedup + perf <2s
  • No regressions in existing extract-db.test.ts (idempotency tests at :78-88 and :166-175 still pass)
  • No regressions in existing pglite-engine.test.ts (78/78 including "upsert on same (from, to, type) updates context")

🤖 Generated with Claude Code

garrytan and others added 5 commits April 19, 2026 00:08
Multi-row INSERT...SELECT FROM unnest() JOIN pages ON CONFLICT DO NOTHING
RETURNING 1. 4 array-typed bound parameters (links) or 5 (timeline)
regardless of batch size, sidesteps Postgres's 65535-parameter cap.

Returns count of rows actually inserted (excluding ON CONFLICT no-ops
and JOIN-dropped rows whose slugs don't exist).

Per-row addLink / addTimelineEntry signatures and SQL behavior unchanged.
All 10 existing call sites compile and behave identically.

Tests: 11 PGLite cases (empty batch, missing optionals, within-batch dedup,
JOIN drops missing slug, half-existing batch, batch of 100) + 9 E2E
postgres-engine cases against real Postgres+pgvector.
…hema timeout

Production bug: v0.12.0 schema migration timed out at Supabase Management API's
60s ceiling on brains with 80K+ duplicate timeline rows. The DELETE...USING
self-join was O(n²) without an index on the dedup columns.

Fix: pre-create idx_links_dedup_helper / idx_timeline_dedup_helper on the
dedup columns BEFORE the DELETE, drop after. Turns O(n²) into O(n log n).
On 80K+ rows the migration completes in <1s instead of timing out.

Also bumps the v0.12.0 orchestrator's phaseASchema timeout 60s -> 600s as
belt-and-suspenders for unforeseen slowness.

Exports MIGRATIONS for structural test assertions.

Tests: 2 structural assertions (helper-index DDL must appear in v8/v9 SQL
in the right order — catches regression even at 0-row scale) + 2 behavioral
regression tests (1000-row dedup completes <5s).
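The migration shape and the structural assertion it describes can be sketched as below. The SQL text, index name, and `helperIndexWrapsDelete` helper are illustrative assumptions, not the shipped v8 migration or its tests.

```typescript
// Sketch (assumed SQL/names): helper btree index created BEFORE the
// DELETE...USING self-join and dropped AFTER, turning the O(n^2) dedup
// scan into an index-assisted O(n log n) pass.
const V8_SQL = `
CREATE INDEX IF NOT EXISTS idx_links_dedup_helper
  ON links (from_page_id, to_page_id, link_type);
DELETE FROM links a USING links b
  WHERE a.id > b.id
    AND a.from_page_id = b.from_page_id
    AND a.to_page_id   = b.to_page_id
    AND a.link_type    = b.link_type;
DROP INDEX IF EXISTS idx_links_dedup_helper;`;

// Structural check: the helper index must wrap the DELETE — this catches a
// regression even at 0-row scale, where wall-clock timing can't.
function helperIndexWrapsDelete(sql: string): boolean {
  const create = sql.indexOf("CREATE INDEX IF NOT EXISTS");
  const del = sql.indexOf("DELETE FROM");
  const drop = sql.indexOf("DROP INDEX IF EXISTS");
  return create !== -1 && del !== -1 && drop !== -1 && create < del && del < drop;
}
```

Asserting on the migration SQL text rather than on timing is what makes the regression guard deterministic regardless of row count.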
Production bug: gbrain extract hung 10+ minutes producing zero output on
47K-page brains. The pre-load loop called engine.getLinks(slug) (or
getTimeline) once per page across engine.listPages({limit: 100000}) — 47K
serial round-trips over the Supabase pooler before the first file was read.

Both engines already enforced uniqueness at the SQL layer
(UNIQUE(from, to, link_type) on links, idx_timeline_dedup on timeline_entries).
The in-memory dedup Set was redundant insurance that became the bottleneck.

Fix: delete the pre-load entirely. Buffer 100 candidates per file walk,
flush via engine.addLinksBatch / engine.addTimelineEntriesBatch. ~99% fewer
DB round-trips per re-extract.

Also fixes counter accuracy: 'created' now counts rows actually inserted
(via batch RETURNING 1 row count). Re-run on a fully-extracted brain
prints 'Done: 0 links' instead of lying.

Dry-run mode keeps a per-run dedup Set so duplicate candidates from N
markdown files print exactly once, not N times.

Batch errors are visible in BOTH json and human modes — silent loss of
100 rows is worse than per-row error visibility.

Tests: extract-fs.test.ts (idempotency + truthful counter + dry-run dedup
+ perf regression guard <2s).
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Reflect what shipped in v0.12.1:
- New engine methods addLinksBatch + addTimelineEntriesBatch (PGLite via
  unnest() + manual $N, postgres-engine via INSERT...SELECT FROM
  unnest($1::text[], ...) JOIN pages ON CONFLICT DO NOTHING).
- extract.ts no longer pre-loads dedup set; candidates are buffered 100
  at a time and flushed via the new batch methods.
- v0.12.0 orchestrator phaseASchema timeout bumped 60s to 600s.
- Test counts 1297 unit / 105 E2E to 1412 unit / 119 E2E.
- New test/extract-fs.test.ts covers the N+1 regression guard.
- BrainEngine method count 37/38 to 40.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@garrytan garrytan merged commit 699db50 into master Apr 18, 2026
4 checks passed
joedanz added a commit to joedanz/pbrain that referenced this pull request Apr 19, 2026
…rytan#198)

Adds addLinksBatch / addTimelineEntriesBatch to the BrainEngine
interface. Both engines (PGLite + Postgres) implement the batch as a
single `INSERT ... SELECT FROM unnest(...) JOIN pages ON CONFLICT
DO NOTHING RETURNING 1` statement — 4-5 array-typed bound parameters
regardless of batch size, sidestepping the 65535-parameter cap and
the postgres-js sql(rows, ...) helper's identifier-escape gotcha.

pbrain extract (file-source) now flushes candidates 100 at a time via
the batch API instead of one write per link/entry. On large brains
this drops wall-clock from "tens of minutes" to "seconds" end-to-end.

Schema changes required for ON CONFLICT (from_page_id, to_page_id,
link_type) to match a real unique index:

- links table UNIQUE constraint widened from 2 columns to 3
  (from_page_id, to_page_id, link_type), matching upstream v0.10.3
  knowledge-graph layer. addLink's ON CONFLICT clause updated in
  both engines. Fresh installs get the new constraint from schema.sql
  / pglite-schema.ts; existing installs pick it up via migration v8.
- src/core/migrate.ts picks up upstream's v5-v10 schema migrations.
  v5-v7 (minion_jobs / inbox / attachments) create empty tables that
  this fork doesn't use; they're harmless and keep the schema
  forward-compatible with a future Minions integration. v8-v10 touch
  the links and timeline_entries tables we already have.

Notes on deferred upstream work:
- The DB-source `pbrain extract --source db` path remains deferred
  with the knowledge-graph layer (depends on src/core/link-extraction.ts
  which we haven't taken). File-source extract is what uses the new
  batch API.
- Upstream's v0.12.0 Knowledge Graph auto-wire orchestrator is NOT
  registered (src/commands/migrations/v0_12_0.ts deleted).

(cherry picked from commit 699db50d6cb3b96c3ba3cea8cd55c78f0f9c3bae)

Co-Authored-By: Garry Tan <[email protected]>