fix(sync): deadlock on incremental sync with > 10 modified files #132
Open
sunnnybala wants to merge 1 commit into garrytan:master from
Conversation
sync.ts wraps the add/modify loop in engine.transaction(), and each importFromContent inside opens another one. PGLite's _runExclusiveTransaction is a non-reentrant mutex — the second call queues on the mutex the first is holding, and the process hangs forever in ep_poll. Reproduced with a 15-file commit: unpatched hangs, patched runs in 3.4s. Fix drops the outer wrap; per-file atomicity is correct anyway (one file's failure should not roll back the others).
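The hang this commit describes is easy to model without PGLite. The sketch below is illustrative only: `ExclusiveMutex` is a toy stand-in for the internal `_runExclusiveTransaction` of `@electric-sql/pglite`, not its real API. A nested acquisition queues behind the holder forever, which a timeout race makes visible.

```typescript
// Toy non-reentrant exclusive mutex: each run() waits for every earlier
// run() to finish before invoking fn. (Stand-in, not PGLite's real code.)
class ExclusiveMutex {
  private tail: Promise<void> = Promise.resolve();

  run<T>(fn: () => Promise<T>): Promise<T> {
    const result = this.tail.then(fn);
    // Keep the chain alive whether fn resolves or rejects.
    this.tail = result.then(() => undefined, () => undefined);
    return result;
  }
}

const mutex = new ExclusiveMutex();

// Outer "transaction" awaits an inner one on the same mutex. The inner
// call queues behind the outer, which cannot finish until the inner does:
// neither ever completes.
const nested = mutex.run(async () => {
  await mutex.run(async () => "inner");
  return "outer";
});

const outcome = await Promise.race([
  nested.then(() => "completed"),
  new Promise<string>((resolve) => setTimeout(() => resolve("deadlocked"), 200)),
]);
console.log(outcome); // "deadlocked"
```

Dropping the outer `run()` (the fix in this commit) lets each inner call acquire the mutex on its own, one file at a time.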
garrytan added a commit that referenced this pull request on Apr 19, 2026
Reads src/commands/sync.ts verbatim and asserts no uncommented engine.transaction() call appears above the add/modify loop. Protects against silent reintroduction of the nested-mutex deadlock that hung > 10-file syncs forever in ep_poll.
garrytan added a commit that referenced this pull request on Apr 19, 2026
Master shipped v0.12.1 (extract N+1 + migration timeout) and v0.12.2 (JSONB double-encode + splitBody + wiki types + parseEmbedding) while this wave was mid-flight. Ships the remaining pieces as v0.12.3:

- sync deadlock (#132, @sunnnybala)
- statement_timeout scoping (#158, @garagon)
- Obsidian wikilinks + domain patterns (#187 slice, @knee5)
- gbrain orphans command (#187 slice, @knee5)
- tryParseEmbedding() availability helper
- doctor detection for jsonb_integrity + markdown_body_completeness

No schema, no migration, no data touch.

Co-Authored-By: Claude Opus 4.7 <[email protected]>
garrytan added a commit that referenced this pull request on Apr 19, 2026
…kilinks, orphans (#216)

* fix(sync): remove nested transaction that deadlocks > 10 file syncs

sync.ts wraps the add/modify loop in engine.transaction(), and each importFromContent inside opens another one. PGLite's _runExclusiveTransaction is a non-reentrant mutex — the second call queues on the mutex the first is holding, and the process hangs forever in ep_poll. Reproduced with a 15-file commit: unpatched hangs, patched runs in 3.4s. Fix drops the outer wrap; per-file atomicity is correct anyway (one file's failure should not roll back the others).

(cherry picked from commit 4a1ac00)

* test(sync): regression guard for #132 top-level engine.transaction wrap

Reads src/commands/sync.ts verbatim and asserts no uncommented engine.transaction() call appears above the add/modify loop. Protects against silent reintroduction of the nested-mutex deadlock that hung > 10-file syncs forever in ep_poll.

* feat(utils): tryParseEmbedding() skip+warn sibling for availability path

parseEmbedding() throws on structural corruption — right call for ingest/migrate paths where silent skips would be data loss. Wrong call for search/rescore paths where one corrupt row in 10K would kill every query that touches it.

tryParseEmbedding() wraps parseEmbedding in try/catch: returns null on any shape that would throw, warns once per session so the bad row is visible in logs. Use it anywhere we'd rather degrade ranking than blow up the whole query.

Retrofit postgres-engine.getEmbeddingsByChunkIds (the #175 slice call site) — the 5-line rescore loop was the direct motivator. Keep the throwing parseEmbedding() for everything else (pglite-engine rowToChunk, migrate-engine round-trips, ingest).

* postgres-engine: scope search statement_timeout to the transaction

searchKeyword and searchVector run on a pooled postgres.js client (max: 10 by default).
The original code bounded each search with

    await sql`SET statement_timeout = '8s'`
    try { await sql`<query>` } finally { await sql`SET statement_timeout = '0'` }

but every tagged template is an independent round-trip that picks an arbitrary connection from the pool. The SET, the query, and the reset could all land on DIFFERENT connections. In practice the GUC sticks to whichever connection ran the SET and then gets returned to the pool — the next unrelated caller on that connection inherits the 8s timeout (clipping legitimate long queries) or the reset-to-0 (disabling the guard for whoever expected it). A crash in the middle leaves the state set permanently.

Wrap each search in sql.begin(async sql => …). postgres.js reserves a single connection for the transaction body, so the SET LOCAL, the query, and the implicit COMMIT all run on the same connection. SET LOCAL scopes the GUC to the transaction — COMMIT or ROLLBACK restores the previous value automatically, regardless of the code path out. Error paths can no longer leak the GUC.

No API change. Timeout value and semantics are identical (8s cap on search queries, no effect on embed --all / bulk import which runs outside these methods). Only one transaction per search — BEGIN + COMMIT round-trips are negligible next to a ranked FTS or pgvector query.

Also closes the earlier audit finding R4-F002 which reported the same pattern on searchKeyword. This PR covers both searchKeyword and searchVector so the pool-leak class is fully closed.

Tests (test/postgres-engine.test.ts, new file):
- No bare SET statement_timeout remains after stripping comments.
- searchKeyword and searchVector each wrap their query in sql.begin.
- Both use SET LOCAL.
- Neither explicitly clears the timeout with SET statement_timeout=0.

Source-level guardrails keep the fast unit suite DB-free.
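The pool-affinity failure can be simulated without a database. In this sketch, `Pool` and `Conn` are toy stand-ins (not postgres.js APIs) that model round-robin checkout: the SET and the query land on different connections, and the reserved-connection transaction shape fixes it.

```typescript
// Each simulated connection carries its own statement_timeout GUC.
type Conn = { id: number; statementTimeoutMs: number };

class Pool {
  private next = 0;
  constructor(private conns: Conn[]) {}
  // Round-robin checkout, like independent round-trips on a real pool.
  any(): Conn { return this.conns[this.next++ % this.conns.length]; }
}

function freshPool(): Pool {
  return new Pool([
    { id: 0, statementTimeoutMs: 0 },
    { id: 1, statementTimeoutMs: 0 },
  ]);
}

// Buggy shape: SET and the query are separate round-trips.
const pool = freshPool();
const setConn = pool.any();            // round-trip 1: SET statement_timeout = '8s'
setConn.statementTimeoutMs = 8000;
const queryConn = pool.any();          // round-trip 2: the search query itself
console.log(queryConn.statementTimeoutMs); // 0, so the query runs unbounded
console.log(setConn.statementTimeoutMs);   // 8000, leaked to the next caller
                                           // that draws this connection

// Fixed shape: reserve one connection for the transaction so SET LOCAL,
// the query, and COMMIT (which restores the GUC) all share it.
function withTimeout<T>(conn: Conn, ms: number, query: (c: Conn) => T): T {
  const saved = conn.statementTimeoutMs;
  conn.statementTimeoutMs = ms;              // SET LOCAL
  try { return query(conn); }
  finally { conn.statementTimeoutMs = saved; } // restored on COMMIT/ROLLBACK
}

const tx = freshPool().any();
withTimeout(tx, 8000, (_c) => { /* bounded search query runs here */ });
console.log(tx.statementTimeoutMs); // 0, no leak even on error paths
```

The try/finally mirrors what a real transaction gives for free: SET LOCAL cannot outlive the transaction, whatever path exits it.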
Live Postgres coverage of the search path is in test/e2e/search-quality.test.ts, which continues to exercise these methods end-to-end against pgvector when DATABASE_URL is set.

(cherry picked from commit 6146c3b)

* feat(orphans): add gbrain orphans command for finding under-connected pages

Surfaces pages with zero inbound wikilinks. Essential for content enrichment cycles in KBs with 1000+ pages. By default filters out auto-generated pages, raw sources, and pseudo-pages where no inbound links are expected; --include-pseudo to disable. Supports text (grouped by domain), --json, --count outputs. Also exposed as find_orphans MCP operation. Tests cover basic detection, filtering, all output modes.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
(cherry picked from commit f50954f)

* feat(extract): support Obsidian wikilinks + wiki-style domain slugs in canonical extractor

extractEntityRefs now recognizes both syntaxes equally:

    [Name](people/slug)   -- upstream original
    [[people/slug|Name]]  -- Obsidian wikilink (new)

Extends DIR_PATTERN to include domain-organized wiki slugs used by Karpathy-style knowledge bases:

- entities (legacy prefix some brains keep during migration)
- projects (gbrain canonical, was missing from regex)
- tech, finance, personal, openclaw (domain-organized wiki roots)

Before this change, a 2,100-page brain with wikilinks throughout extracted zero auto-links on put_page because the regex only matched markdown-style [name](path). After: 1,377 new typed edges on a single extract --source db pass over the same corpus.

Matches the behavior of the extract.ts filesystem walker (which already handled wikilinks as of the wiki-markdown-compat fix wave), so the db and fs sources now produce the same link graph from the same content. Both patterns share the DIR_PATTERN constant so adding a new entity dir only requires updating one string.
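The dual-syntax extraction can be sketched as follows. The directory names in `DIR_PATTERN` come from the commit message; `extractRefs` and the two regexes are illustrative, not the real `extractEntityRefs` implementation.

```typescript
// Shared directory alternation: adding a new entity dir means editing
// one string, and both link shapes pick it up.
const DIR_PATTERN = "(?:people|entities|projects|tech|finance|personal|openclaw)";

// [Name](people/slug) — markdown-style link
const MD_LINK = new RegExp(`\\[([^\\]]+)\\]\\((${DIR_PATTERN}/[\\w-]+)\\)`, "g");
// [[people/slug|Name]] — Obsidian wikilink
const WIKI_LINK = new RegExp(`\\[\\[(${DIR_PATTERN}/[\\w-]+)\\|([^\\]]+)\\]\\]`, "g");

function extractRefs(body: string): Array<{ name: string; slug: string }> {
  const refs: Array<{ name: string; slug: string }> = [];
  for (const m of body.matchAll(MD_LINK)) refs.push({ name: m[1], slug: m[2] });
  for (const m of body.matchAll(WIKI_LINK)) refs.push({ name: m[2], slug: m[1] });
  return refs;
}

const refs = extractRefs(
  "Met [Ada](people/ada-lovelace) about [[projects/engine|the Engine]]."
);
console.log(refs);
// [ { name: "Ada", slug: "people/ada-lovelace" },
//   { name: "the Engine", slug: "projects/engine" } ]
```

Note the capture-group order flips between the two shapes: markdown puts the display name first, wikilinks put the slug first.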
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
(cherry picked from commit 1cfb156)

* feat(doctor): jsonb_integrity + markdown_body_completeness detection

Add two v0.12.1-era reliability checks to `gbrain doctor`:

- `jsonb_integrity` scans the 4 known write sites from the v0.12.0 double-encode bug (pages.frontmatter, raw_data.data, ingest_log.pages_updated, files.metadata) and reports rows where jsonb_typeof(col) = 'string'. The fix hint points at `gbrain repair-jsonb` (the standalone repair command shipped in v0.12.1).
- `markdown_body_completeness` flags pages whose compiled_truth is <30% of the raw source content length when raw has multiple H2/H3 boundaries. Heuristic only; suggests `gbrain sync --force` or `gbrain import --force <slug>`.

Also adds test/e2e/jsonb-roundtrip.test.ts — the regression coverage that should have caught the original double-encode bug. Hits all four write sites against real Postgres and asserts jsonb_typeof='object' plus `->>'key'` returns the expected scalar.

Detection only: doctor diagnoses, `gbrain repair-jsonb` treats. No overlap with the standalone repair path.

* chore: bump to v0.12.3 + changelog (reliability wave)

Master shipped v0.12.1 (extract N+1 + migration timeout) and v0.12.2 (JSONB double-encode + splitBody + wiki types + parseEmbedding) while this wave was mid-flight. Ships the remaining pieces as v0.12.3:

- sync deadlock (#132, @sunnnybala)
- statement_timeout scoping (#158, @garagon)
- Obsidian wikilinks + domain patterns (#187 slice, @knee5)
- gbrain orphans command (#187 slice, @knee5)
- tryParseEmbedding() availability helper
- doctor detection for jsonb_integrity + markdown_body_completeness

No schema, no migration, no data touch.
Co-Authored-By: Claude Opus 4.7 <[email protected]>

* docs: update project documentation for v0.12.3

CLAUDE.md:
- Add src/commands/orphans.ts entry
- Expand src/commands/doctor.ts with v0.12.3 jsonb_integrity + markdown_body_completeness check descriptions
- Update src/core/link-extraction.ts to mention Obsidian wikilinks + extended DIR_PATTERN (entities/projects/tech/finance/personal/openclaw)
- Update src/core/utils.ts to mention tryParseEmbedding sibling
- Update src/core/postgres-engine.ts to note statement_timeout scoping + tryParseEmbedding usage in getEmbeddingsByChunkIds
- Add Key commands added in v0.12.3 section (orphans, doctor checks)
- Add test/orphans.test.ts, test/postgres-engine.test.ts, updated descriptions for test/sync.test.ts, test/doctor.test.ts, test/utils.test.ts
- Add test/e2e/jsonb-roundtrip.test.ts with note on intentional overlap
- Bump operation count from ~36 to ~41 (find_orphans shipped in v0.12.3)

README.md:
- Add gbrain orphans to ADMIN commands block

Co-Authored-By: Claude Opus 4.7 <[email protected]>

---------

Co-authored-by: sunnnybala <[email protected]>
Co-authored-by: Gustavo Aragon <[email protected]>
Co-authored-by: Clevin Canales <[email protected]>
Co-authored-by: Claude Sonnet 4.6 <[email protected]>
Co-authored-by: Clevin Canales <[email protected]>
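The tryParseEmbedding() helper described in this merge follows a common skip-and-warn shape; here is a minimal sketch, in which `parseEmbedding` is a simplified stand-in for the real strict parser and the session-warning flag is an assumption about the mechanism.

```typescript
// Warn at most once per session so the bad row shows up in logs without
// flooding them.
let warnedThisSession = false;

// Stand-in for the real strict parser: throws on structural corruption.
function parseEmbedding(raw: unknown): number[] {
  if (!Array.isArray(raw) || raw.some((x) => typeof x !== "number")) {
    throw new Error("corrupt embedding");
  }
  return raw as number[];
}

// Availability-path sibling: never throws, returns null for any shape
// the strict parser would reject.
function tryParseEmbedding(raw: unknown): number[] | null {
  try {
    return parseEmbedding(raw);
  } catch (err) {
    if (!warnedThisSession) {
      warnedThisSession = true;
      console.warn("tryParseEmbedding: skipping corrupt row:", err);
    }
    return null; // degrade ranking instead of killing the whole query
  }
}

// Rescore-loop shape: one corrupt row among many is skipped, not fatal.
const rows: unknown[] = [[0.1, 0.2], "garbage", [0.3, 0.4]];
const usable = rows
  .map(tryParseEmbedding)
  .filter((e): e is number[] => e !== null);
console.log(usable.length); // 2
```

The split keeps the throwing parser for ingest and migration, where silently dropping a row would be data loss rather than a ranking degradation.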
garrytan added a commit that referenced this pull request on Apr 20, 2026
Extract, import, and sync now stream per-file progress to stderr through
the shared reporter. All three kept their stdout summaries + JSON
action-events intact so existing tests + agent scripts are unaffected.
- extract.ts (4 paths: links/timeline × fs/db): replaced the ad-hoc
`process.stderr.write({event:"progress"...})` lines with reporter
ticks. Same channel (stderr), canonical schema now, visible in both
text and --json modes. Stdout action-events (`add_link` /
`add_timeline`) untouched — tests grep them.
- import.ts: the logProgress() function that printed every 100 files to
stdout is now a progress.tick() call per file. Rate-gated by the
reporter. Stdout still gets the final "Import complete (Xs)" summary
and the --json payload.
- sync.ts: three new phases (`sync.deletes`, `sync.renames`,
`sync.imports`) tick per file, so big syncs show each step rather than
a single end-of-run summary. Phase hierarchy ready to be child()-chained
into runImport / runEmbed later, per Codex review #26.
Updated the #132 nested-transaction regression test in test/sync.test.ts
to also accept the new hoisted-loop shape — the guarantee (this loop is
not wrapped in engine.transaction) still holds.
1686 unit tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
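The per-file tick plus rate gating this commit describes can be sketched as follows. `Phase` and its fields are illustrative, not the real reporter API; only the idea (tick every file, emit at most once per interval, write to stderr) comes from the commit.

```typescript
// Rate-gated progress phase: tick() is cheap to call per item, but only
// one line per interval actually reaches stderr.
class Phase {
  private done = 0;
  private lastEmit = 0;
  public emitted = 0;

  constructor(
    private name: string,
    private intervalMs = 1000,
    private now: () => number = Date.now // injectable clock for testing
  ) {}

  tick(): void {
    this.done++;
    const t = this.now();
    if (t - this.lastEmit < this.intervalMs) return; // rate gate
    this.lastEmit = t;
    this.emitted++;
    // Progress goes to stderr so stdout summaries stay parseable.
    process.stderr.write(`[${this.name}] ${this.done} done\n`);
  }
}

// Fake clock: 5000 per-file ticks over 5 simulated seconds emit 5 lines,
// not 5000.
let fake = 0;
const phase = new Phase("sync.imports", 1000, () => fake);
for (let i = 0; i < 5000; i++) {
  fake += 1; // 1 ms of simulated work per file
  phase.tick();
}
console.log(phase.emitted); // 5
```

The injectable clock is what lets a rate gate be unit-tested deterministically, without sleeping.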
garrytan added a commit that referenced this pull request on Apr 22, 2026
…t-visible heartbeats) (#293)

* feat(progress): step 1 - shared ProgressReporter + CliOptions

Adds the foundation for v0.14.2's bulk-action progress streaming work:

- src/core/progress.ts: dependency-free reporter with auto/human/json/quiet modes, TTY-aware rendering, time+item rate gating, heartbeat helper for slow single queries, dot-composed child phases, EPIPE defense (both sync throw and async 'error' event), and a singleton module-level signal coordinator so SIGINT/SIGTERM emits abort events for all live phases without leaking per-instance listeners.
- src/core/cli-options.ts: parseGlobalFlags() for --quiet / --progress-json / --progress-interval=<ms> (both space and = forms), plus cliOptsToProgressOptions() that resolves to the right mode. Non-TTY default is human-plain one-line-per-event; JSON is explicit opt-in so shell pipelines don't suddenly see structured noise.
- test/progress.test.ts (17 cases): mode resolution, rate gating, no-fake-totals on heartbeat paths, EPIPE paths, SIGINT singleton, child phase composition.
- test/cli-options.test.ts (14 cases): flag parsing, invalid values, interleaved flags, mode resolution.

Follow-ups wire doctor/embed/files/export/extract/import/sync/migrate/repair-jsonb/backlinks/orphans/lint/integrity/eval/autopilot/jobs plus the apply-migrations orchestrators through this reporter, and route Minion handlers to job.updateProgress instead of stderr. See the plan in ~/.claude/plans/.

1682 unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* feat(progress): step 2 - wire global flags into cli.ts

Parse --quiet / --progress-json / --progress-interval from argv BEFORE command dispatch, strip them, stash resolved CliOptions on a module-level singleton (same pattern as Commander's program.opts()) and on every OperationContext created for shared-op dispatch.

- src/cli.ts: parseGlobalFlags(rawArgs) at the top of main(); setCliOptions once; dispatch sees only the stripped argv. Fixes the "gbrain --progress-json doctor" unknown-command case that Codex flagged.
- src/core/cli-options.ts: expose setCliOptions/getCliOptions/_resetCliOptionsForTest singleton. Commands that want progress call getCliOptions() to construct their reporter.
- src/core/operations.ts: OperationContext gains optional cliOpts field so shared-op handlers (and MCP-invoked ops that need a reporter) can read the same settings. MCP callers leave it undefined and consumers default to quiet.
- test/cli-options.test.ts: +4 cases covering singleton round-trip and an integration smoke spawning `bun src/cli.ts --progress-json --version` to prove the global flag survives dispatch.

45 relevant unit tests pass (progress + cli-options + cli.test.ts).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* feat(progress): step 3a - doctor + orphans heartbeat streaming

Doctor on a 52K-page brain used to sit silent for 10+ minutes while the DB checks ran, then get killed by an agent timeout. Wired through the new reporter so agents see which check is running and the slow ones heartbeat every second.

doctor.ts:
- Start a single `doctor.db_checks` phase around the DB section, with a per-check heartbeat before each step (connection, pgvector, rls, schema_version, embeddings, graph_coverage, integrity, jsonb_integrity, markdown_body_completeness).
- jsonb_integrity now scans 5 targets, not 4: added page_versions.frontmatter so the check surface matches `repair-jsonb` (per Codex review of the plan — the old 4-target scan missed a known repair site). Per-target heartbeat so 50K-row scans show incremental progress.
- markdown_body_completeness: wrap the existing query in a 1s heartbeat timer. The regex scan over rd.data ->> 'content' can't be paginated usefully; this just lets agents see life during the sequential scan. No fake totals — the LIMIT 100 query has no meaningful total count.
- integrity sample: same heartbeat pattern around the 500-page scan.

orphans.ts:
- findOrphans() wraps the NOT EXISTS anti-join in a 1s heartbeat. Keyset pagination was considered and rejected: without an index on links.to_page_id it's no faster than the full scan, and may re-plan the anti-join per batch. A schema migration adding that index is the right fix and is queued for v0.14.3.

Follow-ups:
- Step 3b: wire embed/files/export (the \r-only stdout offenders).
- Step 5: end-to-end progress test spawning `gbrain doctor --progress-json` against a fixture brain, asserting stderr events and clean stdout.

All existing unit tests continue to pass (76/76 in doctor + orphans + progress + cli-options).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* feat(progress): step 3b - embed + files + export stderr progress

Replaces the \r-on-stdout progress pattern in the three worst offenders (embed, files sync, export) with the shared reporter on stderr. Stdout now carries only final summaries, so scripts and tests that grep for counts ("Embedded N chunks", "Files sync complete", "Exported N pages") still work when output is piped.

- embed.ts: runEmbedCore accepts an optional onProgress callback. The CLI wrapper builds a reporter and passes reporter.tick(); Minion handlers will pass job.updateProgress in Step 4. Worker-pool is single-threaded JS so no rate-gate race (per Codex review #18).
- files.ts syncFiles(): tick per file; summary preserved on stdout.
- export.ts: tick per page; summary preserved on stdout.

Also fixes a --quiet flag collision. `skillpack-check` has its own --quiet mode (suppress all stdout). parseGlobalFlags strips --quiet globally now, and skillpack-check reads the resolved CliOptions singleton via getCliOptions() instead of re-parsing argv. Test updated to match the stripping behavior.

1686 unit tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* feat(progress): step 3c - extract + import + sync reporter streaming

Extract, import, and sync now stream per-file progress to stderr through the shared reporter. All three kept their stdout summaries + JSON action-events intact so existing tests + agent scripts are unaffected.

- extract.ts (4 paths: links/timeline × fs/db): replaced the ad-hoc `process.stderr.write({event:"progress"...})` lines with reporter ticks. Same channel (stderr), canonical schema now, visible in both text and --json modes. Stdout action-events (`add_link` / `add_timeline`) untouched — tests grep them.
- import.ts: the logProgress() function that printed every 100 files to stdout is now a progress.tick() call per file. Rate-gated by the reporter. Stdout still gets the final "Import complete (Xs)" summary and the --json payload.
- sync.ts: three new phases (`sync.deletes`, `sync.renames`, `sync.imports`) tick per file, so big syncs show each step rather than a single end-of-run summary. Phase hierarchy ready to be child()-chained into runImport / runEmbed later, per Codex review #26.

Updated the #132 nested-transaction regression test in test/sync.test.ts to also accept the new hoisted-loop shape — the guarantee (this loop is not wrapped in engine.transaction) still holds.

1686 unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* feat(progress): step 3d - migrate/repair/backlinks/lint/integrity/eval

Wires the remaining bulk commands through the reporter:

- migrate-engine: phase starts (migrate.copy_pages, migrate.copy_links), per-page tick. Old "Progress: N/total" stdout logs replaced by stderr ticks; final stdout summary preserved.
- repair-jsonb: per-column start + a heartbeat timer while each UPDATE runs (minutes on 50K-row tables). CRITICAL: stdout stays clean so migrations/v0_12_2.ts's JSON.parse(child.stdout) still works. Per Codex review #12.
- backlinks: 1s heartbeat around findBacklinkGaps() (sync double-walk of the brain dir).
- lint: tick per page; per-issue stdout output preserved.
- integrity auto: tick per page in the main resolver loop. The separate ~/.gbrain/integrity-progress.jsonl resume marker is untouched (its role shifts from live progress reporting to resume-only).
- eval: add an onProgress option to core's runEval(), CLI wraps with a reporter. Phases: eval.single / eval.ab. Tick per query. core/search/eval.ts gains a RunEvalOptions type so future callers (MCP eval op, Minion handlers) can also hook in without the reporter.

1686 unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* feat(progress): step 3e - onProgress callbacks on core libs

- src/core/embedding.ts: embedBatch() gains an optional EmbedBatchOptions.onBatchComplete callback, fired after each 100-item sub-batch. CLI wrappers pass reporter.tick; Minion handlers can pass job.updateProgress.
- src/core/enrichment-service.ts: enrichEntities() config gains onProgress(done, total, name) fired after each entity. Same split: CLI -> reporter, Minion -> DB-backed progress.

No CLI behavior change on its own. Wiring these callbacks into the Minion handlers is Step 4.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* feat(progress): step 4 - orchestrators + upgrade + minion handlers

- cli-options.ts: childGlobalFlags() returns the flag suffix to append to child gbrain subprocesses. Empty string by default, " --quiet --progress-json" when the parent has them set, so child behavior inherits the parent's progress-mode without scattering string-concat logic across every execSync site.
- migrations/v0_12_2.ts: each execSync inherits the parent's global flags. Phase C (repair-jsonb --dry-run --json) pins explicit stdio to ['ignore','pipe','inherit'] so child stderr streams straight through while stdout stays captured for JSON.parse. Per Codex review #12.
- migrations/v0_12_0.ts + v0_11_0.ts: same childGlobalFlags wiring at each gbrain-subcommand execSync.
- upgrade.ts: post-upgrade timeout bumped 300s → 30min (1_800_000 ms) with GBRAIN_POST_UPGRADE_TIMEOUT_MS override. The old 300s cap killed v0.12.0 graph-backfill migrations on 50K+ brains; the heartbeat wiring added in v0.14.2 makes long waits observable, so a generous ceiling no longer means users stare at a silent terminal.
- jobs.ts: the embed Minion handler passes job.updateProgress as the onProgress callback, so per-job progress is durable in minion_jobs and readable via `gbrain jobs get <id>`. Primary Minion progress channel is DB-backed — stderr from `jobs work` stays coarse for daemon liveness only. Per Codex review #20.

1686 unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* feat(progress): step 5 - E2E doctor-progress test + CI guard

scripts/check-progress-to-stdout.sh greps src/ for the banned `process.stdout.write('\r…')` pattern that v0.14.2 removed from the bulk-action codepaths. Wired into the `bun run test` script so any future regression that puts progress back on stdout fails fast. An empty allowlist documents the position: every known call site was migrated; new exceptions need a rationale in the allowlist.

test/e2e/doctor-progress.test.ts (Tier 1, needs Postgres + pgvector):
- `gbrain --progress-json doctor --json`: stderr carries JSONL progress events with the canonical {event, phase, ts} shape, starts + finishes for `doctor.db_checks`. Stdout stays parseable JSON — no progress pollution.
- `gbrain doctor` (no flag): human-plain progress goes to stderr only, stdout stays free of `[doctor.db_checks]`.
- `gbrain --quiet doctor`: reporter emits nothing; doctor still runs to completion.

test/cli-options.test.ts: +2 spawning integration tests. One verifies `gbrain --progress-json --version` keeps stdout clean of progress events (single-shot commands that don't use a reporter aren't affected). One guards the skillpack-check --quiet regression — --quiet suppresses stdout by reading the resolved CliOptions singleton, not re-parsing argv.

Full test matrix:
bun run test -> 1726 pass / 184 skipped (no DB) / 0 fail
bun run test:e2e -> 136 pass / 13 skipped / 0 fail

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* feat(progress): step 6 - docs + v0.14.2 release bump

- VERSION + package.json bumped to 0.14.2.
- docs/progress-events.md (new): canonical JSON event schema reference. Stable from v0.14.2, additive only. Lists every phase name shipped in this release, the five event types (start/tick/heartbeat/finish/abort), the TTY/non-TTY rendering rules, subprocess inheritance semantics, and the Minion DB-backed progress model.
- CLAUDE.md: "Bulk-action progress reporting" section under the build instructions; Key files entries for src/core/progress.ts, src/core/cli-options.ts, scripts/check-progress-to-stdout.sh, and docs/progress-events.md; doctor.ts entry updated to note the v0.14.2 5-target jsonb_integrity scan + heartbeat wiring.
- CHANGELOG.md v0.14.2: full release summary per project voice rules. The "numbers that matter" table, per-command before/after grid, backward-compat warnings for stdout→stderr moves, and an itemized changes section covering reporter/CLI plumbing/schema/Minion handlers/doctor fixes/upgrade timeout/CI guard/tests. No em dashes. Real file paths, real commands, real numbers.
- skills/migrations/v0.14.2.md (new): agent migration note. Mechanical step is "nothing" since v0.14.2 is purely additive. Walks agents through the three new global flags, the 14 wired commands, the event schema cheat sheet, Minion progress via job.updateProgress, and scripts/verification commands.

Full test matrix:
bun run test (unit + guards) -> 1726 pass / 184 skipped / 0 fail
bun run test:e2e (Postgres) -> 141 pass / 8 skipped / 0 fail

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* chore: bump version to 0.15.2, restore master's [0.14.2] CHANGELOG entry

Master sits at 0.14.2 (reliability wave). This PR lands on top as 0.15.2 (progress streaming wave). Splits the merge-time combined CHANGELOG entry back into two discrete release sections so history stays honest:

- [0.15.2] = progress reporter, CliOptions, 14 wired commands, Minion embed handler, doctor jsonb_integrity 5-target fix, upgrade timeout bump, CI guard, progress unit+E2E tests.
- [0.14.2] = master's eight root-cause bug fixes, restored verbatim from origin/master.

Touched files:
- VERSION + package.json: 0.14.2 -> 0.15.2 (next patch off master).
- skills/migrations/v0.14.2.md -> skills/migrations/v0.15.2.md (rename + rewrite frontmatter + body to v0.15.2).
- CHANGELOG.md: split into two entries; progress-wave refs renamed v0.14.2 -> v0.15.2; reliability-wave entry restored from master.
- src/core/progress.ts, src/commands/doctor.ts, src/commands/sync.ts, src/commands/upgrade.ts, docs/progress-events.md, test/sync.test.ts: progress-wave v0.14.2 references -> v0.15.2.

The remaining v0.14.2 references in test/e2e/migration-flow.test.ts (Bug 3 context) and CLAUDE.md (reliability-wave key commands, Bug 3 ledger move) correctly point at master's 0.14.2 release.

Test matrix after version bump:
bun run test -> 1780 pass / 179 skipped / 0 fail

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
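The heartbeat helper pattern used throughout this wave can be sketched as follows. `withHeartbeat` and its signature are assumptions for illustration, not the real src/core/progress.ts API; only the pattern (a timer emits liveness lines while one slow promise runs, with no fake totals) comes from the commits.

```typescript
// Run a single long-running promise while a timer emits liveness lines.
async function withHeartbeat<T>(
  label: string,
  work: Promise<T>,
  intervalMs = 1000,
  emit: (line: string) => void = (l) => process.stderr.write(l + "\n")
): Promise<T> {
  let beats = 0;
  const timer = setInterval(() => {
    beats++;
    emit(`[${label}] still running (beat ${beats})`); // no fake totals
  }, intervalMs);
  try {
    return await work; // e.g. a long anti-join or sequential scan
  } finally {
    clearInterval(timer); // always stop, even if the query throws
  }
}

// Demo: a 250ms "query" with a 100ms heartbeat emits at least one beat.
const lines: string[] = [];
const slow = new Promise<string>((resolve) => setTimeout(() => resolve("rows"), 250));
const result = await withHeartbeat("doctor.db_checks", slow, 100, (l) =>
  lines.push(l)
);
console.log(result, lines.length >= 1); // "rows true"
```

The try/finally matters: without it, a rejected query would leave the interval ticking forever.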
Summary
Symptom: gbrain sync (incremental) hangs indefinitely when the diff touches more than 10 syncable files. --full always works. Reproduced with a 15-file commit: unpatched sync hangs forever, patched sync completes in 3.4 seconds on the same diff.

Cause: src/commands/sync.ts wraps the add/modify loop in engine.transaction() when useTransaction > 10. Each file in the loop then calls importFromContent, which opens another engine.transaction() on the same PGLite instance. PGLite transactions are not reentrant — the inner call queues on the same _runExclusiveTransaction mutex the outer is holding, producing a classic recursive-mutex deadlock. Main thread parks in ep_poll, workers park in futex_wait_queue, zero CPU advancement.

Fix: drop the outer transaction wrap. Each file's inner transaction is already atomic; per-file atomicity is also the right granularity (one file's failure should not roll back the others' successful imports).
Diff: 21 insertions, 21 deletions, no behavior change below the ≤10 threshold.
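The shape of the change is roughly the following sketch. Identifier names follow this summary; the bodies are illustrative stand-ins, not the actual diff (in the real code, processAddsModifies closes over engine rather than taking it as a parameter).

```typescript
// Before: the > 10 branch wrapped the loop, so the outer transaction held
// PGLite's exclusive mutex while each per-file import queued a second
// transaction on the same mutex:
//
//   await engine.transaction(async () => {
//     await processAddsModifies();   // inner per-file transactions deadlock
//   });

// After: the loop always runs unwrapped. Each file keeps its own atomic
// inner transaction, so one file's failure doesn't roll back the others.
async function processAddsModifies(
  engine: { importFile: (path: string) => Promise<void> },
  files: string[]
): Promise<void> {
  for (const path of files) {
    await engine.importFile(path); // opens its own engine.transaction()
  }
}

// Demo with a stub engine: every file is imported, in order.
const imported: string[] = [];
await processAddsModifies(
  { importFile: async (p) => { imported.push(p); } },
  ["a.md", "b.md", "c.md"]
);
console.log(imported.length); // 3
```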
Details
The deadlock chain
1. The useTransaction > 10 branch at sync.ts:225 enters the outer wrap. engine.transaction in pglite-engine.ts delegates to PGLite with async () => { await processAddsModifies(); } — it does not take txEngine as a parameter.
2. processAddsModifies is defined in the outer scope of performSync and closes over the original engine variable, not the txEngine passed into the callback. So every inner importFile(engine, …) call uses the original PGLiteEngine, whose this.db is the raw PGlite instance (not the outer tx object).
3. importFromContent (import-file.ts:95) calls this.db.transaction(...) on the original PGlite instance — the same instance whose exclusive-transaction mutex is currently held by the outer transaction.
4. transaction() (from @electric-sql/pglite chunk-HDIMFN25.js): _runExclusiveTransaction is a mutex. The inner call queues on the same mutex the outer is still holding. The outer can't release — it's awaiting the user callback, which is awaiting the inner transaction, which is queued forever.

Reproduction
Evidence captured live:

- /proc/<pid>/task/*/wchan → main thread in ep_poll, two worker threads in futex_wait_queue
- utime 205 / stime 47 unchanged across three 2-second samples
- patched run: real 0m3.4s with status synced, all pages affected, chunks created

Why this has flown under the radar
- The bug needs a git diff with more than 10 files that pass isSyncable() (excludes README.md, index.md, schema.md, log.md, hidden dirs, non-.md/.mdx)
- Smaller diffs take the else branch, bypassing the wrap entirely
- gbrain import (bulk ingest) and first-sync go through performFullSync → runImport, which has no outer wrap
- sync --full works as an apparent workaround, leading users to blame "PGLite is slow" rather than report a bug

Where it bites in production
- [Source: ...] citations
- git commit + gbrain sync

Why removing the outer wrap is safe
- importFromContent is already atomic per file (its inner transaction commits the pages row, tags, chunks, and page_versions together)
- The performFullSync path has always done per-file inner transactions with no outer wrap — this fix simply makes the incremental path match

The diff