
fix(sync): deadlock on incremental sync with > 10 modified files#132

Open
sunnnybala wants to merge 1 commit into garrytan:master from
sunnnybala:fix/sync-nested-transaction-deadlock

Conversation

@sunnnybala
Contributor

Summary

Symptom: gbrain sync (incremental) hangs indefinitely when the diff touches more than 10 syncable files. --full always works. Reproduced with a 15-file commit: unpatched sync hangs forever, patched sync completes in 3.4 seconds on the same diff.

Cause: src/commands/sync.ts wraps the add/modify loop in engine.transaction() when useTransaction > 10. Each file in the loop then calls importFromContent, which opens another engine.transaction() on the same PGLite instance. PGLite transactions are not reentrant — the inner call queues on the same _runExclusiveTransaction mutex the outer is holding, producing a classic recursive-mutex deadlock. Main thread parks in ep_poll, workers park in futex_wait_queue, zero CPU advancement.

Fix: drop the outer transaction wrap. Each file's inner transaction is already atomic; per-file atomicity is also the right granularity (one file's failure should not roll back the others' successful imports).

Diff: 21 insertions, 21 deletions, no behavior change below the ≤10 threshold.


Details

The deadlock chain

  1. Incremental sync enters the useTransaction > 10 branch at sync.ts:225:
    if (useTransaction) {
      await engine.transaction(async () => { await processAddsModifies(); });
    }
  2. engine.transaction in pglite-engine.ts delegates to PGLite:
    async transaction<T>(fn) {
      return this.db.transaction(async (tx) => {
        const txEngine = Object.create(this) as PGLiteEngine;
        Object.defineProperty(txEngine, 'db', { get: () => tx });
        return fn(txEngine);
      });
    }
  3. Closure pitfall: the sync.ts callback is async () => { await processAddsModifies(); } — it does not take txEngine as a parameter. processAddsModifies is defined in the outer scope of performSync and closes over the original engine variable, not the txEngine passed into the callback. So every inner importFile(engine, …) call uses the original PGLiteEngine, whose this.db is the raw PGlite instance (not the outer tx object).
  4. Inside importFromContent (import-file.ts:95):
    await engine.transaction(async (tx) => {
      // putPage, upsertChunks, tag reconciliation, createVersion
    });
    This calls this.db.transaction(...) on the original PGlite instance — the same instance whose exclusive-transaction mutex is currently held by the outer transaction.
  5. PGLite's transaction() (from @electric-sql/pglite chunk-HDIMFN25.js):
    async transaction(s) {
      return await this._checkReady(),
        await this._runExclusiveTransaction(async () => {
          await /*BEGIN*/ call(this, "BEGIN");
          // ... await user callback ...
          // COMMIT / ROLLBACK
        })
    }
    _runExclusiveTransaction is a mutex. The inner call queues on the same mutex the outer is still holding. The outer can't release — it's awaiting the user callback, which is awaiting the inner transaction, which is queued forever.
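The non-reentrant behavior in step 5 is easy to reproduce standalone. Below is a minimal sketch — a toy async mutex with the same queueing shape, NOT PGLite's actual implementation — raced against a timer so the demo observes the hang instead of reproducing it:

```typescript
// Minimal non-reentrant async mutex with the same queueing shape as
// _runExclusiveTransaction (a toy stand-in, NOT PGLite's real code).
class AsyncMutex {
  private tail: Promise<void> = Promise.resolve();
  runExclusive<T>(fn: () => Promise<T>): Promise<T> {
    const prev = this.tail;
    let release!: () => void;
    this.tail = new Promise<void>((res) => (release = res));
    return (async () => {
      await prev; // queue behind whoever currently holds the mutex
      try {
        return await fn();
      } finally {
        release();
      }
    })();
  }
}

const mutex = new AsyncMutex();

// Outer "transaction" opens a nested one on the same mutex — the sync.ts /
// importFromContent shape. The inner acquire queues behind the outer's
// release, and the outer's release waits on the inner: a cycle.
const nested = mutex.runExclusive(async () => {
  await mutex.runExclusive(async () => "inner"); // parks forever
  return "outer";
});

// Observe the hang without hanging: race against a 100 ms timer.
const resultPromise = Promise.race([
  nested,
  new Promise<string>((res) => setTimeout(() => res("timed out"), 100)),
]);
resultPromise.then((r) => console.log(r)); // "timed out"
```

The outer `runExclusive` can only release after its callback resolves, and that callback is awaiting an acquire queued behind that very release — the same cycle sync.ts and importFromContent form.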

Reproduction

# From a clean gbrain-brain repo
cd /your/gbrain-brain
for f in $(ls people/*.md | grep -v README | head -11); do
  printf "\n<!-- repro -->\n" >> "$f"
done
git commit -am "repro: 11 non-README .md files"

# Unpatched: hangs forever. ep_poll on main thread, futex_wait on workers.
# Zero CPU advancement after entering transaction branch.
gbrain sync --no-pull --no-embed

# Workaround that already exists: --full uses performFullSync which
# goes through runImport() (a different code path without the outer wrap).
gbrain sync --full --no-pull --no-embed

Evidence captured live:

  • Process state: /proc/<pid>/task/*/wchan → main ep_poll, two worker threads futex_wait_queue
  • CPU ticks frozen: utime 205 stime 47 unchanged across three 2-second samples
  • After applying this patch: same 16-file diff completes in real 0m3.4s with status synced, all pages affected, chunks created

Why this has flown under the radar

  • Only triggers when a single git diff contains more than 10 files that pass isSyncable() (excludes README.md, index.md, schema.md, log.md, hidden dirs, non-.md/.mdx)
  • Most day-to-day commits are well under 10 files and take the else branch, bypassing the wrap entirely
  • gbrain import (bulk ingest) and first-sync go through performFullSync → runImport, which has no outer wrap
  • sync --full works as an apparent workaround, leading users to blame "PGLite is slow" rather than report a bug
  • The hang is completely silent — no log, no error, no CPU. Looks indistinguishable from a very slow sync

Where it bites in production

  • Large initial backfills (importing months of daily notes, calendar events, email digests in one commit)
  • Citation-fixer / maintenance runs that touch many people/company pages at once — e.g. a dream-cycle job that patches 12+ pages for missing [Source: ...] citations
  • Repo migrations / mass rewrites where a single commit renames or reformats many files
  • Any agent that batches brain updates before a single git commit + gbrain sync

Why removing the outer wrap is safe

  • importFromContent is already atomic per file (its inner transaction commits the pages row, tags, chunks, and page_versions together)
  • The outer wrap did not add any atomicity guarantee beyond "all-or-nothing across files", which is actually worse semantics: a single malformed file would roll back every successfully imported file in the same sync run
  • Per-file atomicity means partial sync progress survives a transient failure mid-run, and the next sync picks up where the failed one left off
  • performFullSync path has always done per-file inner transactions with no outer wrap — this fix simply makes the incremental path match

The diff

-  // Process adds and modifies
-  const useTransaction = (filtered.added.length + filtered.modified.length) > 10;
-  const processAddsModifies = async () => {
-    for (const path of [...filtered.added, ...filtered.modified]) {
-      const filePath = join(repoPath, path);
-      if (!existsSync(filePath)) continue;
-      try {
-        const result = await importFile(engine, filePath, path, { noEmbed });
-        if (result.status === 'imported') {
-          chunksCreated += result.chunks;
-          pagesAffected.push(result.slug);
-        }
-      } catch (e: unknown) {
-        const msg = e instanceof Error ? e.message : String(e);
-        console.error(`  Warning: skipped ${path}: ${msg}`);
-      }
-    }
-  };
-
-  if (useTransaction) {
-    await engine.transaction(async () => { await processAddsModifies(); });
-  } else {
-    await processAddsModifies();
-  }
+  // Process adds and modifies.
+  //
+  // NOTE: do NOT wrap this loop in engine.transaction(). importFromContent
+  // already opens its own inner transaction per file, and PGLite transactions
+  // are not reentrant — they acquire the same _runExclusiveTransaction mutex,
+  // so a nested call from inside a user callback queues forever on the mutex
+  // the outer transaction is still holding. Result: incremental sync hangs in
+  // ep_poll whenever the diff crosses the old > 10 threshold that used to
+  // trigger the outer wrap. Per-file atomicity is also the right granularity:
+  // one file's failure should not roll back the others' successful imports.
+  for (const path of [...filtered.added, ...filtered.modified]) {
+    const filePath = join(repoPath, path);
+    if (!existsSync(filePath)) continue;
+    try {
+      const result = await importFile(engine, filePath, path, { noEmbed });
+      if (result.status === 'imported') {
+        chunksCreated += result.chunks;
+        pagesAffected.push(result.slug);
+      }
+    } catch (e: unknown) {
+      const msg = e instanceof Error ? e.message : String(e);
+      console.error(`  Warning: skipped ${path}: ${msg}`);
+    }
+  }

sync.ts wraps the add/modify loop in engine.transaction(), and each
importFromContent inside opens another one. PGLite's
_runExclusiveTransaction is a non-reentrant mutex — the second call
queues on the mutex the first is holding, and the process hangs forever
in ep_poll. Reproduced with a 15-file commit: unpatched hangs, patched
runs in 3.4s. Fix drops the outer wrap; per-file atomicity is correct
anyway (one file's failure should not roll back the others).
garrytan added a commit that referenced this pull request Apr 19, 2026
Reads src/commands/sync.ts verbatim and asserts no uncommented
engine.transaction() call appears above the add/modify loop. Protects
against silent reintroduction of the nested-mutex deadlock that hung
> 10-file syncs forever in ep_poll.
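A guard along those lines can be quite small. The real test lives in test/sync.test.ts; in this sketch the helper name and loop marker are illustrative:

```typescript
// Sketch of a source-level regression guard in the spirit described above:
// strip comments, then assert no engine.transaction( call appears before
// the add/modify loop. Helper name and loop marker are illustrative.
function guardNoOuterTransaction(source: string, loopMarker: string): boolean {
  const stripped = source
    .replace(/\/\*[\s\S]*?\*\//g, "") // drop block comments
    .replace(/\/\/[^\n]*/g, "");      // drop line comments
  const loopAt = stripped.indexOf(loopMarker);
  const head = loopAt === -1 ? stripped : stripped.slice(0, loopAt);
  return !head.includes("engine.transaction(");
}

// A source with the outer wrap fails the guard...
const bad = `await engine.transaction(async () => {
  for (const path of [...filtered.added, ...filtered.modified]) {}
});`;
// ...while a commented-out mention of the wrap passes.
const good = `// engine.transaction() must NOT wrap this loop
for (const path of [...filtered.added, ...filtered.modified]) {}`;

console.log(guardNoOuterTransaction(bad, "for (const path"));  // false
console.log(guardNoOuterTransaction(good, "for (const path")); // true
```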
garrytan added a commit that referenced this pull request Apr 19, 2026
Master shipped v0.12.1 (extract N+1 + migration timeout) and v0.12.2
(JSONB double-encode + splitBody + wiki types + parseEmbedding) while
this wave was mid-flight. Ships the remaining pieces as v0.12.3:

- sync deadlock (#132, @sunnnybala)
- statement_timeout scoping (#158, @garagon)
- Obsidian wikilinks + domain patterns (#187 slice, @knee5)
- gbrain orphans command (#187 slice, @knee5)
- tryParseEmbedding() availability helper
- doctor detection for jsonb_integrity + markdown_body_completeness

No schema, no migration, no data touch.

Co-Authored-By: Claude Opus 4.7 <[email protected]>
garrytan added a commit that referenced this pull request Apr 19, 2026
…kilinks, orphans (#216)

* fix(sync): remove nested transaction that deadlocks > 10 file syncs

sync.ts wraps the add/modify loop in engine.transaction(), and each
importFromContent inside opens another one. PGLite's
_runExclusiveTransaction is a non-reentrant mutex — the second call
queues on the mutex the first is holding, and the process hangs forever
in ep_poll. Reproduced with a 15-file commit: unpatched hangs, patched
runs in 3.4s. Fix drops the outer wrap; per-file atomicity is correct
anyway (one file's failure should not roll back the others).

(cherry picked from commit 4a1ac00)

* test(sync): regression guard for #132 top-level engine.transaction wrap

Reads src/commands/sync.ts verbatim and asserts no uncommented
engine.transaction() call appears above the add/modify loop. Protects
against silent reintroduction of the nested-mutex deadlock that hung
> 10-file syncs forever in ep_poll.

* feat(utils): tryParseEmbedding() skip+warn sibling for availability path

parseEmbedding() throws on structural corruption — right call for ingest/
migrate paths where silent skips would be data loss. Wrong call for
search/rescore paths where one corrupt row in 10K would kill every
query that touches it.

tryParseEmbedding() wraps parseEmbedding in try/catch: returns null on
any shape that would throw, warns once per session so the bad row is
visible in logs. Use it anywhere we'd rather degrade ranking than blow
up the whole query.
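A minimal sketch of that wrapper pattern, with a toy parseEmbedding standing in for the real one (the actual embedding storage format is not shown here; this toy version parses a JSON array of numbers):

```typescript
// Toy parseEmbedding standing in for the real one: throws on structural
// corruption. Here an "embedding" is just a JSON array of numbers.
function parseEmbedding(raw: string): number[] {
  const v = JSON.parse(raw);
  if (!Array.isArray(v) || v.some((x) => typeof x !== "number")) {
    throw new Error("structurally corrupt embedding");
  }
  return v;
}

// The skip+warn sibling: null on anything that would throw, one warning
// per session so the bad row stays visible in logs.
let warnedOnce = false;
function tryParseEmbedding(raw: string): number[] | null {
  try {
    return parseEmbedding(raw);
  } catch (e) {
    if (!warnedOnce) {
      warnedOnce = true;
      console.warn(`tryParseEmbedding: skipping corrupt embedding: ${e}`);
    }
    return null; // degrade ranking instead of killing the whole query
  }
}

console.log(tryParseEmbedding("[0.1, 0.2]")); // [ 0.1, 0.2 ]
console.log(tryParseEmbedding("not json"));   // null
```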

Retrofit postgres-engine.getEmbeddingsByChunkIds (the #175 slice call
site) — the 5-line rescore loop was the direct motivator. Keep the
throwing parseEmbedding() for everything else (pglite-engine rowToChunk,
migrate-engine round-trips, ingest).

* postgres-engine: scope search statement_timeout to the transaction

searchKeyword and searchVector run on a pooled postgres.js client
(max: 10 by default). The original code bounded each search with

  await sql`SET statement_timeout = '8s'`
  try { await sql`<query>` }
  finally { await sql`SET statement_timeout = '0'` }

but every tagged template is an independent round-trip that picks an
arbitrary connection from the pool. The SET, the query, and the reset
could all land on DIFFERENT connections. In practice the GUC sticks
to whichever connection ran the SET and then gets returned to the
pool — the next unrelated caller on that connection inherits the 8s
timeout (clipping legitimate long queries) or the reset-to-0 (disabling
the guard for whoever expected it). A crash in the middle leaves the
state set permanently.

Wrap each search in sql.begin(async sql => …). postgres.js reserves
a single connection for the transaction body, so the SET LOCAL, the
query, and the implicit COMMIT all run on the same connection. SET
LOCAL scopes the GUC to the transaction — COMMIT or ROLLBACK restores
the previous value automatically, regardless of the code path out.
Error paths can no longer leak the GUC.

No API change. Timeout value and semantics are identical (8s cap on
search queries, no effect on embed --all / bulk import which runs
outside these methods). Only one transaction per search — BEGIN +
COMMIT round-trips are negligible next to a ranked FTS or pgvector
query.
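The pool misrouting is easiest to see in a toy model. This is a simulation, not postgres.js — `pick()` stands in for the pool's arbitrary connection assignment:

```typescript
// Toy model of the pool misrouting (a simulation, NOT postgres.js): every
// statement picks an arbitrary pooled connection, so the SET, the query,
// and the reset can each land somewhere different.
type Conn = { id: number; statementTimeout: string };

const pool: Conn[] = [
  { id: 0, statementTimeout: "0" },
  { id: 1, statementTimeout: "0" },
];
let next = 0;
const pick = (): Conn => pool[next++ % pool.length]; // stand-in for pool routing

// Broken pattern: independent round-trips.
pick().statementTimeout = "8s"; // SET lands on conn 0
pick();                          // the search query runs on conn 1 — unguarded
const leaked = pool.filter((c) => c.statementTimeout === "8s").length;
console.log(leaked); // 1 — conn 0 returned to the pool still carrying the GUC

// Fixed pattern: reserve ONE connection for the whole transaction; SET LOCAL
// is undone by COMMIT/ROLLBACK (simulated here by restoring the saved value).
function begin(conn: Conn, fn: (c: Conn) => void): void {
  const saved = conn.statementTimeout;
  try {
    fn(conn); // SET LOCAL + the bounded query, all on this connection
  } finally {
    conn.statementTimeout = saved; // COMMIT or ROLLBACK restores the GUC
  }
}

const reserved: Conn = { id: 2, statementTimeout: "0" };
begin(reserved, (c) => { c.statementTimeout = "8s"; /* ranked query here */ });
console.log(reserved.statementTimeout); // "0" — nothing leaks on any code path
```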

Also closes the earlier audit finding R4-F002 which reported the same
pattern on searchKeyword. This PR covers both searchKeyword and
searchVector so the pool-leak class is fully closed.

Tests (test/postgres-engine.test.ts, new file):
- No bare SET statement_timeout remains after stripping comments.
- searchKeyword and searchVector each wrap their query in sql.begin.
- Both use SET LOCAL.
- Neither explicitly clears the timeout with SET statement_timeout=0.

Source-level guardrails keep the fast unit suite DB-free. Live
Postgres coverage of the search path is in test/e2e/search-quality.test.ts,
which continues to exercise these methods end-to-end against
pgvector when DATABASE_URL is set.

(cherry picked from commit 6146c3b)

* feat(orphans): add gbrain orphans command for finding under-connected pages

Surfaces pages with zero inbound wikilinks. Essential for content
enrichment cycles in KBs with 1000+ pages. By default filters out
auto-generated pages, raw sources, and pseudo-pages where no inbound
links is expected; --include-pseudo to disable.

Supports text (grouped by domain), --json, --count outputs.
Also exposed as find_orphans MCP operation.

Tests cover basic detection, filtering, all output modes.
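The anti-join at the heart of orphan detection can be sketched in memory. Table shapes here are assumptions, not the real gbrain schema:

```typescript
// In-memory sketch of the NOT EXISTS anti-join behind orphan detection.
// Table shapes are assumptions, not the real gbrain schema.
type Page = { id: number; slug: string };
type Link = { fromPageId: number; toPageId: number };

function findOrphans(pages: Page[], links: Link[]): Page[] {
  const hasInbound = new Set(links.map((l) => l.toPageId)); // link targets
  return pages.filter((p) => !hasInbound.has(p.id));        // NOT EXISTS
}

const pages: Page[] = [
  { id: 1, slug: "people/ada" },
  { id: 2, slug: "tech/engine" },
];
const links: Link[] = [{ fromPageId: 1, toPageId: 2 }];

console.log(findOrphans(pages, links).map((p) => p.slug)); // [ "people/ada" ]
```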

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
(cherry picked from commit f50954f)

* feat(extract): support Obsidian wikilinks + wiki-style domain slugs in canonical extractor

extractEntityRefs now recognizes both syntaxes equally:
  [Name](people/slug)      -- upstream original
  [[people/slug|Name]]     -- Obsidian wikilink (new)

Extends DIR_PATTERN to include domain-organized wiki slugs used by
Karpathy-style knowledge bases:
  - entities  (legacy prefix some brains keep during migration)
  - projects  (gbrain canonical, was missing from regex)
  - tech, finance, personal, openclaw (domain-organized wiki roots)

Before this change, a 2,100-page brain with wikilinks throughout extracted
zero auto-links on put_page because the regex only matched markdown-style
[name](path). After: 1,377 new typed edges on a single extract --source db
pass over the same corpus.

Matches the behavior of the extract.ts filesystem walker (which already
handled wikilinks as of the wiki-markdown-compat fix wave), so the db and
fs sources now produce the same link graph from the same content.

Both patterns share the DIR_PATTERN constant so adding a new entity dir
only requires updating one string.
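A hedged sketch of the dual-syntax extraction — the exact production regexes and directory list may differ; this only illustrates the shared DIR_PATTERN approach:

```typescript
// Both syntaxes share one directory alternation, so adding an entity dir
// means touching a single string — mirroring the DIR_PATTERN design above.
// The exact production regexes may differ; this is an illustration.
const DIR_PATTERN =
  "(?:people|entities|projects|tech|finance|personal|openclaw)";

// [Name](people/slug) — markdown style
const MD_LINK = new RegExp(`\\[([^\\]]+)\\]\\((${DIR_PATTERN}/[\\w-]+)\\)`, "g");
// [[people/slug|Name]] — Obsidian wikilink
const WIKILINK = new RegExp(`\\[\\[(${DIR_PATTERN}/[\\w-]+)\\|([^\\]]+)\\]\\]`, "g");

function extractEntityRefs(body: string): Array<{ name: string; slug: string }> {
  const refs: Array<{ name: string; slug: string }> = [];
  for (const m of body.matchAll(MD_LINK)) refs.push({ name: m[1], slug: m[2] });
  for (const m of body.matchAll(WIKILINK)) refs.push({ name: m[2], slug: m[1] });
  return refs;
}

const body =
  "Met [Ada](people/ada-lovelace) about [[projects/engine|the Engine]].";
console.log(extractEntityRefs(body));
// [ { name: "Ada", slug: "people/ada-lovelace" },
//   { name: "the Engine", slug: "projects/engine" } ]
```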

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
(cherry picked from commit 1cfb156)

* feat(doctor): jsonb_integrity + markdown_body_completeness detection

Add two v0.12.1-era reliability checks to `gbrain doctor`:

- `jsonb_integrity` scans the 4 known write sites from the v0.12.0
  double-encode bug (pages.frontmatter, raw_data.data,
  ingest_log.pages_updated, files.metadata) and reports rows where
  jsonb_typeof(col) = 'string'. The fix hint points at
  `gbrain repair-jsonb` (the standalone repair command shipped in
  v0.12.1).

- `markdown_body_completeness` flags pages whose compiled_truth is
  <30% of the raw source content length when raw has multiple H2/H3
  boundaries. Heuristic only; suggests `gbrain sync --force` or
  `gbrain import --force <slug>`.
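A minimal sketch of that heuristic, using the thresholds quoted above (the function shape is an assumption):

```typescript
// Sketch of the completeness heuristic: flag when the compiled body is
// under 30% of the raw length AND the raw source has at least two H2/H3
// boundaries. Thresholds per the description; shape is an assumption.
function looksTruncated(raw: string, compiled: string): boolean {
  const boundaries = raw.match(/^#{2,3}\s/gm)?.length ?? 0;
  return boundaries >= 2 && compiled.length < raw.length * 0.3;
}

const raw =
  "## Intro\ntext\n## Details\nlots of detail here\n### Sub\nmore";
console.log(looksTruncated(raw, "## Intro\ntext")); // true — suspiciously short
console.log(looksTruncated(raw, raw));              // false — nothing missing
```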

Also adds test/e2e/jsonb-roundtrip.test.ts — the regression coverage
that should have caught the original double-encode bug. Hits all four
write sites against real Postgres and asserts jsonb_typeof='object'
plus `->>'key'` returns the expected scalar.

Detection only: doctor diagnoses, `gbrain repair-jsonb` treats.
No overlap with the standalone repair path.

* chore: bump to v0.12.3 + changelog (reliability wave)

Master shipped v0.12.1 (extract N+1 + migration timeout) and v0.12.2
(JSONB double-encode + splitBody + wiki types + parseEmbedding) while
this wave was mid-flight. Ships the remaining pieces as v0.12.3:

- sync deadlock (#132, @sunnnybala)
- statement_timeout scoping (#158, @garagon)
- Obsidian wikilinks + domain patterns (#187 slice, @knee5)
- gbrain orphans command (#187 slice, @knee5)
- tryParseEmbedding() availability helper
- doctor detection for jsonb_integrity + markdown_body_completeness

No schema, no migration, no data touch.

Co-Authored-By: Claude Opus 4.7 <[email protected]>

* docs: update project documentation for v0.12.3

CLAUDE.md:
- Add src/commands/orphans.ts entry
- Expand src/commands/doctor.ts with v0.12.3 jsonb_integrity +
  markdown_body_completeness check descriptions
- Update src/core/link-extraction.ts to mention Obsidian wikilinks +
  extended DIR_PATTERN (entities/projects/tech/finance/personal/openclaw)
- Update src/core/utils.ts to mention tryParseEmbedding sibling
- Update src/core/postgres-engine.ts to note statement_timeout scoping +
  tryParseEmbedding usage in getEmbeddingsByChunkIds
- Add Key commands added in v0.12.3 section (orphans, doctor checks)
- Add test/orphans.test.ts, test/postgres-engine.test.ts, updated
  descriptions for test/sync.test.ts, test/doctor.test.ts,
  test/utils.test.ts
- Add test/e2e/jsonb-roundtrip.test.ts with note on intentional overlap
- Bump operation count from ~36 to ~41 (find_orphans shipped in v0.12.3)

README.md:
- Add gbrain orphans to ADMIN commands block

Co-Authored-By: Claude Opus 4.7 <[email protected]>

---------

Co-authored-by: sunnnybala <[email protected]>
Co-authored-by: Gustavo Aragon <[email protected]>
Co-authored-by: Clevin Canales <[email protected]>
Co-authored-by: Claude Sonnet 4.6 <[email protected]>
Co-authored-by: Clevin Canales <[email protected]>
garrytan added a commit that referenced this pull request Apr 20, 2026
Extract, import, and sync now stream per-file progress to stderr through
the shared reporter. All three kept their stdout summaries + JSON
action-events intact so existing tests + agent scripts are unaffected.

- extract.ts (4 paths: links/timeline × fs/db): replaced the ad-hoc
  `process.stderr.write({event:"progress"...})` lines with reporter
  ticks. Same channel (stderr), canonical schema now, visible in both
  text and --json modes. Stdout action-events (`add_link` /
  `add_timeline`) untouched — tests grep them.
- import.ts: the logProgress() function that printed every 100 files to
  stdout is now a progress.tick() call per file. Rate-gated by the
  reporter. Stdout still gets the final "Import complete (Xs)" summary
  and the --json payload.
- sync.ts: three new phases (`sync.deletes`, `sync.renames`,
  `sync.imports`) tick per file, so big syncs show each step rather than
  a single end-of-run summary. Phase hierarchy ready to be child()-chained
  into runImport / runEmbed later, per Codex review #26.

Updated the #132 nested-transaction regression test in test/sync.test.ts
to also accept the new hoisted-loop shape — the guarantee (this loop is
not wrapped in engine.transaction) still holds.

1686 unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
garrytan added a commit that referenced this pull request Apr 22, 2026
…t-visible heartbeats) (#293)

* feat(progress): step 1 - shared ProgressReporter + CliOptions

Adds the foundation for v0.14.2's bulk-action progress streaming work:

- src/core/progress.ts: dependency-free reporter with auto/human/json/quiet
  modes, TTY-aware rendering, time+item rate gating, heartbeat helper for
  slow single queries, dot-composed child phases, EPIPE defense (both sync
  throw and async 'error' event), and a singleton module-level signal
  coordinator so SIGINT/SIGTERM emits abort events for all live phases
  without leaking per-instance listeners.

- src/core/cli-options.ts: parseGlobalFlags() for --quiet /
  --progress-json / --progress-interval=<ms> (both space and = forms),
  plus cliOptsToProgressOptions() that resolves to the right mode. Non-TTY
  default is human-plain one-line-per-event; JSON is explicit opt-in so
  shell pipelines don't suddenly see structured noise.

- test/progress.test.ts (17 cases): mode resolution, rate gating, no-fake-
  totals on heartbeat paths, EPIPE paths, SIGINT singleton, child phase
  composition.

- test/cli-options.test.ts (14 cases): flag parsing, invalid values,
  interleaved flags, mode resolution.

Follow-ups wire doctor/embed/files/export/extract/import/sync/migrate/
repair-jsonb/backlinks/orphans/lint/integrity/eval/autopilot/jobs plus
the apply-migrations orchestrators through this reporter, and route
Minion handlers to job.updateProgress instead of stderr. See the plan
in ~/.claude/plans/.

1682 unit tests pass.
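The time-based rate gating mentioned above boils down to a small gate. A sketch with an injected clock so the behavior is deterministic — names are illustrative, not the real ProgressReporter API:

```typescript
// Sketch of time-based rate gating with an injected clock. Emits at most
// one tick per interval; everything in between is swallowed.
class TickGate {
  private last = -Infinity;
  constructor(
    private intervalMs: number,
    private now: () => number = Date.now,
  ) {}
  shouldEmit(): boolean {
    const t = this.now();
    if (t - this.last < this.intervalMs) return false; // too soon: swallow
    this.last = t;
    return true; // enough time elapsed: emit this tick
  }
}

// Simulated ticks at 0, 50, 120, 130, and 260 ms with a 100 ms gate.
const times = [0, 50, 120, 130, 260];
let i = 0;
const gate = new TickGate(100, () => times[i++]);

const emitted: boolean[] = [];
for (let k = 0; k < times.length; k++) emitted.push(gate.shouldEmit());
console.log(emitted); // [ true, false, true, false, true ]
```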

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* feat(progress): step 2 - wire global flags into cli.ts

Parse --quiet / --progress-json / --progress-interval from argv BEFORE
command dispatch, strip them, stash resolved CliOptions on a module-level
singleton (same pattern as Commander's program.opts()) and on every
OperationContext created for shared-op dispatch.

- src/cli.ts: parseGlobalFlags(rawArgs) at the top of main(); setCliOptions
  once; dispatch sees only the stripped argv. Fixes the "gbrain
  --progress-json doctor" unknown-command case that Codex flagged.
- src/core/cli-options.ts: expose setCliOptions/getCliOptions/
  _resetCliOptionsForTest singleton. Commands that want progress call
  getCliOptions() to construct their reporter.
- src/core/operations.ts: OperationContext gains optional cliOpts field
  so shared-op handlers (and MCP-invoked ops that need a reporter) can
  read the same settings. MCP callers leave it undefined and consumers
  default to quiet.
- test/cli-options.test.ts: +4 cases covering singleton round-trip and
  an integration smoke spawning `bun src/cli.ts --progress-json --version`
  to prove the global flag survives dispatch.

45 relevant unit tests pass (progress + cli-options + cli.test.ts).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* feat(progress): step 3a - doctor + orphans heartbeat streaming

Doctor on a 52K-page brain used to sit silent for 10+ minutes while the
DB checks ran, then get killed by an agent timeout. Wired through the
new reporter so agents see which check is running and the slow ones
heartbeat every second.

doctor.ts:
- Start a single `doctor.db_checks` phase around the DB section, with a
  per-check heartbeat before each step (connection, pgvector, rls,
  schema_version, embeddings, graph_coverage, integrity, jsonb_integrity,
  markdown_body_completeness).
- jsonb_integrity now scans 5 targets, not 4: added page_versions.
  frontmatter so the check surface matches `repair-jsonb` (per Codex
  review of the plan — the old 4-target scan missed a known repair site).
  Per-target heartbeat so 50K-row scans show incremental progress.
- markdown_body_completeness: wrap the existing query in a 1s heartbeat
  timer. The regex scan over rd.data ->> 'content' can't be paginated
  usefully; this just lets agents see life during the sequential scan.
  No fake totals — the LIMIT 100 query has no meaningful total count.
- integrity sample: same heartbeat pattern around the 500-page scan.

orphans.ts:
- findOrphans() wraps the NOT EXISTS anti-join in a 1s heartbeat.
  Keyset pagination was considered and rejected: without an index on
  links.to_page_id it's no faster than the full scan, and may re-plan
  the anti-join per batch. A schema migration adding that index is the
  right fix and is queued for v0.14.3.

Follow-ups:
- Step 3b: wire embed/files/export (the \r-only stdout offenders).
- Step 5: end-to-end progress test spawning `gbrain doctor --progress-json`
  against a fixture brain, asserting stderr events and clean stdout.

All existing unit tests continue to pass (76/76 in doctor + orphans +
progress + cli-options).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* feat(progress): step 3b - embed + files + export stderr progress

Replaces the \r-on-stdout progress pattern in the three worst offenders
(embed, files sync, export) with the shared reporter on stderr. Stdout
now carries only final summaries, so scripts and tests that grep for
counts ("Embedded N chunks", "Files sync complete", "Exported N pages")
still work when output is piped.

- embed.ts: runEmbedCore accepts an optional onProgress callback. The
  CLI wrapper builds a reporter and passes reporter.tick(); Minion
  handlers will pass job.updateProgress in Step 4. Worker-pool is
  single-threaded JS so no rate-gate race (per Codex review #18).
- files.ts syncFiles(): tick per file; summary preserved on stdout.
- export.ts: tick per page; summary preserved on stdout.

Also fixes a --quiet flag collision. `skillpack-check` has its own
--quiet mode (suppress all stdout). parseGlobalFlags strips --quiet
globally now, and skillpack-check reads the resolved CliOptions
singleton via getCliOptions() instead of re-parsing argv. Test updated
to match the stripping behavior.

1686 unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* feat(progress): step 3c - extract + import + sync reporter streaming

Extract, import, and sync now stream per-file progress to stderr through
the shared reporter. All three kept their stdout summaries + JSON
action-events intact so existing tests + agent scripts are unaffected.

- extract.ts (4 paths: links/timeline × fs/db): replaced the ad-hoc
  `process.stderr.write({event:"progress"...})` lines with reporter
  ticks. Same channel (stderr), canonical schema now, visible in both
  text and --json modes. Stdout action-events (`add_link` /
  `add_timeline`) untouched — tests grep them.
- import.ts: the logProgress() function that printed every 100 files to
  stdout is now a progress.tick() call per file. Rate-gated by the
  reporter. Stdout still gets the final "Import complete (Xs)" summary
  and the --json payload.
- sync.ts: three new phases (`sync.deletes`, `sync.renames`,
  `sync.imports`) tick per file, so big syncs show each step rather than
  a single end-of-run summary. Phase hierarchy ready to be child()-chained
  into runImport / runEmbed later, per Codex review #26.

Updated the #132 nested-transaction regression test in test/sync.test.ts
to also accept the new hoisted-loop shape — the guarantee (this loop is
not wrapped in engine.transaction) still holds.

1686 unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* feat(progress): step 3d - migrate/repair/backlinks/lint/integrity/eval

Wires the remaining bulk commands through the reporter:

- migrate-engine: phase starts (migrate.copy_pages, migrate.copy_links),
  per-page tick. Old "Progress: N/total" stdout logs replaced by
  stderr ticks; final stdout summary preserved.
- repair-jsonb: per-column start + a heartbeat timer while each UPDATE
  runs (minutes on 50K-row tables). CRITICAL: stdout stays clean so
  migrations/v0_12_2.ts's JSON.parse(child.stdout) still works. Per
  Codex review #12.
- backlinks: 1s heartbeat around findBacklinkGaps() (sync double-walk
  of the brain dir).
- lint: tick per page; per-issue stdout output preserved.
- integrity auto: tick per page in the main resolver loop. The separate
  ~/.gbrain/integrity-progress.jsonl resume marker is untouched (its
  role shifts from live progress reporting to resume-only).
- eval: add an onProgress option to core's runEval(), CLI wraps with a
  reporter. Phases: eval.single / eval.ab. Tick per query.

core/search/eval.ts gains a RunEvalOptions type so future callers (MCP
eval op, Minion handlers) can also hook in without the reporter.

1686 unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* feat(progress): step 3e - onProgress callbacks on core libs

- src/core/embedding.ts: embedBatch() gains an optional
  EmbedBatchOptions.onBatchComplete callback, fired after each 100-item
  sub-batch. CLI wrappers pass reporter.tick; Minion handlers can pass
  job.updateProgress.
- src/core/enrichment-service.ts: enrichEntities() config gains
  onProgress(done, total, name) fired after each entity. Same split:
  CLI -> reporter, Minion -> DB-backed progress.

No CLI behavior change on its own. Wiring these callbacks into the
Minion handlers is Step 4.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* feat(progress): step 4 - orchestrators + upgrade + minion handlers

- cli-options.ts: childGlobalFlags() returns the flag suffix to append
  to child gbrain subprocesses. Empty string by default, " --quiet
  --progress-json" when the parent has them set, so child behavior
  inherits the parent's progress-mode without scattering string-concat
  logic across every execSync site.

- migrations/v0_12_2.ts: each execSync inherits the parent's global
  flags. Phase C (repair-jsonb --dry-run --json) pins explicit stdio to
  ['ignore','pipe','inherit'] so child stderr streams straight through
  while stdout stays captured for JSON.parse. Per Codex review #12.
- migrations/v0_12_0.ts + v0_11_0.ts: same childGlobalFlags wiring at
  each gbrain-subcommand execSync.

- upgrade.ts: post-upgrade timeout bumped 300s → 30min (1_800_000 ms)
  with GBRAIN_POST_UPGRADE_TIMEOUT_MS override. The old 300s cap killed
  v0.12.0 graph-backfill migrations on 50K+ brains; the heartbeat
  wiring added in v0.14.2 makes long waits observable, so a generous
  ceiling no longer means users stare at a silent terminal.

- jobs.ts: the embed Minion handler passes job.updateProgress as the
  onProgress callback, so per-job progress is durable in minion_jobs
  and readable via `gbrain jobs get <id>`. Primary Minion progress
  channel is DB-backed — stderr from `jobs work` stays coarse for
  daemon liveness only. Per Codex review #20.
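The childGlobalFlags() behavior described at the top of this commit can be sketched as follows — option names are taken from the commit text, but the real CliOptions type is richer, so this shape is an assumption:

```typescript
// Sketch of childGlobalFlags(): empty string by default, " --quiet
// --progress-json" when the parent has both set, so child gbrain
// subprocesses inherit the parent's progress mode.
type CliOptions = { quiet: boolean; progressJson: boolean };

function childGlobalFlags(opts: CliOptions): string {
  let flags = "";
  if (opts.quiet) flags += " --quiet";
  if (opts.progressJson) flags += " --progress-json";
  return flags;
}

// Usage at an execSync site: append the suffix so children inherit the mode.
const opts: CliOptions = { quiet: true, progressJson: true };
const cmd = `gbrain repair-jsonb --dry-run --json${childGlobalFlags(opts)}`;
console.log(cmd); // gbrain repair-jsonb --dry-run --json --quiet --progress-json
console.log(JSON.stringify(childGlobalFlags({ quiet: false, progressJson: false }))); // ""
```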

1686 unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* feat(progress): step 5 - E2E doctor-progress test + CI guard

scripts/check-progress-to-stdout.sh greps src/ for the banned
`process.stdout.write('\r…')` pattern that v0.14.2 removed from the
bulk-action codepaths. Wired into the `bun run test` script so any
future regression that puts progress back on stdout fails fast. An
empty allowlist documents the position: every known call site was
migrated; new exceptions need a rationale in the allowlist.

test/e2e/doctor-progress.test.ts (Tier 1, needs Postgres + pgvector):
- `gbrain --progress-json doctor --json`: stderr carries JSONL progress
  events with the canonical {event, phase, ts} shape, starts + finishes
  for `doctor.db_checks`. Stdout stays parseable JSON — no progress
  pollution.
- `gbrain doctor` (no flag): human-plain progress goes to stderr only,
  stdout stays free of `[doctor.db_checks]`.
- `gbrain --quiet doctor`: reporter emits nothing; doctor still runs to
  completion.

test/cli-options.test.ts: +2 spawning integration tests. One verifies
`gbrain --progress-json --version` keeps stdout clean of progress events
(single-shot commands that don't use a reporter aren't affected). One
guards the skillpack-check --quiet regression — --quiet suppresses
stdout by reading the resolved CliOptions singleton, not re-parsing argv.

Full test matrix:
  bun run test           -> 1726 pass / 184 skipped (no DB) / 0 fail
  bun run test:e2e       -> 136 pass / 13 skipped / 0 fail

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* feat(progress): step 6 - docs + v0.14.2 release bump

- VERSION + package.json bumped to 0.14.2.
- docs/progress-events.md (new): canonical JSON event schema reference.
  Stable from v0.14.2, additive only. Lists every phase name shipped
  in this release, the five event types (start/tick/heartbeat/finish/
  abort), the TTY/non-TTY rendering rules, subprocess inheritance
  semantics, and the Minion DB-backed progress model.
- CLAUDE.md: "Bulk-action progress reporting" section under the build
  instructions; Key files entries for src/core/progress.ts,
  src/core/cli-options.ts, scripts/check-progress-to-stdout.sh, and
  docs/progress-events.md; doctor.ts entry updated to note the v0.14.2
  5-target jsonb_integrity scan + heartbeat wiring.
- CHANGELOG.md v0.14.2: full release summary per project voice rules.
  The "numbers that matter" table, per-command before/after grid,
  backward-compat warnings for stdout→stderr moves, and an itemized
  changes section covering reporter/CLI plumbing/schema/Minion
  handlers/doctor fixes/upgrade timeout/CI guard/tests. No em dashes.
  Real file paths, real commands, real numbers.
- skills/migrations/v0.14.2.md (new): agent migration note. Mechanical
  step is "nothing" since v0.14.2 is purely additive. Walks agents
  through the three new global flags, the 14 wired commands, the event
  schema cheat sheet, Minion progress via job.updateProgress, and
  scripts/verification commands.

Full test matrix:
  bun run test (unit + guards) -> 1726 pass / 184 skipped / 0 fail
  bun run test:e2e (Postgres)  -> 141 pass / 8 skipped / 0 fail

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* chore: bump version to 0.15.2, restore master's [0.14.2] CHANGELOG entry

Master sits at 0.14.2 (reliability wave). This PR lands on top as 0.15.2
(progress streaming wave). Splits the merge-time combined CHANGELOG entry
back into two discrete release sections so history stays honest:

- [0.15.2] = progress reporter, CliOptions, 14 wired commands, Minion
  embed handler, doctor jsonb_integrity 5-target fix, upgrade timeout bump,
  CI guard, progress unit+E2E tests.
- [0.14.2] = master's eight root-cause bug fixes, restored verbatim from
  origin/master.

Touched files:
- VERSION + package.json: 0.14.2 -> 0.15.2 (next patch off master).
- skills/migrations/v0.14.2.md -> skills/migrations/v0.15.2.md (rename
  + rewrite frontmatter + body to v0.15.2).
- CHANGELOG.md: split into two entries; progress-wave refs renamed
  v0.14.2 -> v0.15.2; reliability-wave entry restored from master.
- src/core/progress.ts, src/commands/doctor.ts, src/commands/sync.ts,
  src/commands/upgrade.ts, docs/progress-events.md, test/sync.test.ts:
  progress-wave v0.14.2 references -> v0.15.2. The remaining v0.14.2
  references in test/e2e/migration-flow.test.ts (Bug 3 context) and
  CLAUDE.md (reliability-wave key commands, Bug 3 ledger move) correctly
  point at master's 0.14.2 release.

Test matrix after version bump:
  bun run test -> 1780 pass / 179 skipped / 0 fail

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>