
fix: JSONB double-encode + splitBody wiki + parseEmbedding (v0.12.1)#196

Merged
garrytan merged 7 commits into master from garrytan/jsonb-hotfix on Apr 18, 2026


@garrytan
Owner

Summary

Data-correctness hotfix for v0.12.0 Postgres-backed brains. PGLite users were unaffected. Bundles community PRs #187 (@knee5) and #175 (@leonardsellem) with expanded migration scope, schema audit (5 affected JSONB columns vs 3 originally reported), CI grep guard, and an E2E regression test that should have caught the original bug.

JSONB double-encode (Postgres only). Every ${JSON.stringify(value)}::jsonb interpolation in postgres-engine.ts and files.ts caused postgres.js v3 to stringify again on the wire, storing JSONB columns as quoted string literals. Every frontmatter->>'key' returned NULL on Postgres-backed brains. GIN indexes were inert. Switched to sql.json(value) (postgres.js native, OID 3802). Affected columns: pages.frontmatter, raw_data.data, ingest_log.pages_updated, files.metadata, page_versions.frontmatter. PGLite hid this bug entirely — different driver path.
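The effect of the double encode can be reproduced without a database. The sketch below is illustrative only: `jsonTypeOf` and `unwrapOnce` are hypothetical helpers that stand in for Postgres's `jsonb_typeof` and the `(col #>> '{}')::jsonb` repair expression, and the nested `JSON.stringify` models the second encoding applied to an already-serialized string.

```typescript
type JsonType = "object" | "array" | "string" | "number" | "boolean" | "null";

// Hypothetical stand-in for jsonb_typeof(): classify the top-level JSON value.
function jsonTypeOf(jsonText: string): JsonType {
  const value: unknown = JSON.parse(jsonText);
  if (value === null) return "null";
  if (Array.isArray(value)) return "array";
  return typeof value as JsonType;
}

// Hypothetical stand-in for (col #>> '{}')::jsonb: strip one string layer if present.
function unwrapOnce(jsonText: string): string {
  const value: unknown = JSON.parse(jsonText);
  return typeof value === "string" ? value : jsonText;
}

const frontmatter = { title: "Hello", tags: ["a", "b"] };

// Serialized exactly once: the stored value is a JSON object.
const encodedOnce = JSON.stringify(frontmatter);

// Serialized twice: the stored value is a JSON *string* that merely contains JSON text,
// so key lookups such as frontmatter->>'title' have nothing to match.
const encodedTwice = JSON.stringify(JSON.stringify(frontmatter));

console.log(jsonTypeOf(encodedOnce));
console.log(jsonTypeOf(encodedTwice));
console.log(jsonTypeOf(unwrapOnce(encodedTwice)));
```

Unwrapping an already-correct value returns it unchanged, which is the property that makes a repair pass safe to re-run.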

splitBody truncation. Treated any standalone --- as timeline separator, causing 83% content truncation on wiki corpora (1,991-article wiki, 4,856 of 6,680 wikilinks lost). New behavior requires explicit sentinel: <!-- timeline -->, --- timeline ---, or --- directly before ## Timeline / ## History heading.
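A minimal sketch of the sentinel rule, not the code in markdown.ts: `splitBodySketch` and its return shape are hypothetical, and it assumes blank lines may sit between a `---` and the following `## Timeline` / `## History` heading.

```typescript
interface SplitResultSketch {
  body: string;
  timeline: string;
}

// Split only on an explicit timeline sentinel; a bare --- stays in the body.
function splitBodySketch(text: string): SplitResultSketch {
  const lines = text.split("\n");
  for (let i = 0; i < lines.length; i++) {
    const line = (lines[i] ?? "").trim();
    const isExplicit = line === "<!-- timeline -->" || line === "--- timeline ---";

    let isRuleBeforeHeading = false;
    if (line === "---") {
      // Assumption: skip blank lines, then require a Timeline/History heading.
      let j = i + 1;
      while (j < lines.length && (lines[j] ?? "").trim() === "") j++;
      isRuleBeforeHeading =
        j < lines.length && /^##\s+(Timeline|History)\b/.test((lines[j] ?? "").trim());
    }

    if (isExplicit || isRuleBeforeHeading) {
      return {
        body: lines.slice(0, i).join("\n").trim(),
        timeline: lines.slice(i + 1).join("\n").trim(),
      };
    }
  }
  // No sentinel found: everything is body, including horizontal rules.
  return { body: text.trim(), timeline: "" };
}
```

With this rule, a wiki article that uses `---` as a horizontal rule keeps its full content, while a page that opts in with a sentinel still splits.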

inferType wiki subtypes. Added /writing/, /wiki/analysis/, /wiki/guides/, /wiki/hardware/, /wiki/architecture/, /wiki/concepts/. Path order is most-specific-first so projects/blog/writing/essay.md → writing, not project.
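The ordering rule can be sketched as a first-match-wins table. This is an illustration, not the inferType implementation: the `concept` and `project` type names, the `/projects/` rule, and the `page` fallback are assumptions made for the example.

```typescript
type PageTypeSketch =
  | "writing" | "analysis" | "guide" | "hardware"
  | "architecture" | "concept" | "project" | "page";

// Most-specific-first: the first fragment found in the path decides the type.
const TYPE_RULES_SKETCH: ReadonlyArray<readonly [string, PageTypeSketch]> = [
  ["/wiki/analysis/", "analysis"],
  ["/wiki/guides/", "guide"],
  ["/wiki/hardware/", "hardware"],
  ["/wiki/architecture/", "architecture"],
  ["/wiki/concepts/", "concept"], // assumed type name
  ["/writing/", "writing"],
  ["/projects/", "project"], // assumed broader, pre-existing rule
];

function inferTypeSketch(path: string): PageTypeSketch {
  // Normalize to a single leading slash so top-level directories can match too.
  const normalized = "/" + path.replace(/^\/+/, "");
  for (const [fragment, type] of TYPE_RULES_SKETCH) {
    if (normalized.includes(fragment)) return type;
  }
  return "page"; // hypothetical fallback
}
```

Because `/writing/` is checked before `/projects/`, a path that contains both resolves to the narrower type.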

pgvector NaN scores (Supabase). getEmbeddingsByChunkIds returned strings instead of Float32Array on Supabase, producing [NaN] cosine scores. Adds parseEmbedding() helper. Throws loud on malformed vectors (no silent NaN); returns null for non-vector strings.
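A sketch of the contract described above, assuming the pgvector text form is a bracketed, comma-separated list. `parseEmbeddingSketch` and `toFloat32` are hypothetical names; the helper in utils.ts may differ in detail.

```typescript
// Convert raw components to a Float32Array, rejecting anything non-finite.
function toFloat32(parts: ReadonlyArray<unknown>): Float32Array {
  const out = new Float32Array(parts.length);
  parts.forEach((part, index) => {
    const text = String(part).trim();
    const n = text === "" ? NaN : Number(text);
    if (!Number.isFinite(n)) {
      // Fail loudly instead of letting NaN flow into cosine scores.
      throw new Error(`Malformed embedding component at index ${index}: "${text}"`);
    }
    out[index] = n;
  });
  return out;
}

function parseEmbeddingSketch(value: unknown): Float32Array | null {
  if (value === null || value === undefined) return null;
  if (value instanceof Float32Array) return value; // passthrough
  if (Array.isArray(value)) return toFloat32(value);
  if (typeof value === "string") {
    const trimmed = value.trim();
    // A string that is not bracketed is treated as "no embedding here".
    if (!(trimmed.startsWith("[") && trimmed.endsWith("]"))) return null;
    return toFloat32(trimmed.slice(1, -1).split(","));
  }
  throw new TypeError(`Unsupported embedding value of type ${typeof value}`);
}
```

The split between `null` and a thrown error is the design point: a missing embedding is an expected state, while a vector-shaped string with unparseable components signals corrupted data.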

Wikilink extraction. [[page]] and [[page|Display]] syntaxes now extracted alongside standard [text](page.md). resolveSlug() does ancestor-search for wiki KBs that omit ../.
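The two wikilink forms can be matched with a single pattern. This is a sketch under the assumption that targets and display text contain no brackets; `extractWikilinksSketch` is a hypothetical name, not the function in extract.ts.

```typescript
interface WikiLinkSketch {
  target: string;
  display: string;
}

// Matches [[page]] and [[page|Display]]; the display part is optional.
function extractWikilinksSketch(markdown: string): WikiLinkSketch[] {
  const pattern = /\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]/g;
  const links: WikiLinkSketch[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(markdown)) !== null) {
    const target = (match[1] ?? "").trim();
    // Without an explicit display text, fall back to the target itself.
    const display = (match[2] ?? target).trim();
    links.push({ target, display });
  }
  return links;
}
```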

Migration. New gbrain repair-jsonb command + v0_12_1 orchestrator (schema → repair → verify → record). Idempotent — re-running is a no-op. PGLite engines short-circuit cleanly.
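The phase sequence and dry-run behavior can be pictured as a small runner. This is a sketch, not the v0_12_1 orchestrator: the `PhaseSketch` shape and the `hasSideEffects` flag are assumptions, since the PR does not say which phases count as side-effecting.

```typescript
type PhaseName = "schema" | "repair" | "verify" | "record";

interface PhaseSketch {
  name: PhaseName;
  hasSideEffects: boolean; // assumption: set per phase by the orchestrator author
  run: () => void;
}

// Run phases in order; in dry-run mode, skip every phase flagged as side-effecting.
function runMigrationSketch(phases: ReadonlyArray<PhaseSketch>, dryRun: boolean): PhaseName[] {
  const executed: PhaseName[] = [];
  for (const phase of phases) {
    if (dryRun && phase.hasSideEffects) continue;
    phase.run();
    executed.push(phase.name);
  }
  return executed;
}

// Example wiring with no-op phase bodies, all flagged as side-effecting.
const examplePhases: PhaseSketch[] = (["schema", "repair", "verify", "record"] as PhaseName[]).map(
  (name) => ({ name, hasSideEffects: true, run: () => {} }),
);

console.log(runMigrationSketch(examplePhases, false).join(" → "));
console.log(runMigrationSketch(examplePhases, true).length);
```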

CI grep guard at scripts/check-jsonb-pattern.sh fails the build if anyone reintroduces the ${JSON.stringify(x)}::jsonb pattern.
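The real guard is a shell script; the TypeScript below only sketches the kind of single-line match it performs. The regular expression and the `findDoubleEncodeLines` name are assumptions for illustration, and like any line-based check it misses multi-line or helper-wrapped variants.

```typescript
// Matches `${JSON.stringify(...)}::jsonb` when it appears on a single line.
const DOUBLE_ENCODE_PATTERN = /\$\{\s*JSON\.stringify\([^\n]*?\)\s*\}\s*::\s*jsonb/;

// Return the 1-based line numbers that contain the banned interpolation.
function findDoubleEncodeLines(source: string): number[] {
  const hits: number[] = [];
  source.split("\n").forEach((line, index) => {
    if (DOUBLE_ENCODE_PATTERN.test(line)) hits.push(index + 1);
  });
  return hits;
}
```

A build step could fail whenever this returns a non-empty list for any file under src/.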

Test Coverage

[+] src/core/markdown.ts — splitBody/serializeMarkdown/inferType
    └── [★★★] markdown.test.ts: 10 splitBody + 5 round-trip + wiki/writing inferType cases
[+] src/core/postgres-engine.ts — sql.json() at putPage/putRawData/logIngest
    └── [★★★] e2e/postgres-jsonb.test.ts: 4 round-trip assertions on real Postgres
[+] src/core/utils.ts — parseEmbedding helper, rowToChunk delegation
    └── [★★★] utils.test.ts: F32A passthrough, pgvector string, null, throw on garbage
[+] src/commands/extract.ts — wikilink extraction, resolveSlug ancestor search
    └── [★★★] extract.test.ts existing coverage
[+] src/commands/files.ts:254 — sql.json metadata
    └── [★★★] e2e/postgres-jsonb.test.ts: files.metadata round-trip
[+] src/commands/migrations/v0_12_1.ts — JSONB repair orchestrator
    └── [★★★] migrations-v0_12_1.test.ts: registry, dry-run, phase exports
[+] src/commands/repair-jsonb.ts — repair command + PGLite short-circuit
    └── [★★★] repair-jsonb.test.ts (PGLite no-op) + e2e/postgres-jsonb.test.ts (real repair)
[+] scripts/check-jsonb-pattern.sh — CI grep guard
    └── [★★] Manual + wired into bun test

Coverage: 8/8 paths (100%). Tests: 1361 → 1415 (+54 new).

E2E suite: 120/120 pass against pgvector/pg16 Docker container. Unit suite: 1415/1415 pass. CI grep guard passes on this diff (no JSON.stringify(x)::jsonb patterns in src/).

Pre-Landing Review

No new issues found. Specialist review was already covered comprehensively by /plan-eng-review and the Codex outside-voice review during planning (25+ findings, 3 material tensions adjudicated). repair-jsonb.ts uses sql.unsafe with table/column names from a hardcoded TARGETS array, so there is no injection vector. Migration is idempotent. parseEmbedding throws loud on malformed input per Codex's no-silent-NaN requirement.

Plan Completion

All 24 planned items DONE. Scope reduced from 9-PR bundle to 2-PR hotfix per Codex outside-voice scope challenge. The remaining 7 PRs (#184, #177, #132, #114, #115, #119, #123) deferred to v0.12.2 follow-up wave per /Users/garrytan/.claude/plans/system-instruction-you-are-working-elegant-squid.md.

TODOS

No items in TODOS.md were specifically completed by this PR (it focused on BrainBench eval work).

Documentation

Documentation was synced to v0.12.1 in commit 998ef82. Six files updated to reflect the JSONB hotfix, splitBody sentinel contract, wiki inferType, and native wikilink extraction.

  • README.md — added gbrain repair-jsonb [--dry-run] to the ADMIN command reference
  • CLAUDE.md — registered new files, documented splitBody sentinel precedence and inferType wiki subtypes
  • INSTALL_FOR_AGENTS.md — fixed stale verification check counts, added v0.12.1 upgrade guidance
  • docs/GBRAIN_VERIFY.md — added check #8 (JSONB Frontmatter Integrity)
  • docs/UPGRADING_DOWNSTREAM_AGENTS.md — added v0.12.1 hotfix section explaining the splitBody contract change
  • skills/migrate/SKILL.md — documented native [[wikilink]] and [[wikilink|Display]] extraction

CHANGELOG.md was left untouched (already comprehensive). VERSION bumped to 0.12.1.

Test plan

  • All unit tests pass (1322/1322)
  • All E2E tests pass against real pgvector (120/120, including new test/e2e/postgres-jsonb.test.ts)
  • CI grep guard passes on current src/
  • PGLite repair-jsonb test confirms no DB connection / 0 rows reported
  • Migration dry-run skips all side-effect phases
  • Manual smoke on a real Postgres-backed brain: gbrain put a page with frontmatter, query frontmatter->>'key' returns the value
  • Manual smoke: gbrain repair-jsonb --dry-run against a brain with double-encoded rows reports the correct count

Attribution

Built on community PRs:

  • #187 (@knee5): splitBody, inferType wiki, JSONB triple-fix
  • #175 (@leonardsellem): parseEmbedding, getEmbeddingsByChunkIds

Both PRs reported the bugs and proposed the fixes. Codex outside-voice review during planning surfaced the missed page_versions.frontmatter propagation path, dropped the noisy-truncated-diagnostic anti-pattern from scope, and pushed for the engine-aware migration.

🤖 Generated with Claude Code

garrytan and others added 6 commits April 18, 2026 23:49
- splitBody now requires explicit timeline sentinel (<!-- timeline -->,
  --- timeline ---, or --- directly before ## Timeline / ## History).
  A bare --- in body text is a markdown horizontal rule, not a separator.
  This fixes the 83% content truncation @knee5 reported on a 1,991-article
  wiki where 4,856 of 6,680 wikilinks were lost.

- serializeMarkdown emits <!-- timeline --> sentinel for round-trip stability.

- inferType extended with /writing/, /wiki/analysis/, /wiki/guides/,
  /wiki/hardware/, /wiki/architecture/, /wiki/concepts/. Path order is
  most-specific-first so projects/blog/writing/essay.md → writing,
  not project.

- PageType union extended: writing, analysis, guide, hardware, architecture.

Updates test/import-file.test.ts to use the new sentinel.

Co-Authored-By: @knee5 (PR #187)
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Two related Postgres-string-typed-data bugs that PGLite hid:

1. JSONB double-encode (postgres-engine.ts:107,668,846 + files.ts:254):
   ${JSON.stringify(value)}::jsonb in postgres.js v3 stringified again
   on the wire, storing JSONB columns as quoted string literals. Every
   frontmatter->>'key' returned NULL on Postgres-backed brains; GIN
   indexes were inert. Switched to sql.json(value), which is the
   postgres.js-native JSONB encoder (Parameter with OID 3802).
   Affected columns: pages.frontmatter, raw_data.data,
   ingest_log.pages_updated, files.metadata. page_versions.frontmatter
   is downstream via INSERT...SELECT and propagates the fix.

2. pgvector embeddings returning as strings (utils.ts):
   getEmbeddingsByChunkIds returned "[0.1,0.2,...]" instead of
   Float32Array on Supabase, producing [NaN] cosine scores.
   Adds parseEmbedding() helper handling Float32Array, numeric arrays,
   and pgvector string format. Throws loud on malformed vectors
   (per Codex's no-silent-NaN requirement); returns null for
   non-vector strings (treated as "no embedding here"). rowToChunk
   delegates to parseEmbedding.

E2E regression test at test/e2e/postgres-jsonb.test.ts asserts
jsonb_typeof = 'object' AND col->>'k' returns expected scalar across
all 5 affected columns — the test that should have caught the original
bug. Runs in CI via the existing pgvector service.

Co-Authored-By: @knee5 (PR #187 — JSONB triple-fix)
Co-Authored-By: @leonardsellem (PR #175 — parseEmbedding)
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

extractMarkdownLinks now handles [[page]] and [[page|Display Text]]
alongside standard [text](page.md). For wiki KBs where authors omit
leading ../ (thinking in wiki-root-relative terms), resolveSlug
walks ancestor directories until it finds a matching slug.

Without this, wikilinks under tech/wiki/analysis/ targeting
[[../../finance/wiki/concepts/foo]] silently dangled when the
correct relative depth was 3 × ../ instead of 2.
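The ancestor walk can be sketched as follows, assuming slugs are slash-separated paths held in a set. `resolveSlugSketch` and `joinRelative` are hypothetical names; the real resolveSlug may differ.

```typescript
// Apply a relative path to a base directory; null if it climbs above the root.
function joinRelative(baseParts: ReadonlyArray<string>, relative: string): string | null {
  const stack = [...baseParts];
  for (const segment of relative.split("/")) {
    if (segment === "" || segment === ".") continue;
    if (segment === "..") {
      if (stack.length === 0) return null;
      stack.pop();
    } else {
      stack.push(segment);
    }
  }
  return stack.join("/");
}

// Try the link from the page's own directory first, then from each ancestor in turn.
function resolveSlugSketch(
  fromDir: string,
  target: string,
  knownSlugs: ReadonlySet<string>,
): string | null {
  const parts = fromDir.split("/").filter((p) => p !== "");
  for (let depth = parts.length; depth >= 0; depth--) {
    const candidate = joinRelative(parts.slice(0, depth), target);
    if (candidate !== null && knownSlugs.has(candidate)) return candidate;
  }
  return null;
}
```

For the case above, resolving from tech/wiki/analysis yields tech/finance/wiki/concepts/foo, which does not exist; retrying from the parent tech/wiki yields finance/wiki/concepts/foo, which does.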

Co-Authored-By: @knee5 (PR #187)
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

- New gbrain repair-jsonb command. Detects rows where
  jsonb_typeof(col) = 'string' and rewrites them via
  (col #>> '{}')::jsonb across 5 affected columns:
  pages.frontmatter, raw_data.data, ingest_log.pages_updated,
  files.metadata, page_versions.frontmatter. Idempotent — re-running
  is a no-op. PGLite engines short-circuit cleanly (the bug never
  affected the parameterized encode path PGLite uses). --dry-run
  shows what would be repaired; --json for scripting.

- New v0_12_1.ts migration orchestrator. Phases: schema → repair → verify.
  Modeled on v0_12_0 pattern, registered in migrations/index.ts.
  Runs automatically via gbrain upgrade / apply-migrations.

- CI grep guard at scripts/check-jsonb-pattern.sh fails the build if
  anyone reintroduces the ${JSON.stringify(x)}::jsonb interpolation
  pattern. Wired into bun test via package.json. Best-effort static
  analysis (multi-line and helper-wrapped variants are caught by the
  E2E round-trip test instead).

- Updates apply-migrations.test.ts expectations to account for the new
  v0.12.1 entry in the registry.
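The detect-and-rewrite step described in the first bullet can be sketched as a statement builder over a fixed target list. The function names and the exact SQL text are assumptions for illustration; only the five columns, the `jsonb_typeof(col) = 'string'` predicate, and the `(col #>> '{}')::jsonb` rewrite come from this PR.

```typescript
interface RepairTarget {
  table: string;
  column: string;
}

// The five columns this PR lists as affected. Hardcoded, never user-supplied.
const REPAIR_TARGETS: ReadonlyArray<RepairTarget> = [
  { table: "pages", column: "frontmatter" },
  { table: "raw_data", column: "data" },
  { table: "ingest_log", column: "pages_updated" },
  { table: "files", column: "metadata" },
  { table: "page_versions", column: "frontmatter" },
];

// Dry-run statement: count rows whose JSONB value is a string literal.
function countSql(t: RepairTarget): string {
  return `SELECT count(*) FROM ${t.table} WHERE jsonb_typeof(${t.column}) = 'string'`;
}

// Repair statement: unwrap the string and re-parse it as JSONB.
function repairSql(t: RepairTarget): string {
  return (
    `UPDATE ${t.table} SET ${t.column} = (${t.column} #>> '{}')::jsonb ` +
    `WHERE jsonb_typeof(${t.column}) = 'string'`
  );
}

function planRepairSketch(dryRun: boolean): string[] {
  return REPAIR_TARGETS.map((t) => (dryRun ? countSql(t) : repairSql(t)));
}
```

The WHERE clause is what makes a second run a no-op: once a row holds an object, it no longer matches the predicate.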

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

- CLAUDE.md: document repair-jsonb command, v0_12_1 migration,
  splitBody sentinel contract, inferType wiki subtypes, CI grep
  guard, new test files (repair-jsonb, migrations-v0_12_1, markdown)
- README.md: add gbrain repair-jsonb to ADMIN command reference
- INSTALL_FOR_AGENTS.md: fix verification count (6 -> 7), add
  v0.12.1 upgrade guidance for Postgres brains
- docs/GBRAIN_VERIFY.md: add check #8 for JSONB integrity on
  Postgres-backed brains
- docs/UPGRADING_DOWNSTREAM_AGENTS.md: add v0.12.1 section with
  migration steps, splitBody contract, wiki subtype inference
- skills/migrate/SKILL.md: document native wikilink extraction
  via gbrain extract links (v0.12.1+)

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

garrytan merged commit c0b6219 into master on Apr 18, 2026
4 checks passed

joedanz added a commit to joedanz/pbrain that referenced this pull request Apr 19, 2026
…an#196)

Data-correctness hotfix for Postgres-backed brains. Pulls forward the
v0.12.2 fix wave from upstream GBrain. PGLite brains were unaffected.

Three related Postgres-string-typed-data bugs that PGLite hid:

1. JSONB double-encode (postgres-engine.ts + files.ts): the
   ${JSON.stringify(value)}::jsonb interpolation pattern made
   postgres.js v3 stringify again on the wire, storing JSONB columns as
   quoted string literals. Every frontmatter->>'key' returned NULL on
   Postgres-backed brains; GIN indexes were inert. Fix: switch to
   sql.json(value), the postgres.js-native JSONB encoder. Affected:
   pages.frontmatter, raw_data.data, ingest_log.pages_updated,
   files.metadata, page_versions.frontmatter.
2. splitBody greedy --- match (markdown.ts): any standalone --- in body
   content was treated as a timeline separator, truncating wiki imports
   by up to 83% (reported by @knee5). Fix: require an explicit sentinel
   (<!-- timeline -->, --- timeline ---, or --- immediately before
   ## Timeline / ## History). Plain --- is a markdown horizontal rule.
3. parseEmbedding NaN scores (utils.ts): Supabase returned embedding
   columns as JSON strings; getEmbeddingsByChunkIds yielded NaN query
   scores. Fix: normalize via parseEmbedding.

Also adds /wiki/ subdirectory type inference (analysis, guides,
hardware, architecture, writing).

New:
- pbrain repair-jsonb [--dry-run] [--json] — standalone repair CLI
- src/commands/migrations/v0_12_2.ts — orchestrator (schema → repair
  → verify → record). Registered in migrations/index.ts.
- scripts/check-jsonb-pattern.sh — CI grep guard against regressions.
  Wired into `bun test` via the check:jsonb npm script.
- test/e2e/postgres-jsonb.test.ts — E2E round-trip regression.

Notes on deferred upstream work: upstream's v0.12.0 orchestrator
(knowledge-graph auto-wire) and v0_12_0.ts are NOT registered in this
fork's migration registry — PBrain has not adopted the knowledge graph
layer. The apply-migrations unit tests have been rewritten to assert
the planner's invariants against v0.12.2 instead of upstream's v0.11.0.

Original fixes contributed by:
- @knee5 (PR garrytan#187 — splitBody, inferType wiki, JSONB triple-fix)
- @leonardsellem (PR garrytan#175 — parseEmbedding, getEmbeddingsByChunkIds)

(cherry picked from commit c0b621923b641eae0e7d6228e50d9cdaa6bd97ae)

Co-Authored-By: Garry Tan <[email protected]>
Co-Authored-By: knee5 <[email protected]>
Co-Authored-By: leonardsellem <[email protected]>

mindmnml-del pushed a commit to mindmnml-del/gbrain that referenced this pull request Apr 27, 2026
Defense-in-depth for cosineSimilarity. parseEmbedding (landed in garrytan#196)
now guarantees valid Float32Array on the happy path, but cosineSimilarity
is a public export that can be called with data from other sources
(future embedding models with different dims, direct callers, test
fixtures). This makes the math non-explosive under three edge cases:

1. **Dimension mismatch**: bound the loop by Math.min(a.length, b.length)
   instead of a.length alone. Previously, if b was shorter, the tail
   would multiply undefined * undefined → NaN, which then poisoned the
   magB accumulator and the returned score. Now the comparison just
   runs over the common prefix.

2. **Non-finite denominator**: extremely large vectors can produce
   Infinity in magA/magB squared sums. sqrt(Infinity) * sqrt(Infinity)
   = Infinity, and the old `denom === 0` guard misses that case
   (Infinity !== 0). Guard with Number.isFinite(denom).

3. **Non-finite result**: final belt-and-suspenders check on dot/denom
   so no NaN or Infinity leaks into downstream score blending
   (0.7*rrf + 0.3*cosine) where NaN would propagate through every
   result the way garrytan#196 fixed at the data-ingest boundary.

Purely additive defensive coding. No behavior change on valid inputs
(same scores for dim-matched finite vectors). No API change. +5/-3.
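The three guards can be sketched together as below. This is an illustration rather than the referenced commit's diff: `cosineSimilaritySketch` is a hypothetical name, and returning 0 in the guarded cases is an assumption.

```typescript
function cosineSimilaritySketch(a: Float32Array, b: Float32Array): number {
  // Guard 1: iterate only over the common prefix when dimensions differ.
  const n = Math.min(a.length, b.length);
  let dot = 0;
  let magA = 0;
  let magB = 0;
  for (let i = 0; i < n; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    magA += x * x;
    magB += y * y;
  }
  // Guard 2: reject zero and non-finite denominators.
  const denom = Math.sqrt(magA) * Math.sqrt(magB);
  if (denom === 0 || !Number.isFinite(denom)) return 0; // assumed fallback value
  // Guard 3: never let NaN or Infinity reach score blending.
  const score = dot / denom;
  return Number.isFinite(score) ? score : 0;
}
```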