fix: JSONB double-encode + splitBody wiki + parseEmbedding (v0.12.1)#196
Merged
- splitBody now requires an explicit timeline sentinel (<!-- timeline -->, --- timeline ---, or --- directly before ## Timeline / ## History). A bare --- in body text is a markdown horizontal rule, not a separator. This fixes the 83% content truncation @knee5 reported on a 1,991-article wiki where 4,856 of 6,680 wikilinks were lost.
- serializeMarkdown emits the <!-- timeline --> sentinel for round-trip stability.
- inferType extended with /writing/, /wiki/analysis/, /wiki/guides/, /wiki/hardware/, /wiki/architecture/, /wiki/concepts/. Path order is most-specific-first so projects/blog/writing/essay.md → writing, not project.
- PageType union extended: writing, analysis, guide, hardware, architecture.
Updates test/import-file.test.ts to use the new sentinel.
Co-Authored-By: @knee5 (PR #187)
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
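The sentinel rule above can be sketched as follows. This is a minimal illustration, not the code in markdown.ts: `splitBodySketch`, its return shape, the tolerance for blank lines between `---` and the heading, and the choice to keep the heading inside the timeline are all assumptions.

```typescript
// Hypothetical sketch of the sentinel contract; names and return shape are assumptions.
interface SplitResult {
  body: string;
  timeline: string;
}

const EXPLICIT_SENTINELS = ["<!-- timeline -->", "--- timeline ---"];
const TIMELINE_HEADING = /^##\s+(Timeline|History)\b/;

function splitBodySketch(content: string): SplitResult {
  const lines = content.split("\n");
  const at = (k: number): string => (lines[k] ?? "").trim();

  for (let i = 0; i < lines.length; i++) {
    // Explicit sentinels always split; the sentinel line itself is dropped.
    if (EXPLICIT_SENTINELS.includes(at(i))) {
      return {
        body: lines.slice(0, i).join("\n").trim(),
        timeline: lines.slice(i + 1).join("\n").trim(),
      };
    }
    if (at(i) === "---") {
      // A bare --- only counts when the next non-blank line is a timeline heading.
      let j = i + 1;
      while (j < lines.length && at(j) === "") j++;
      if (j < lines.length && TIMELINE_HEADING.test(at(j))) {
        return {
          body: lines.slice(0, i).join("\n").trim(),
          timeline: lines.slice(j).join("\n").trim(),
        };
      }
      // Otherwise it is a horizontal rule: keep scanning.
    }
  }
  return { body: content.trim(), timeline: "" };
}
```

With this shape, a page that uses `---` as a horizontal rule keeps its full body, and only an explicit sentinel or a `---` followed by a Timeline/History heading produces a split.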
Two related Postgres-string-typed-data bugs that PGLite hid:
1. JSONB double-encode (postgres-engine.ts:107,668,846 + files.ts:254):
${JSON.stringify(value)}::jsonb in postgres.js v3 stringified again
on the wire, storing JSONB columns as quoted string literals. Every
frontmatter->>'key' returned NULL on Postgres-backed brains; GIN
indexes were inert. Switched to sql.json(value), which is the
postgres.js-native JSONB encoder (Parameter with OID 3802).
Affected columns: pages.frontmatter, raw_data.data,
ingest_log.pages_updated, files.metadata. page_versions.frontmatter
is downstream via INSERT...SELECT and propagates the fix.
2. pgvector embeddings returning as strings (utils.ts):
getEmbeddingsByChunkIds returned "[0.1,0.2,...]" instead of
Float32Array on Supabase, producing [NaN] cosine scores.
Adds parseEmbedding() helper handling Float32Array, numeric arrays,
and pgvector string format. Throws loud on malformed vectors
(per Codex's no-silent-NaN requirement); returns null for
non-vector strings (treated as "no embedding here"). rowToChunk
delegates to parseEmbedding.
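The normalization described in item 2 might look roughly like this. It is a sketch under stated assumptions rather than the helper in utils.ts: `parseEmbeddingSketch` and `toFiniteVector` are hypothetical names, and the handling of null input and of an empty `[]` string is assumed.

```typescript
// Hypothetical sketch; not the helper from utils.ts.
function toFiniteVector(values: readonly unknown[]): Float32Array {
  const out = new Float32Array(values.length);
  values.forEach((v, i) => {
    // Throw loudly instead of letting NaN flow into cosine scores.
    if (typeof v !== "number" || !Number.isFinite(v)) {
      throw new Error(`Malformed embedding component at index ${i}: ${String(v)}`);
    }
    out[i] = v;
  });
  return out;
}

function parseEmbeddingSketch(value: unknown): Float32Array | null {
  if (value === null || value === undefined) return null; // assumption: no embedding
  if (value instanceof Float32Array) return value;
  if (Array.isArray(value)) return toFiniteVector(value);
  if (typeof value === "string") {
    const text = value.trim();
    // Strings that do not look like "[...]" are treated as "no embedding here".
    if (!text.startsWith("[") || !text.endsWith("]")) return null;
    const inner = text.slice(1, -1).trim();
    if (inner === "") return new Float32Array(0); // assumption: empty vector
    const parts = inner.split(",").map((p) => p.trim());
    // Number("") is 0, so empty components are mapped to NaN and rejected.
    return toFiniteVector(parts.map((p) => (p === "" ? NaN : Number(p))));
  }
  return null;
}
```

The split between "throw" and "return null" follows the description above: a value that claims to be a vector but is malformed is an error, while a value that is not a vector at all is simply absent.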
E2E regression test at test/e2e/postgres-jsonb.test.ts asserts
jsonb_typeof = 'object' AND col->>'k' returns expected scalar across
all 5 affected columns — the test that should have caught the original
bug. Runs in CI via the existing pgvector service.
Co-Authored-By: @knee5 (PR #187 — JSONB triple-fix)
Co-Authored-By: @leonardsellem (PR #175 — parseEmbedding)
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
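The effect of the double-encode in item 1 can be reproduced without a database. The sketch below assumes the driver's extra serialization behaves like a second JSON.stringify, which is a simplification for illustration; `jsonTypeOf` and `readKey` are hypothetical stand-ins for Postgres's jsonb_typeof and ->> operator.

```typescript
// Hypothetical stand-in for jsonb_typeof(): reports the top-level JSON type.
function jsonTypeOf(wireText: string): string {
  const parsed: unknown = JSON.parse(wireText);
  if (parsed === null) return "null";
  if (Array.isArray(parsed)) return "array";
  return typeof parsed; // "object", "string", "number" or "boolean"
}

// Hypothetical stand-in for col->>'key': only objects have keys.
function readKey(wireText: string, key: string): string | null {
  const parsed: unknown = JSON.parse(wireText);
  if (parsed === null || typeof parsed !== "object" || Array.isArray(parsed)) {
    return null;
  }
  const value = (parsed as Record<string, unknown>)[key];
  if (value === undefined || value === null) return null;
  return typeof value === "string" ? value : JSON.stringify(value);
}

const frontmatter = { title: "Hello", tags: ["a", "b"] };

// Encoded once: the stored value is a JSON object.
const encodedOnce = JSON.stringify(frontmatter);
// Encoded twice: the stored value is a JSON string that merely contains JSON text.
const encodedTwice = JSON.stringify(encodedOnce);

console.log(jsonTypeOf(encodedOnce), readKey(encodedOnce, "title"));
console.log(jsonTypeOf(encodedTwice), readKey(encodedTwice, "title"));
```

The second value is a string scalar rather than an object, which is why key lookups come back empty and why the regression test checks the stored type as well as the key lookup.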
extractMarkdownLinks now handles [[page]] and [[page|Display Text]] alongside standard [text](page.md). For wiki KBs where authors omit leading ../ (thinking in wiki-root-relative terms), resolveSlug walks ancestor directories until it finds a matching slug. Without this, wikilinks under tech/wiki/analysis/ targeting [[../../finance/wiki/concepts/foo]] silently dangled when the correct relative depth was 3 × ../ instead of 2.
Co-Authored-By: @knee5 (PR #187)
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
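The wikilink syntax and the ancestor walk described above can be sketched together. This is illustrative only: `extractWikilinks`, `normalizePath`, `resolveSlugSketch` and the set of known slugs are hypothetical, and the real resolveSlug may order or filter candidates differently.

```typescript
// Hypothetical sketch of wikilink extraction and ancestor-directory resolution.
interface Wikilink {
  target: string;
  display: string | null;
}

function extractWikilinks(markdown: string): Wikilink[] {
  // Matches [[page]] and [[page|Display Text]].
  const pattern = /\[\[([^\]|]+)(?:\|([^\]]*))?\]\]/g;
  const links: Wikilink[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(markdown)) !== null) {
    links.push({ target: (match[1] ?? "").trim(), display: match[2] ?? null });
  }
  return links;
}

// Collapses "." and ".." segments; returns null if the path escapes the root.
function normalizePath(path: string): string | null {
  const out: string[] = [];
  for (const part of path.split("/")) {
    if (part === "" || part === ".") continue;
    if (part === "..") {
      if (out.length === 0) return null;
      out.pop();
    } else {
      out.push(part);
    }
  }
  return out.join("/");
}

// Tries the link relative to fromDir, then relative to each ancestor directory.
function resolveSlugSketch(
  fromDir: string,
  target: string,
  knownSlugs: ReadonlySet<string>,
): string | null {
  const dirParts = fromDir.split("/").filter((p) => p !== "");
  for (let depth = dirParts.length; depth >= 0; depth--) {
    const base = dirParts.slice(0, depth).join("/");
    const candidate = normalizePath(base === "" ? target : `${base}/${target}`);
    if (candidate !== null && knownSlugs.has(candidate)) return candidate;
  }
  return null;
}
```

For the case in the commit message, resolving `../../finance/wiki/concepts/foo` from `tech/wiki/analysis` fails at the page's own directory but succeeds one ancestor up, which is equivalent to the author having written one more `../`.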
- New gbrain repair-jsonb command. Detects rows where
jsonb_typeof(col) = 'string' and rewrites them via
(col #>> '{}')::jsonb across 5 affected columns:
pages.frontmatter, raw_data.data, ingest_log.pages_updated,
files.metadata, page_versions.frontmatter. Idempotent — re-running
is a no-op. PGLite engines short-circuit cleanly (the bug never
affected the parameterized encode path PGLite uses). --dry-run
shows what would be repaired; --json for scripting.
- New v0_12_1.ts migration orchestrator. Phases: schema → repair → verify.
Modeled on v0_12_0 pattern, registered in migrations/index.ts.
Runs automatically via gbrain upgrade / apply-migrations.
- CI grep guard at scripts/check-jsonb-pattern.sh fails the build if
anyone reintroduces the ${JSON.stringify(x)}::jsonb interpolation
pattern. Wired into bun test via package.json. Best-effort static
analysis (multi-line and helper-wrapped variants are caught by the
E2E round-trip test instead).
- Updates apply-migrations.test.ts expectations to account for the new
v0.12.1 entry in the registry.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
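The repair step and its idempotence can be illustrated in memory. A minimal sketch, assuming rows are plain objects and that re-parsing the stored string mirrors what (col #>> '{}')::jsonb does in SQL; `Row` and `repairRows` are hypothetical names, not the command's actual code, and skipping unparseable strings is an assumption.

```typescript
// Hypothetical in-memory model of the repair; not the repair-jsonb implementation.
interface Row {
  id: number;
  frontmatter: unknown;
}

// Returns how many rows were (or, with dryRun, would be) repaired.
function repairRows(rows: Row[], dryRun: boolean): number {
  let repaired = 0;
  for (const row of rows) {
    // Mirrors the jsonb_typeof(col) = 'string' filter.
    if (typeof row.frontmatter !== "string") continue;
    let parsed: unknown;
    try {
      parsed = JSON.parse(row.frontmatter);
    } catch {
      continue; // assumption: leave strings that are not JSON text untouched
    }
    repaired++;
    if (!dryRun) row.frontmatter = parsed;
  }
  return repaired;
}

const rows: Row[] = [
  { id: 1, frontmatter: JSON.stringify({ title: "Double-encoded" }) },
  { id: 2, frontmatter: { title: "Already fine" } },
];

const wouldRepair = repairRows(rows, true);  // dry run: counts only
const didRepair = repairRows(rows, false);   // rewrites the affected row
const secondPass = repairRows(rows, false);  // nothing left to repair
console.log(wouldRepair, didRepair, secondPass);
```

Because the filter only selects string-typed values, a repaired row no longer qualifies on the next run, which is what makes re-running a no-op.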
- CLAUDE.md: document repair-jsonb command, v0_12_1 migration, splitBody sentinel contract, inferType wiki subtypes, CI grep guard, new test files (repair-jsonb, migrations-v0_12_1, markdown)
- README.md: add gbrain repair-jsonb to ADMIN command reference
- INSTALL_FOR_AGENTS.md: fix verification count (6 -> 7), add v0.12.1 upgrade guidance for Postgres brains
- docs/GBRAIN_VERIFY.md: add check #8 for JSONB integrity on Postgres-backed brains
- docs/UPGRADING_DOWNSTREAM_AGENTS.md: add v0.12.1 section with migration steps, splitBody contract, wiki subtype inference
- skills/migrate/SKILL.md: document native wikilink extraction via gbrain extract links (v0.12.1+)
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
# Conflicts: # CHANGELOG.md # CLAUDE.md
joedanz added a commit to joedanz/pbrain that referenced this pull request Apr 19, 2026
…an#196)
Data-correctness hotfix for Postgres-backed brains. Pulls forward the v0.12.2 fix wave from upstream GBrain. PGLite brains were unaffected.
Three related Postgres-string-typed-data bugs that PGLite hid:
1. JSONB double-encode (postgres-engine.ts + files.ts): the ${JSON.stringify(value)}::jsonb interpolation pattern made postgres.js v3 stringify again on the wire, storing JSONB columns as quoted string literals. Every frontmatter->>'key' returned NULL on Postgres-backed brains; GIN indexes were inert. Fix: switch to sql.json(value), the postgres.js-native JSONB encoder. Affected: pages.frontmatter, raw_data.data, ingest_log.pages_updated, files.metadata, page_versions.frontmatter.
2. splitBody greedy --- match (markdown.ts): any standalone --- in body content was treated as a timeline separator, truncating wiki imports by up to 83% (reported by @knee5). Fix: require an explicit sentinel (<!-- timeline -->, --- timeline ---, or --- immediately before ## Timeline / ## History). Plain --- is a markdown horizontal rule.
3. parseEmbedding NaN scores (utils.ts): Supabase returned embedding columns as JSON strings; getEmbeddingsByChunkIds yielded NaN query scores. Fix: normalize via parseEmbedding.
Also adds /wiki/ subdirectory type inference (analysis, guides, hardware, architecture, writing).
New:
- pbrain repair-jsonb [--dry-run] [--json]: standalone repair CLI
- src/commands/migrations/v0_12_2.ts: orchestrator (schema → repair → verify → record). Registered in migrations/index.ts.
- scripts/check-jsonb-pattern.sh: CI grep guard against regressions. Wired into `bun test` via the check:jsonb npm script.
- test/e2e/postgres-jsonb.test.ts: E2E round-trip regression.
Notes on deferred upstream work: upstream's v0.12.0 orchestrator (knowledge-graph auto-wire) and v0_12_0.ts are NOT registered in this fork's migration registry, because PBrain has not adopted the knowledge graph layer.
The apply-migrations unit tests have been rewritten to assert the planner's invariants against v0.12.2 instead of upstream's v0.11.0.
Original fixes contributed by:
- @knee5 (PR garrytan#187: splitBody, inferType wiki, JSONB triple-fix)
- @leonardsellem (PR garrytan#175: parseEmbedding, getEmbeddingsByChunkIds)
(cherry picked from commit c0b621923b641eae0e7d6228e50d9cdaa6bd97ae)
Co-Authored-By: Garry Tan <[email protected]>
Co-Authored-By: knee5 <[email protected]>
Co-Authored-By: leonardsellem <[email protected]>
mindmnml-del pushed a commit to mindmnml-del/gbrain that referenced this pull request Apr 27, 2026
Defense-in-depth for cosineSimilarity. parseEmbedding (landed in garrytan#196) now guarantees valid Float32Array on the happy path, but cosineSimilarity is a public export that can be called with data from other sources (future embedding models with different dims, direct callers, test fixtures). This makes the math non-explosive under three edge cases:
1. **Dimension mismatch**: bound the loop by Math.min(a.length, b.length) instead of a.length alone. Previously, if b was shorter, the tail would multiply undefined * undefined → NaN, which then poisoned the magB accumulator and the returned score. Now the comparison just runs over the common prefix.
2. **Non-finite denominator**: extremely large vectors can produce Infinity in magA/magB squared sums. sqrt(Infinity) * sqrt(Infinity) = Infinity, and the old `denom === 0` guard misses that case (Infinity !== 0). Guard with Number.isFinite(denom).
3. **Non-finite result**: final belt-and-suspenders check on dot/denom so no NaN or Infinity leaks into downstream score blending (0.7*rrf + 0.3*cosine) where NaN would propagate through every result the way garrytan#196 fixed at the data-ingest boundary.
Purely additive defensive coding. No behavior change on valid inputs (same scores for dim-matched finite vectors). No API change. +5/-3.
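The three guards can be sketched in one function. This is an illustration of the described checks, not the committed diff: `cosineSimilaritySketch` is a hypothetical name, and returning 0 when a guard trips is an assumption, since the commit message does not state the fallback value.

```typescript
// Hypothetical sketch of the three guards; the fallback value 0 is an assumption.
function cosineSimilaritySketch(a: Float32Array, b: Float32Array): number {
  // Guard 1: only iterate over the common prefix of the two vectors.
  const n = Math.min(a.length, b.length);
  let dot = 0;
  let magA = 0;
  let magB = 0;
  for (let i = 0; i < n; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    magA += x * x;
    magB += y * y;
  }
  const denom = Math.sqrt(magA) * Math.sqrt(magB);
  // Guard 2: reject zero, NaN and Infinity denominators alike.
  if (denom === 0 || !Number.isFinite(denom)) return 0;
  const score = dot / denom;
  // Guard 3: never let a non-finite score reach downstream blending.
  return Number.isFinite(score) ? score : 0;
}
```

On dimension-matched finite vectors the guards never fire, so the score is the ordinary cosine; they only change the outcome for inputs that would otherwise yield NaN or Infinity.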
Summary
Data-correctness hotfix for v0.12.0 Postgres-backed brains. PGLite users were unaffected. Bundles community PRs #187 (@knee5) and #175 (@leonardsellem) with expanded migration scope, schema audit (5 affected JSONB columns vs 3 originally reported), CI grep guard, and an E2E regression test that should have caught the original bug.
- **JSONB double-encode (Postgres only).** Every `${JSON.stringify(value)}::jsonb` interpolation in postgres-engine.ts and files.ts caused postgres.js v3 to stringify again on the wire, storing JSONB columns as quoted string literals. Every `frontmatter->>'key'` returned NULL on Postgres-backed brains. GIN indexes were inert. Switched to `sql.json(value)` (postgres.js native, OID 3802). Affected columns: `pages.frontmatter`, `raw_data.data`, `ingest_log.pages_updated`, `files.metadata`, `page_versions.frontmatter`. PGLite hid this bug entirely because it uses a different driver path.
- **splitBody truncation.** Treated any standalone `---` as a timeline separator, causing 83% content truncation on wiki corpora (1,991-article wiki, 4,856 of 6,680 wikilinks lost). New behavior requires an explicit sentinel: `<!-- timeline -->`, `--- timeline ---`, or `---` directly before a `## Timeline` / `## History` heading.
- **inferType wiki subtypes.** Added `/writing/`, `/wiki/analysis/`, `/wiki/guides/`, `/wiki/hardware/`, `/wiki/architecture/`, `/wiki/concepts/`. Path order is most-specific-first so `projects/blog/writing/essay.md` → `writing`.
- **pgvector NaN scores (Supabase).** `getEmbeddingsByChunkIds` returned strings instead of `Float32Array` on Supabase, producing `[NaN]` cosine scores. Adds `parseEmbedding()` helper. Throws loud on malformed vectors (no silent NaN); returns null for non-vector strings.
- **Wikilink extraction.** `[[page]]` and `[[page|Display]]` syntaxes now extracted alongside standard `[text](page.md)`. `resolveSlug()` does ancestor-search for wiki KBs that omit `../`.
- **Migration.** New `gbrain repair-jsonb` command + `v0_12_1` orchestrator (schema → repair → verify → record). Idempotent: re-running is a no-op. PGLite engines short-circuit cleanly.
- **CI grep guard** at `scripts/check-jsonb-pattern.sh` fails the build if anyone reintroduces the `${JSON.stringify(x)}::jsonb` pattern.
Test Coverage
E2E suite: 120/120 pass against pgvector/pg16 Docker container. Unit suite: 1415/1415 pass. CI grep guard passes on this diff (no `JSON.stringify(x)::jsonb` patterns in src/).
Pre-Landing Review
No new issues found. Specialists already comprehensively covered by `/plan-eng-review` + Codex outside-voice review during planning (25+ findings, 3 material tensions adjudicated). `repair-jsonb.ts` uses `sql.unsafe` with table/column names from a hardcoded `TARGETS` array, so there is no injection vector. Migration is idempotent. `parseEmbedding` throws loud on malformed input per Codex's no-silent-NaN requirement.
Plan Completion
All 24 planned items DONE. Scope reduced from 9-PR bundle to 2-PR hotfix per Codex outside-voice scope challenge. The remaining 7 PRs (#184, #177, #132, #114, #115, #119, #123) deferred to v0.12.2 follow-up wave per `/Users/garrytan/.claude/plans/system-instruction-you-are-working-elegant-squid.md`.
TODOS
No items in `TODOS.md` were specifically completed by this PR (it focused on BrainBench eval work).
Documentation
Documentation was synced to v0.12.1 in commit `998ef82`. Six files updated to reflect the JSONB hotfix, splitBody sentinel contract, wiki inferType, and native wikilink extraction.
- README.md: added `gbrain repair-jsonb [--dry-run]` to the ADMIN command reference
- skills/migrate/SKILL.md: documented native `[[wikilink]]` and `[[wikilink|Display]]` extraction
CHANGELOG.md was left untouched (already comprehensive). VERSION bumped to 0.12.1.
Test plan
- test/e2e/postgres-jsonb.test.ts)
- `gbrain put` a page with frontmatter, query `frontmatter->>'key'` returns the value
- `gbrain repair-jsonb --dry-run` against a brain with double-encoded rows reports the correct count
Attribution
Built on community PRs:
- `parseEmbedding()` helper, `getEmbeddingsByChunkIds` fix.
Both PRs reported the bugs and proposed the fixes. Codex outside-voice review during planning surfaced the missed `page_versions.frontmatter` propagation path, dropped the noisy-truncated-diagnostic anti-pattern from scope, and pushed for the engine-aware migration.
🤖 Generated with Claude Code