postgres-engine: scope search statement_timeout to the transaction#158
Open
garagon wants to merge 1 commit intogarrytan:masterfrom
Open
postgres-engine: scope search statement_timeout to the transaction#158garagon wants to merge 1 commit intogarrytan:masterfrom
garagon wants to merge 1 commit intogarrytan:masterfrom
Conversation
searchKeyword and searchVector run on a pooled postgres.js client
(max: 10 by default). The original code bounded each search with
await sql`SET statement_timeout = '8s'`
try { await sql`<query>` }
finally { await sql`SET statement_timeout = '0'` }
but every tagged template is an independent round-trip that picks an
arbitrary connection from the pool. The SET, the query, and the reset
could all land on DIFFERENT connections. In practice the GUC sticks
to whichever connection ran the SET and then gets returned to the
pool — the next unrelated caller on that connection inherits the 8s
timeout (clipping legitimate long queries) or the reset-to-0 (disabling
the guard for whoever expected it). A crash in the middle leaves the
state set permanently.
Wrap each search in sql.begin(async sql => …). postgres.js reserves
a single connection for the transaction body, so the SET LOCAL, the
query, and the implicit COMMIT all run on the same connection. SET
LOCAL scopes the GUC to the transaction — COMMIT or ROLLBACK restores
the previous value automatically, regardless of the code path out.
Error paths can no longer leak the GUC.
No API change. Timeout value and semantics are identical (8s cap on
search queries, no effect on embed --all / bulk import which runs
outside these methods). Only one transaction per search — BEGIN +
COMMIT round-trips are negligible next to a ranked FTS or pgvector
query.
Also closes the earlier audit finding R4-F002 which reported the same
pattern on searchKeyword. This PR covers both searchKeyword and
searchVector so the pool-leak class is fully closed.
Tests (test/postgres-engine.test.ts, new file):
- No bare SET statement_timeout remains after stripping comments.
- searchKeyword and searchVector each wrap their query in sql.begin.
- Both use SET LOCAL.
- Neither explicitly clears the timeout with SET statement_timeout=0.
Source-level guardrails keep the fast unit suite DB-free. Live
Postgres coverage of the search path is in test/e2e/search-quality.test.ts,
which continues to exercise these methods end-to-end against
pgvector when DATABASE_URL is set.
garrytan
added a commit
that referenced
this pull request
Apr 19, 2026
Master shipped v0.12.1 (extract N+1 + migration timeout) and v0.12.2 (JSONB double-encode + splitBody + wiki types + parseEmbedding) while this wave was mid-flight. Ships the remaining pieces as v0.12.3: - sync deadlock (#132, @sunnnybala) - statement_timeout scoping (#158, @garagon) - Obsidian wikilinks + domain patterns (#187 slice, @knee5) - gbrain orphans command (#187 slice, @knee5) - tryParseEmbedding() availability helper - doctor detection for jsonb_integrity + markdown_body_completeness No schema, no migration, no data touch. Co-Authored-By: Claude Opus 4.7 <[email protected]>
4 tasks
garrytan
added a commit
that referenced
this pull request
Apr 19, 2026
…kilinks, orphans (#216) * fix(sync): remove nested transaction that deadlocks > 10 file syncs sync.ts wraps the add/modify loop in engine.transaction(), and each importFromContent inside opens another one. PGLite's _runExclusiveTransaction is a non-reentrant mutex — the second call queues on the mutex the first is holding, and the process hangs forever in ep_poll. Reproduced with a 15-file commit: unpatched hangs, patched runs in 3.4s. Fix drops the outer wrap; per-file atomicity is correct anyway (one file's failure should not roll back the others). (cherry picked from commit 4a1ac00) * test(sync): regression guard for #132 top-level engine.transaction wrap Reads src/commands/sync.ts verbatim and asserts no uncommented engine.transaction() call appears above the add/modify loop. Protects against silent reintroduction of the nested-mutex deadlock that hung > 10-file syncs forever in ep_poll. * feat(utils): tryParseEmbedding() skip+warn sibling for availability path parseEmbedding() throws on structural corruption — right call for ingest/ migrate paths where silent skips would be data loss. Wrong call for search/rescore paths where one corrupt row in 10K would kill every query that touches it. tryParseEmbedding() wraps parseEmbedding in try/catch: returns null on any shape that would throw, warns once per session so the bad row is visible in logs. Use it anywhere we'd rather degrade ranking than blow up the whole query. Retrofit postgres-engine.getEmbeddingsByChunkIds (the #175 slice call site) — the 5-line rescore loop was the direct motivator. Keep the throwing parseEmbedding() for everything else (pglite-engine rowToChunk, migrate-engine round-trips, ingest). * postgres-engine: scope search statement_timeout to the transaction searchKeyword and searchVector run on a pooled postgres.js client (max: 10 by default). The original code bounded each search with await sql`SET statement_timeout = '8s'` try { await sql`<query>` } finally { await sql`SET statement_timeout = '0'` } but every tagged template is an independent round-trip that picks an arbitrary connection from the pool. The SET, the query, and the reset could all land on DIFFERENT connections. In practice the GUC sticks to whichever connection ran the SET and then gets returned to the pool — the next unrelated caller on that connection inherits the 8s timeout (clipping legitimate long queries) or the reset-to-0 (disabling the guard for whoever expected it). A crash in the middle leaves the state set permanently. Wrap each search in sql.begin(async sql => …). postgres.js reserves a single connection for the transaction body, so the SET LOCAL, the query, and the implicit COMMIT all run on the same connection. SET LOCAL scopes the GUC to the transaction — COMMIT or ROLLBACK restores the previous value automatically, regardless of the code path out. Error paths can no longer leak the GUC. No API change. Timeout value and semantics are identical (8s cap on search queries, no effect on embed --all / bulk import which runs outside these methods). Only one transaction per search — BEGIN + COMMIT round-trips are negligible next to a ranked FTS or pgvector query. Also closes the earlier audit finding R4-F002 which reported the same pattern on searchKeyword. This PR covers both searchKeyword and searchVector so the pool-leak class is fully closed. Tests (test/postgres-engine.test.ts, new file): - No bare SET statement_timeout remains after stripping comments. - searchKeyword and searchVector each wrap their query in sql.begin. - Both use SET LOCAL. - Neither explicitly clears the timeout with SET statement_timeout=0. Source-level guardrails keep the fast unit suite DB-free. Live Postgres coverage of the search path is in test/e2e/search-quality.test.ts, which continues to exercise these methods end-to-end against pgvector when DATABASE_URL is set. (cherry picked from commit 6146c3b) * feat(orphans): add gbrain orphans command for finding under-connected pages Surfaces pages with zero inbound wikilinks. Essential for content enrichment cycles in KBs with 1000+ pages. By default filters out auto-generated pages, raw sources, and pseudo-pages where no inbound links is expected; --include-pseudo to disable. Supports text (grouped by domain), --json, --count outputs. Also exposed as find_orphans MCP operation. Tests cover basic detection, filtering, all output modes. Co-Authored-By: Claude Sonnet 4.6 <[email protected]> (cherry picked from commit f50954f) * feat(extract): support Obsidian wikilinks + wiki-style domain slugs in canonical extractor extractEntityRefs now recognizes both syntaxes equally: [Name](people/slug) -- upstream original [[people/slug|Name]] -- Obsidian wikilink (new) Extends DIR_PATTERN to include domain-organized wiki slugs used by Karpathy-style knowledge bases: - entities (legacy prefix some brains keep during migration) - projects (gbrain canonical, was missing from regex) - tech, finance, personal, openclaw (domain-organized wiki roots) Before this change, a 2,100-page brain with wikilinks throughout extracted zero auto-links on put_page because the regex only matched markdown-style [name](path). After: 1,377 new typed edges on a single extract --source db pass over the same corpus. Matches the behavior of the extract.ts filesystem walker (which already handled wikilinks as of the wiki-markdown-compat fix wave), so the db and fs sources now produce the same link graph from the same content. Both patterns share the DIR_PATTERN constant so adding a new entity dir only requires updating one string. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]> (cherry picked from commit 1cfb156) * feat(doctor): jsonb_integrity + markdown_body_completeness detection Add two v0.12.1-era reliability checks to `gbrain doctor`: - `jsonb_integrity` scans the 4 known write sites from the v0.12.0 double-encode bug (pages.frontmatter, raw_data.data, ingest_log.pages_updated, files.metadata) and reports rows where jsonb_typeof(col) = 'string'. The fix hint points at `gbrain repair-jsonb` (the standalone repair command shipped in v0.12.1). - `markdown_body_completeness` flags pages whose compiled_truth is <30% of the raw source content length when raw has multiple H2/H3 boundaries. Heuristic only; suggests `gbrain sync --force` or `gbrain import --force <slug>`. Also adds test/e2e/jsonb-roundtrip.test.ts — the regression coverage that should have caught the original double-encode bug. Hits all four write sites against real Postgres and asserts jsonb_typeof='object' plus `->>'key'` returns the expected scalar. Detection only: doctor diagnoses, `gbrain repair-jsonb` treats. No overlap with the standalone repair path. * chore: bump to v0.12.3 + changelog (reliability wave) Master shipped v0.12.1 (extract N+1 + migration timeout) and v0.12.2 (JSONB double-encode + splitBody + wiki types + parseEmbedding) while this wave was mid-flight. Ships the remaining pieces as v0.12.3: - sync deadlock (#132, @sunnnybala) - statement_timeout scoping (#158, @garagon) - Obsidian wikilinks + domain patterns (#187 slice, @knee5) - gbrain orphans command (#187 slice, @knee5) - tryParseEmbedding() availability helper - doctor detection for jsonb_integrity + markdown_body_completeness No schema, no migration, no data touch. Co-Authored-By: Claude Opus 4.7 <[email protected]> * docs: update project documentation for v0.12.3 CLAUDE.md: - Add src/commands/orphans.ts entry - Expand src/commands/doctor.ts with v0.12.3 jsonb_integrity + markdown_body_completeness check descriptions - Update src/core/link-extraction.ts to mention Obsidian wikilinks + extended DIR_PATTERN (entities/projects/tech/finance/personal/openclaw) - Update src/core/utils.ts to mention tryParseEmbedding sibling - Update src/core/postgres-engine.ts to note statement_timeout scoping + tryParseEmbedding usage in getEmbeddingsByChunkIds - Add Key commands added in v0.12.3 section (orphans, doctor checks) - Add test/orphans.test.ts, test/postgres-engine.test.ts, updated descriptions for test/sync.test.ts, test/doctor.test.ts, test/utils.test.ts - Add test/e2e/jsonb-roundtrip.test.ts with note on intentional overlap - Bump operation count from ~36 to ~41 (find_orphans shipped in v0.12.3) README.md: - Add gbrain orphans to ADMIN commands block Co-Authored-By: Claude Opus 4.7 <[email protected]> --------- Co-authored-by: sunnnybala <[email protected]> Co-authored-by: Gustavo Aragon <[email protected]> Co-authored-by: Clevin Canales <[email protected]> Co-authored-by: Claude Sonnet 4.6 <[email protected]> Co-authored-by: Clevin Canales <[email protected]>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
searchKeywordandsearchVectorinsrc/core/postgres-engine.tsare the two hot-path search methods. Both run on a shared postgres.js client whose default pool size is 10 connections, and both try to cap query runtime with astatement_timeoutGUC so a bad query can't hog a connection forever:The intent is right. The implementation leaks because every tagged-template call is an independent round-trip that picks an arbitrary connection from the pool. Under concurrency — which is exactly the shape of a hot search path — the three statements can land on different connections:
SETmodifies connection A. Connection A returns to the pool withstatement_timeout = '8s'still set.finallyreset lands on connection C.What the next caller sees depends on which connection the pool hands them:
embed --all, large imports, admin queries, anything that's allowed to take time.An error between the
SETand thefinallymakes it permanent. The leaked GUC sits on a pooled connection until the pool churns it out or the process restarts.Reported as R6-F006 in the latest audit; the earlier audit filed the same pattern on
searchKeywordalone as R4-F002, and the regression check found it still present at v0.10.1. This PR closes both.Fix
Wrap each search query in
sql.begin(async sql => …)and useSET LOCAL:Two invariants hold now that didn't before:
sql.begin. TheSET LOCAL, the query, and the implicitCOMMITall run on the same connection, so the timeout really applies to the query we care about.SET LOCALis transaction-scoped;COMMITrestores the previous value automatically.ROLLBACK(triggered by any throw inside the callback) does the same. The explicitfinally { SET = 0 }is gone because it cannot be needed.No API change. Search semantics are identical: same 8s cap, same behaviour on valid and invalid queries, same return shape.
embed --all, bulk import, and other long-running paths were never inside these methods and are unaffected.Why a transaction rather than a second dedicated pool
A reserved single-connection pool for searches would also work but adds a second postgres.js client, a second set of connect / disconnect hooks, and a second thing to keep in sync with auth and RLS. A transaction wrapper is three extra lines of code, zero new config, and uses the library's intended primitive for scoped GUCs.
Reproduction
Before the fix, a concurrency smoke test makes the leak visible on a live pool:
With
sql.begin, the same crash path aborts the transaction on its own connection; the 8s GUC never survives the rollback.Sibling review
Grepped
src/core/postgres-engine.tsfor everySETstatement:SET statement_timeout = '8s'in searchKeywordSET LOCALinsidesql.beginSET statement_timeout = '0'in searchKeyword finallySET statement_timeout = '8s'in searchVectorSET LOCALinsidesql.beginSET statement_timeout = '0'in searchVector finallyNo other
SETstatements against the pool elsewhere insrc/. A grep forSET LOCALin the rest of the codebase returned nothing, so the new pattern does not collide with existing usage.Test results
Source-level guardrails —
test/postgres-engine.test.tsLive Postgres coverage lives in
test/e2e/search-quality.test.tsand will keep exercising these methods end-to-end whenDATABASE_URLis set. The new file is a fast, DB-free guardrail against future regressions:no bare SET statement_timeout statement survives— strips comments, then asserts nosql`SET statement_timeout`remains outside aSET LOCAL. If a future change reintroduces the anti-pattern, this turns red before review.searchKeyword wraps its query in sql.begin()— structural check on the method body.searchVector wraps its query in sql.begin()— same.SET LOCAL.SET statement_timeout = 0.All five pass on this branch. Against master (the vulnerable code) the same file fails 5/5 — the tests catch the exact shape of the bug.
Full suite
What stayed in place
searchKeyword,searchVector) is byte-identical in signature and return type.clampSearchLimitbehaviour preserved.max,idle_timeout,connect_timeoutcome fromconnect()as before.Files
How to verify