postgres-engine: scope search statement_timeout to the transaction#158

Open
garagon wants to merge 1 commit into garrytan:master from garagon:security/postgres-pool-timeout-leak

Conversation


@garagon garagon commented Apr 16, 2026

Summary

searchKeyword and searchVector in src/core/postgres-engine.ts are the two hot-path search methods. Both run on a shared postgres.js client whose default pool size is 10 connections, and both try to cap query runtime with a statement_timeout GUC so a bad query can't hog a connection forever:

await sql`SET statement_timeout = '8s'`;
try {
  const rows = await sql`<search query>`;
  return rows.map(rowToSearchResult);
} finally {
  await sql`SET statement_timeout = '0'`;
}

The intent is right. The implementation leaks because every tagged-template call is an independent round-trip that picks an arbitrary connection from the pool. Under concurrency — which is exactly the shape of a hot search path — the three statements can land on different connections:

  • The SET modifies connection A. Connection A returns to the pool with statement_timeout = '8s' still set.
  • The search query runs on connection B (pool picked a different one).
  • The finally reset lands on connection C.

What the next caller sees depends on which connection the pool hands them:

  • If they grab A, any long-running query they run gets clipped at 8s: the wrong cancellation for embed --all, large imports, admin queries, anything that's allowed to take time.
  • If they grab C, where the reset-to-zero landed, the guard is disabled, so they can no longer rely on the cap.

An error between the SET and the finally reset makes the leak permanent: the stale GUC sits on a pooled connection until the pool churns that connection out or the process restarts.
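The drift can be shown without a database. Below is a minimal sketch of the failure mode; the `Conn` model and round-robin checkout are illustrative stand-ins for postgres.js routing each top-level query to an arbitrary pool member, not the library's actual internals:

```typescript
// Hypothetical model: each "connection" carries its own statement_timeout,
// and every statement is routed to an arbitrary pool member.
type Conn = { id: string; statementTimeout: string };

const pool: Conn[] = [
  { id: "A", statementTimeout: "0" },
  { id: "B", statementTimeout: "0" },
  { id: "C", statementTimeout: "0" },
];

// Round-robin checkout stands in for the pool's arbitrary scheduling.
let next = 0;
const checkout = (): Conn => pool[next++ % pool.length];

// Request 1: SET lands on A, the query on B, the finally reset on C.
checkout().statementTimeout = "8s"; // SET statement_timeout = '8s'  -> conn A
checkout();                         // <search query>                -> conn B
checkout().statementTimeout = "0";  // finally: SET ... = '0'        -> conn C

// Request 2: an unrelated long-running query that happens to get conn A
// inherits the leaked 8s cap.
const unlucky = pool.find(c => c.statementTimeout === "8s")!;
console.log(`conn ${unlucky.id} leaked timeout: ${unlucky.statementTimeout}`);
// -> conn A leaked timeout: 8s
```

Three statements, three connections, and the one that was mutated first never sees the reset.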

Reported as R6-F006 in the latest audit; the earlier audit filed the same pattern on searchKeyword alone as R4-F002, and the regression check found it still present at v0.10.1. This PR closes both.

Fix

Wrap each search query in sql.begin(async sql => …) and use SET LOCAL:

const rows = await sql.begin(async sql => {
  await sql`SET LOCAL statement_timeout = '8s'`;
  return await sql`<search query>`;
});
return rows.map(rowToSearchResult);

Two invariants hold now that didn't before:

  1. One connection for the whole transaction. postgres.js reserves a single connection for the body of sql.begin. The SET LOCAL, the query, and the implicit COMMIT all run on the same connection, so the timeout really applies to the query we care about.
  2. Scoped GUC, no manual reset. SET LOCAL is transaction-scoped; COMMIT restores the previous value automatically, and ROLLBACK (triggered by any throw inside the callback) does the same. The explicit finally { SET = 0 } is gone because it is no longer needed.
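The restore semantics can be modeled in a few lines. This is a hypothetical `begin` wrapper, not postgres.js itself; in the real engine the restore is done server-side by Postgres at COMMIT/ROLLBACK:

```typescript
// Hypothetical model of SET LOCAL: the transaction holds one connection
// and the previous GUC value is restored on COMMIT and ROLLBACK alike.
type Conn = { statementTimeout: string };

function begin<T>(conn: Conn, body: (c: Conn) => T): T {
  const saved = conn.statementTimeout; // value before BEGIN
  try {
    return body(conn);                 // SET LOCAL + query, same conn
  } finally {
    conn.statementTimeout = saved;     // COMMIT or ROLLBACK restores it
  }
}

const conn: Conn = { statementTimeout: "0" };

// Happy path: the cap applies inside the transaction, gone after COMMIT.
begin(conn, c => { c.statementTimeout = "8s"; });
console.log(conn.statementTimeout); // "0"

// Crash path: the throw that used to leak the GUC now cannot.
try {
  begin(conn, c => { c.statementTimeout = "8s"; throw new Error("boom"); });
} catch {}
console.log(conn.statementTimeout); // "0" again
```

Every exit path funnels through the same restore, which is exactly the property the manual finally-reset failed to provide across a pool.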

No API change. Search semantics are identical: same 8s cap, same behaviour on valid and invalid queries, same return shape. embed --all, bulk import, and other long-running paths were never inside these methods and are unaffected.

Why a transaction rather than a second dedicated pool

A reserved single-connection pool for searches would also work but adds a second postgres.js client, a second set of connect / disconnect hooks, and a second thing to keep in sync with auth and RLS. A transaction wrapper is three extra lines of code, zero new config, and uses the library's intended primitive for scoped GUCs.

Reproduction

Before the fix, a concurrency smoke test makes the leak visible on a live pool:

// SETUP: one client; its small pool is shared by concurrent requests
const sql = postgres(url, { max: 10 });

// Request A: SET runs, then the request dies; no reset reaches conn A
await sql`SET statement_timeout = '8s'`;
throw new Error('boom');  // the reset never runs on that connection

// Request B: runs a long query that should be allowed
await sql`SELECT pg_sleep(30)`;
// → query cancelled at 8s. That's the leak.

With sql.begin, the same crash path aborts the transaction on its own connection; the 8s GUC never survives the rollback.

Sibling review

Grepped src/core/postgres-engine.ts for every SET statement:

  Line (pre-fix)  Statement                                              Scope             Status
  195             SET statement_timeout = '8s' (searchKeyword)           pool-wide (leak)  now SET LOCAL inside sql.begin
  226             SET statement_timeout = '0'  (searchKeyword finally)   pool-wide (leak)  removed; COMMIT resets
  245             SET statement_timeout = '8s' (searchVector)            pool-wide (leak)  now SET LOCAL inside sql.begin
  265             SET statement_timeout = '0'  (searchVector finally)    pool-wide (leak)  removed; COMMIT resets

No other SET statements against the pool elsewhere in src/. A grep for SET LOCAL in the rest of the codebase returned nothing, so the new pattern does not collide with existing usage.

Test results

Source-level guardrails — test/postgres-engine.test.ts

Live Postgres coverage lives in test/e2e/search-quality.test.ts and will keep exercising these methods end-to-end when DATABASE_URL is set. The new file is a fast, DB-free guardrail against future regressions:

  1. no bare SET statement_timeout statement survives — strips comments, then asserts no sql`SET statement_timeout` remains outside a SET LOCAL. If a future change reintroduces the anti-pattern, this turns red before review.
  2. searchKeyword wraps its query in sql.begin() — structural check on the method body.
  3. searchVector wraps its query in sql.begin() — same.
  4. Both methods use SET LOCAL.
  5. Neither method explicitly clears the timeout with SET statement_timeout = 0.
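Guardrail #1 can be sketched as follows. This is an assumed shape, not the shipped test file; the comment-stripping regexes and the `hasBareSet` helper are illustrative:

```typescript
// Sketch: strip comments, then flag any SET statement_timeout that is
// not a transaction-scoped SET LOCAL.
const stripComments = (src: string): string =>
  src.replace(/\/\*[\s\S]*?\*\//g, "").replace(/\/\/.*$/gm, "");

const hasBareSet = (src: string): boolean =>
  /SET\s+(?!LOCAL\s)statement_timeout/i.test(stripComments(src));

const vulnerable   = "await sql`SET statement_timeout = '8s'`;";
const fixed        = "await sql`SET LOCAL statement_timeout = '8s'`;";
const commentedOut = "// await sql`SET statement_timeout = '8s'`;";

console.log(hasBareSet(vulnerable));   // true  -> test turns red
console.log(hasBareSet(fixed));        // false
console.log(hasBareSet(commentedOut)); // false -> comments don't count
```

Running a check like this over the real source of postgres-engine.ts is what lets the suite stay DB-free while still catching a reintroduced bare SET before review.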

All five pass on this branch. Against master (the vulnerable code) the same file fails 5/5 — the tests catch the exact shape of the bug.

Full suite

bun test
 846 pass
 131 skip     (E2E without DATABASE_URL)
   0 fail
Ran 977 tests across 53 files.

What stayed in place

  • Public engine API (searchKeyword, searchVector) is byte-identical in signature and return type.
  • 8-second timeout and clampSearchLimit behaviour preserved.
  • Pool configuration untouched: max, idle_timeout, connect_timeout come from connect() as before.
  • No new dependency.

Files

 src/core/postgres-engine.ts  |  36 ++++++++---------
 test/postgres-engine.test.ts | 112 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 132 insertions(+), 16 deletions(-)

How to verify

git checkout security/postgres-pool-timeout-leak
bun install
bun test test/postgres-engine.test.ts   # 5 pass
bun test                                  # 846 pass, 0 fail
# Optional, requires DATABASE_URL:
bun run test:e2e                          # search-quality suite hits the new code paths

garrytan added a commit that referenced this pull request Apr 19, 2026
Master shipped v0.12.1 (extract N+1 + migration timeout) and v0.12.2
(JSONB double-encode + splitBody + wiki types + parseEmbedding) while
this wave was mid-flight. Ships the remaining pieces as v0.12.3:

- sync deadlock (#132, @sunnnybala)
- statement_timeout scoping (#158, @garagon)
- Obsidian wikilinks + domain patterns (#187 slice, @knee5)
- gbrain orphans command (#187 slice, @knee5)
- tryParseEmbedding() availability helper
- doctor detection for jsonb_integrity + markdown_body_completeness

No schema, no migration, no data touch.

Co-Authored-By: Claude Opus 4.7 <[email protected]>
garrytan added a commit that referenced this pull request Apr 19, 2026
…kilinks, orphans (#216)

* fix(sync): remove nested transaction that deadlocks > 10 file syncs

sync.ts wraps the add/modify loop in engine.transaction(), and each
importFromContent inside opens another one. PGLite's
_runExclusiveTransaction is a non-reentrant mutex — the second call
queues on the mutex the first is holding, and the process hangs forever
in ep_poll. Reproduced with a 15-file commit: unpatched hangs, patched
runs in 3.4s. Fix drops the outer wrap; per-file atomicity is correct
anyway (one file's failure should not roll back the others).

(cherry picked from commit 4a1ac00)

* test(sync): regression guard for #132 top-level engine.transaction wrap

Reads src/commands/sync.ts verbatim and asserts no uncommented
engine.transaction() call appears above the add/modify loop. Protects
against silent reintroduction of the nested-mutex deadlock that hung
> 10-file syncs forever in ep_poll.

* feat(utils): tryParseEmbedding() skip+warn sibling for availability path

parseEmbedding() throws on structural corruption — right call for ingest/
migrate paths where silent skips would be data loss. Wrong call for
search/rescore paths where one corrupt row in 10K would kill every
query that touches it.

tryParseEmbedding() wraps parseEmbedding in try/catch: returns null on
any shape that would throw, warns once per session so the bad row is
visible in logs. Use it anywhere we'd rather degrade ranking than blow
up the whole query.
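A hedged sketch of the wrapper this commit describes (bodies are illustrative; the real helpers live in src/core/utils.ts and parse pgvector storage, not bare JSON):

```typescript
// Hypothetical stand-in for the throwing parser used on ingest/migrate paths.
function parseEmbedding(raw: string): number[] {
  const parsed = JSON.parse(raw);
  if (!Array.isArray(parsed) || parsed.some(v => typeof v !== "number")) {
    throw new Error("corrupt embedding"); // ingest/migrate want the throw
  }
  return parsed;
}

let warned = false;
function tryParseEmbedding(raw: string): number[] | null {
  try {
    return parseEmbedding(raw);
  } catch (err) {
    if (!warned) {           // warn once per session, then stay quiet
      warned = true;
      console.warn(`skipping corrupt embedding: ${(err as Error).message}`);
    }
    return null;             // degrade ranking instead of killing the query
  }
}

console.log(tryParseEmbedding("[0.1,0.2]"));  // [ 0.1, 0.2 ]
console.log(tryParseEmbedding("not json"));   // null (plus one warning)
```

One corrupt row yields a null that the rescore loop can skip, instead of an exception that fails every query touching it.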

Retrofit postgres-engine.getEmbeddingsByChunkIds (the #175 slice call
site) — the 5-line rescore loop was the direct motivator. Keep the
throwing parseEmbedding() for everything else (pglite-engine rowToChunk,
migrate-engine round-trips, ingest).

* postgres-engine: scope search statement_timeout to the transaction

searchKeyword and searchVector run on a pooled postgres.js client
(max: 10 by default). The original code bounded each search with

  await sql`SET statement_timeout = '8s'`
  try { await sql`<query>` }
  finally { await sql`SET statement_timeout = '0'` }

but every tagged template is an independent round-trip that picks an
arbitrary connection from the pool. The SET, the query, and the reset
could all land on DIFFERENT connections. In practice the GUC sticks
to whichever connection ran the SET and then gets returned to the
pool — the next unrelated caller on that connection inherits the 8s
timeout (clipping legitimate long queries) or the reset-to-0 (disabling
the guard for whoever expected it). A crash in the middle leaves the
state set permanently.

Wrap each search in sql.begin(async sql => …). postgres.js reserves
a single connection for the transaction body, so the SET LOCAL, the
query, and the implicit COMMIT all run on the same connection. SET
LOCAL scopes the GUC to the transaction — COMMIT or ROLLBACK restores
the previous value automatically, regardless of the code path out.
Error paths can no longer leak the GUC.

No API change. Timeout value and semantics are identical (8s cap on
search queries, no effect on embed --all / bulk import which runs
outside these methods). Only one transaction per search — BEGIN +
COMMIT round-trips are negligible next to a ranked FTS or pgvector
query.

Also closes the earlier audit finding R4-F002 which reported the same
pattern on searchKeyword. This PR covers both searchKeyword and
searchVector so the pool-leak class is fully closed.

Tests (test/postgres-engine.test.ts, new file):
- No bare SET statement_timeout remains after stripping comments.
- searchKeyword and searchVector each wrap their query in sql.begin.
- Both use SET LOCAL.
- Neither explicitly clears the timeout with SET statement_timeout=0.

Source-level guardrails keep the fast unit suite DB-free. Live
Postgres coverage of the search path is in test/e2e/search-quality.test.ts,
which continues to exercise these methods end-to-end against
pgvector when DATABASE_URL is set.

(cherry picked from commit 6146c3b)

* feat(orphans): add gbrain orphans command for finding under-connected pages

Surfaces pages with zero inbound wikilinks. Essential for content
enrichment cycles in KBs with 1000+ pages. By default filters out
auto-generated pages, raw sources, and pseudo-pages where no inbound
links is expected; --include-pseudo to disable.

Supports text (grouped by domain), --json, --count outputs.
Also exposed as find_orphans MCP operation.

Tests cover basic detection, filtering, all output modes.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
(cherry picked from commit f50954f)

* feat(extract): support Obsidian wikilinks + wiki-style domain slugs in canonical extractor

extractEntityRefs now recognizes both syntaxes equally:
  [Name](people/slug)      -- upstream original
  [[people/slug|Name]]     -- Obsidian wikilink (new)

Extends DIR_PATTERN to include domain-organized wiki slugs used by
Karpathy-style knowledge bases:
  - entities  (legacy prefix some brains keep during migration)
  - projects  (gbrain canonical, was missing from regex)
  - tech, finance, personal, openclaw (domain-organized wiki roots)

Before this change, a 2,100-page brain with wikilinks throughout extracted
zero auto-links on put_page because the regex only matched markdown-style
[name](path). After: 1,377 new typed edges on a single extract --source db
pass over the same corpus.

Matches the behavior of the extract.ts filesystem walker (which already
handled wikilinks as of the wiki-markdown-compat fix wave), so the db and
fs sources now produce the same link graph from the same content.

Both patterns share the DIR_PATTERN constant so adding a new entity dir
only requires updating one string.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
(cherry picked from commit 1cfb156)

* feat(doctor): jsonb_integrity + markdown_body_completeness detection

Add two v0.12.1-era reliability checks to `gbrain doctor`:

- `jsonb_integrity` scans the 4 known write sites from the v0.12.0
  double-encode bug (pages.frontmatter, raw_data.data,
  ingest_log.pages_updated, files.metadata) and reports rows where
  jsonb_typeof(col) = 'string'. The fix hint points at
  `gbrain repair-jsonb` (the standalone repair command shipped in
  v0.12.1).

- `markdown_body_completeness` flags pages whose compiled_truth is
  <30% of the raw source content length when raw has multiple H2/H3
  boundaries. Heuristic only; suggests `gbrain sync --force` or
  `gbrain import --force <slug>`.

Also adds test/e2e/jsonb-roundtrip.test.ts — the regression coverage
that should have caught the original double-encode bug. Hits all four
write sites against real Postgres and asserts jsonb_typeof='object'
plus `->>'key'` returns the expected scalar.

Detection only: doctor diagnoses, `gbrain repair-jsonb` treats.
No overlap with the standalone repair path.

* chore: bump to v0.12.3 + changelog (reliability wave)

Master shipped v0.12.1 (extract N+1 + migration timeout) and v0.12.2
(JSONB double-encode + splitBody + wiki types + parseEmbedding) while
this wave was mid-flight. Ships the remaining pieces as v0.12.3:

- sync deadlock (#132, @sunnnybala)
- statement_timeout scoping (#158, @garagon)
- Obsidian wikilinks + domain patterns (#187 slice, @knee5)
- gbrain orphans command (#187 slice, @knee5)
- tryParseEmbedding() availability helper
- doctor detection for jsonb_integrity + markdown_body_completeness

No schema, no migration, no data touch.

Co-Authored-By: Claude Opus 4.7 <[email protected]>

* docs: update project documentation for v0.12.3

CLAUDE.md:
- Add src/commands/orphans.ts entry
- Expand src/commands/doctor.ts with v0.12.3 jsonb_integrity +
  markdown_body_completeness check descriptions
- Update src/core/link-extraction.ts to mention Obsidian wikilinks +
  extended DIR_PATTERN (entities/projects/tech/finance/personal/openclaw)
- Update src/core/utils.ts to mention tryParseEmbedding sibling
- Update src/core/postgres-engine.ts to note statement_timeout scoping +
  tryParseEmbedding usage in getEmbeddingsByChunkIds
- Add Key commands added in v0.12.3 section (orphans, doctor checks)
- Add test/orphans.test.ts, test/postgres-engine.test.ts, updated
  descriptions for test/sync.test.ts, test/doctor.test.ts,
  test/utils.test.ts
- Add test/e2e/jsonb-roundtrip.test.ts with note on intentional overlap
- Bump operation count from ~36 to ~41 (find_orphans shipped in v0.12.3)

README.md:
- Add gbrain orphans to ADMIN commands block

Co-Authored-By: Claude Opus 4.7 <[email protected]>

---------

Co-authored-by: sunnnybala <[email protected]>
Co-authored-by: Gustavo Aragon <[email protected]>
Co-authored-by: Clevin Canales <[email protected]>
Co-authored-by: Claude Sonnet 4.6 <[email protected]>
Co-authored-by: Clevin Canales <[email protected]>