feat: custom join scan planning (with simple execution)#3930

Merged: mdashti merged 123 commits into main from moe/join-planning on Jan 22, 2026

@mdashti mdashti commented Jan 15, 2026

Ticket(s) Closed

  • Closes #N/A

What

Implements the planning infrastructure for JoinScan, a PostgreSQL custom scan operator for JOIN queries with BM25 full-text search predicates. Includes a basic execution implementation to validate the planning logic.

Why

When joining tables where one side has a BM25 search predicate and the query has a LIMIT, PostgreSQL's native planner doesn't know that Tantivy can efficiently return top-N results (in score order). This PR lays the groundwork for optimized join execution by:

  1. Detecting join opportunities where BM25 indexes can accelerate the query
  2. Extracting and serializing all necessary planning information for execution
  3. Integrating with PostgreSQL's cost model and pathkey system

How

Planning (main focus):

  • Detects INNER JOINs with LIMIT where at least one side has a BM25 predicate
  • Extracts equi-join keys with type information (INTEGER, TEXT, UUID, NUMERIC, composite)
  • Extracts join-level predicates (OR/AND/NOT spanning both tables) into a serializable expression tree
  • Declares pathkeys for ORDER BY score to eliminate Sort nodes
  • Integrates with PostgreSQL's custom scan infrastructure (create_custom_path, plan_custom_path)
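To illustrate the join-level predicate extraction listed above, here is a minimal sketch of a boolean expression tree that can be evaluated against both sides of a join. The names `JoinExpr`, `Side`, and `predicate_idx` are hypothetical and not taken from this PR; the actual tree in `pg_search` may be shaped differently, and serialization is left out to keep the sketch dependency-free.

```rust
/// Which side of the join a leaf predicate refers to (hypothetical name).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Side {
    Outer,
    Inner,
}

/// Hypothetical expression tree for predicates that span both tables.
#[derive(Debug, Clone, PartialEq)]
enum JoinExpr {
    /// A single-table predicate, identified by its index on that side.
    Leaf { side: Side, predicate_idx: usize },
    And(Vec<JoinExpr>),
    Or(Vec<JoinExpr>),
    Not(Box<JoinExpr>),
}

impl JoinExpr {
    /// Evaluate the tree given precomputed per-side predicate results.
    fn eval(&self, outer: &[bool], inner: &[bool]) -> bool {
        match self {
            JoinExpr::Leaf { side: Side::Outer, predicate_idx } => outer[*predicate_idx],
            JoinExpr::Leaf { side: Side::Inner, predicate_idx } => inner[*predicate_idx],
            JoinExpr::And(children) => children.iter().all(|c| c.eval(outer, inner)),
            JoinExpr::Or(children) => children.iter().any(|c| c.eval(outer, inner)),
            JoinExpr::Not(child) => !child.eval(outer, inner),
        }
    }
}
```

Keeping leaves as plain indices (rather than pointers into planner structures) is one way to make such a tree easy to serialize between planning and execution; that design choice is an assumption of this sketch.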

Execution (simple/unoptimized):

  • Basic driving-side hash join to validate planning correctness
  • TopN executor for streaming Tantivy results with LIMIT
  • Hash table with work_mem limit and nested loop fallback
  • Visibility checking for stale ctids after UPDATE
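As a rough illustration of the driving-side hash join with a top-N cutoff described above, the sketch below builds an in-memory hash table from one side and probes it with driving-side rows that are assumed to arrive in descending score order. The function name `top_n_hash_join` and the row shapes are hypothetical; the `work_mem` limit, nested loop fallback, and visibility checks from the PR are not modeled.

```rust
use std::collections::HashMap;

/// `driving`: (join_key, score) rows, assumed already sorted by descending score.
/// `build`: (join_key, payload) rows from the other side of the join.
/// Returns at most `limit` joined rows, preserving the driving side's score order.
fn top_n_hash_join<'a>(
    driving: &[(i64, f32)],
    build: &[(i64, &'a str)],
    limit: usize,
) -> Vec<(f32, &'a str)> {
    // Build phase: group build-side payloads by join key.
    let mut table: HashMap<i64, Vec<&'a str>> = HashMap::new();
    for &(key, payload) in build {
        table.entry(key).or_default().push(payload);
    }

    // Probe phase: stream driving rows and stop once `limit` rows are emitted.
    let mut out = Vec::new();
    for &(key, score) in driving {
        if let Some(matches) = table.get(&key) {
            for &payload in matches {
                if out.len() == limit {
                    return out;
                }
                out.push((score, payload));
            }
        }
        if out.len() == limit {
            break;
        }
    }
    out
}
```

Because the driving side is already in score order, the probe loop can stop early without a separate sort step, which is the property the pathkey declaration is meant to expose to the planner.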

Tests

  • pg_regress: Added test cases covering planning scenarios (join types, key types, predicates, cross joins, ORDER BY score)
  • Concurrent updates: Validates visibility handling under concurrent UPDATEs
  • Property-based (qgen): Generated queries comparing JoinScan results against PostgreSQL's native joins

@mdashti mdashti changed the base branch from main to stuhood.join-scan-skeleton January 15, 2026 09:46
@mdashti mdashti added the Do Not Cherry Pick PR should not be cherry-picked to other branches label Jan 15, 2026
- Add CompositeKey enum to store actual key values (not just hashes)
- Add KeyValue struct to store copied datum bytes for any PostgreSQL type
- Add JoinKeyInfo struct for runtime key extraction info
- Update JoinKeyPair to include type_oid, typlen, typbyval
- Update extract_join_conditions to capture type info from Vars
- Replace i64 hash key with CompositeKey for correct equality comparison
- Remove CROSS_JOIN_KEY constant in favor of CompositeKey::CrossJoin
- Add extract_composite_key and copy_datum_to_key_value helper functions
- Support varlena (TEXT, BYTEA), cstring, and fixed-length types

This fixes:
- Issue 1: Hash table key type limitation (was i64 only)
- Issue 2: Single join key only (now supports composite keys)
- Issue 3: Cross-join magic key collision (now uses distinct enum variant)
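The three fixes above can be pictured with a small sketch. `CompositeKey`, `KeyValue`, and the `CrossJoin` variant are named in the commit message; the `Values` variant, the field layout, and the helper functions below are assumptions made for illustration and may not match the actual code.

```rust
use std::collections::HashMap;

/// Copied datum bytes for one join key column (field layout is assumed).
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct KeyValue {
    bytes: Vec<u8>,
}

/// Hash table key holding actual values rather than a precomputed i64 hash.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum CompositeKey {
    /// Distinct variant for cross joins, so no real key can collide with it.
    CrossJoin,
    /// One entry per equi-join column (hypothetical variant name).
    Values(Vec<KeyValue>),
}

/// Build a composite key by copying the raw bytes of each key column.
fn key_from_parts(parts: &[&[u8]]) -> CompositeKey {
    CompositeKey::Values(
        parts
            .iter()
            .map(|p| KeyValue { bytes: p.to_vec() })
            .collect(),
    )
}

/// Count rows per key; equality is decided on the stored bytes, not on a hash.
fn count_by_key(keys: &[CompositeKey]) -> HashMap<CompositeKey, usize> {
    let mut counts = HashMap::new();
    for key in keys {
        *counts.entry(key.clone()).or_insert(0) += 1;
    }
    counts
}
```

Storing the values themselves means two rows whose hashes happen to collide still compare as unequal, and the cross-join case no longer needs a reserved magic number.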
- Add extract_non_equijoin_quals helper to filter restrictlist
- Initialize join_qual_state and join_qual_econtext in begin_custom_scan
- Only create qual state when has_other_conditions is true
- Skip equi-join conditions (Var = Var) that are handled by hash lookup

This fixes Issue 4: join_qual_state was never initialized, causing
non-equijoin predicates to be silently ignored during execution.
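A simplified model of the filtering step described above: equi-join conditions of the form `Var = Var` across two relations are dropped because the hash lookup already enforces them, and whatever remains must be evaluated as a join qual. The `Operand` and `Qual` types and the function names are hypothetical stand-ins, not PostgreSQL's or this PR's actual data structures.

```rust
/// Simplified stand-in for an expression operand (hypothetical).
#[derive(Debug, Clone, PartialEq)]
enum Operand {
    /// Column reference: relation index and attribute number.
    Var { rel: u32, attno: i16 },
    Const(i64),
}

/// Simplified stand-in for one binary restriction clause (hypothetical).
#[derive(Debug, Clone, PartialEq)]
struct Qual {
    op: &'static str,
    left: Operand,
    right: Operand,
}

/// True for `Var = Var` where the two columns come from different relations.
fn is_equi_join(q: &Qual) -> bool {
    match (&q.left, &q.right) {
        (Operand::Var { rel: l, .. }, Operand::Var { rel: r, .. }) => q.op == "=" && l != r,
        _ => false,
    }
}

/// Keep only the clauses that the hash lookup does not already enforce.
fn non_equijoin_quals(restrictlist: &[Qual]) -> Vec<Qual> {
    restrictlist
        .iter()
        .filter(|q| !is_equi_join(q))
        .cloned()
        .collect()
}
```

In this model, a flag like `has_other_conditions` would simply be whether the returned list is non-empty, which matches the commit's note that qual state is only created when it is needed.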
Previously, if extract_join_level_conditions failed, it would silently
return None. Now it logs a debug1 message to help with troubleshooting
why JoinScan wasn't proposed for a particular query.

These methods were never called - PostgreSQL's Limit node handles limiting.
Remove the dead code rather than keeping it with #[allow(dead_code)].

Factor out the join-level predicate evaluation logic into a reusable
helper method used by both hash join and nested loop execution paths.
This eliminates ~40 lines of duplicated code.

@stuhood stuhood left a comment

Thanks Moe!

This is awesome work, but it would be great to simplify/restrict it before landing.

Comment thread pg_search/src/postgres/customscan/joinscan/planning.rs Outdated
Comment thread pg_search/src/postgres/customscan/joinscan/build.rs Outdated
Comment thread pg_search/src/postgres/customscan/joinscan/build.rs Outdated
Comment thread pg_search/src/postgres/customscan/joinscan/mod.rs
Comment thread pg_search/src/postgres/customscan/joinscan/build.rs Outdated
Comment thread pg_search/src/postgres/customscan/joinscan/mod.rs Outdated

@mdashti mdashti left a comment

@stuhood Thanks for the review.

Comment thread pg_search/src/postgres/customscan/joinscan/planning.rs Outdated
Comment thread pg_search/src/postgres/customscan/joinscan/build.rs Outdated
Comment thread pg_search/src/postgres/customscan/joinscan/build.rs Outdated
Comment thread pg_search/src/postgres/customscan/joinscan/mod.rs
Comment thread pg_search/src/postgres/customscan/joinscan/build.rs Outdated
Comment thread pg_search/src/postgres/customscan/joinscan/mod.rs Outdated
@mdashti mdashti requested a review from stuhood January 22, 2026 01:21

@stuhood stuhood left a comment

Thanks a lot Moe!

Let's get it in and iterate.

Comment thread pg_search/src/postgres/customscan/joinscan/scan_state.rs Outdated
Comment thread pg_search/src/postgres/customscan/joinscan/scan_state.rs Outdated
Comment thread pg_search/src/postgres/customscan/joinscan/scan_state.rs Outdated
Comment thread pg_search/src/postgres/customscan/joinscan/scan_state.rs Outdated
Comment thread pg_search/src/postgres/customscan/joinscan/mod.rs Outdated
Comment thread pg_search/src/postgres/customscan/joinscan/mod.rs Outdated
Comment thread pg_search/src/postgres/heap.rs Outdated
Comment thread pg_search/src/postgres/customscan/joinscan/mod.rs Outdated
@mdashti mdashti merged commit b47b50b into main Jan 22, 2026
18 checks passed
@mdashti mdashti deleted the moe/join-planning branch January 22, 2026 21:16
stuhood added a commit that referenced this pull request Jan 28, 2026
)

## What

This change replaces the hash join execution method added in #3930
with an explicit use of DataFusion's hash join.

Future changes will introduce DataFusion's optimizer (by producing
logical nodes rather than physical nodes) so that it can take advantage
of the sorted segments which will be provided by #3988.

## Why

As described in #3930, the implementation there was explicitly
temporary. We will be leaning into DataFusion to execute columnar
joins.

## Tests

The regression tests and proptests pass, with one exception: numeric
columns cannot safely be pulled up from fast fields currently (see
#2968).

Additionally, improved `qgen`'s handling of panics.
Labels

Do Not Cherry Pick PR should not be cherry-picked to other branches
