chore: Use DataFusion's optimizer in the join scan#4009

Merged
stuhood merged 1 commit into main from stuhood.optimized-plan on Feb 2, 2026
Collaborator

@stuhood stuhood commented Jan 28, 2026

Ticket(s) Closed

What

This change migrates the joinscan from manual physical plan construction to using DataFusion logical plans, which are then optimized into physical plans.

To do so, it adds a TableProvider implementation to allow DataFusion to natively scan and filter using the previously extracted scan over fast fields.
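
As a rough illustration of why handing scans and filters to an optimizer helps, here is a toy model of filter pushdown in plain Rust. The `Plan` enum and `push_down_filters` function are hypothetical names invented for this sketch; they are not DataFusion's API and not the code in this PR. The sketch only shows the general idea that a filter sitting directly above a scan can be folded into the scan, while a filter above a limit has to stay where it is.

```rust
// Toy model only: these types are hypothetical and are NOT DataFusion's API.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    /// Scan of a table, carrying any predicates the scan evaluates itself.
    Scan { table: String, pushed_filters: Vec<String> },
    /// Filter evaluated above its input.
    Filter { predicate: String, input: Box<Plan> },
    /// Keep only the first `n` rows of the input.
    Limit { n: usize, input: Box<Plan> },
}

/// A single optimizer rule: fold a Filter that sits directly on a Scan
/// into that Scan. Filters above a Limit are left in place, because
/// filtering before limiting would change the result.
fn push_down_filters(plan: Plan) -> Plan {
    match plan {
        Plan::Filter { predicate, input } => match push_down_filters(*input) {
            Plan::Scan { table, mut pushed_filters } => {
                pushed_filters.push(predicate);
                Plan::Scan { table, pushed_filters }
            }
            other => Plan::Filter { predicate, input: Box::new(other) },
        },
        Plan::Limit { n, input } => Plan::Limit {
            n,
            input: Box::new(push_down_filters(*input)),
        },
        scan @ Plan::Scan { .. } => scan,
    }
}
```

Applied to a `Limit` over a `Filter` over a `Scan`, the rule returns a `Limit` over a `Scan` whose `pushed_filters` holds the predicate. In a real engine, the scan implementation decides which predicates it can accept.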

It also rearranges the code to separate planning from execution: execution state and DataFusion logical planning are isolated to scan_state.rs. Over time we might frontload and serialize more of DataFusion's logical plan during planning, but for now none of that information is needed in the CustomPath that we produce. JoinSideInfo was lifted up into pg_search/src/scan as ScanInfo.

Why

In followup changes, we will give DataFusion's optimizer a lot more to work with:

Tests

Existing tests pass, and were lightly expanded to cover some new edge cases.

One join benchmark is marginally faster, one is massively slower, others are unaffected. This is expected for now: more performance work is on the way.

@stuhood added the `benchmark-queries` and `Do Not Cherry Pick` labels Jan 28, 2026
@github-actions bot removed the `benchmark-queries` label Jan 28, 2026
Contributor

@github-actions github-actions bot left a comment

pg_search 'logs' Query Performance

| Benchmark suite | Current: a7dcac7 (median ms) | Previous: c0c03ea (median ms) | Ratio |
|---|---|---|---|
| bucket-expr-filter | 7978.3885 | 7988.855 | 1.00 |
| bucket-expr-filter - alternative 1 | 7904.639499999999 | 7895.584 | 1.00 |
| bucket-numeric-filter | 2083.355 | 2138.3869999999997 | 0.97 |
| bucket-numeric-filter - alternative 1 | 275.3455 | 282.9895 | 0.97 |
| bucket-numeric-filter - alternative 2 | 104.254 | 93.3485 | 1.12 |
| bucket-numeric-filter - alternative 3 | 275.9875 | 286.312 | 0.96 |
| bucket-numeric-filter - alternative 4 | 413.4015 | 421.4335 | 0.98 |
| bucket-numeric-nofilter | 2083.3 | 2126.266 | 0.98 |
| bucket-numeric-nofilter - alternative 1 | 274.188 | 284.9495 | 0.96 |
| bucket-numeric-nofilter - alternative 2 | 103.832 | 91.10050000000001 | 1.14 |
| bucket-numeric-nofilter - alternative 3 | 276.20500000000004 | 285.5245 | 0.97 |
| bucket-numeric-nofilter - alternative 4 | 414.61800000000005 | 501.655 | 0.83 |
| bucket-string-filter | 3220.92 | 3145.9365 | 1.02 |
| bucket-string-filter - alternative 1 | 234.222 | 251.256 | 0.93 |
| bucket-string-filter - alternative 2 | 67.6215 | 63.38 | 1.07 |
| bucket-string-filter - alternative 3 | 235.495 | 256.0325 | 0.92 |
| bucket-string-filter - alternative 4 | 363.77099999999996 | 372.4565 | 0.98 |
| bucket-string-nofilter | 3217.401 | 3152.049 | 1.02 |
| bucket-string-nofilter - alternative 1 | 233.7165 | 253.49200000000002 | 0.92 |
| bucket-string-nofilter - alternative 2 | 67.3075 | 62.6375 | 1.07 |
| bucket-string-nofilter - alternative 3 | 238.3095 | 256.286 | 0.93 |
| bucket-string-nofilter - alternative 4 | 362.72299999999996 | 372.3595 | 0.97 |
| cardinality | 15036.7125 | 15078.462 | 1.00 |
| cardinality - alternative 1 | 1985.4895000000001 | 2040.4175 | 0.97 |
| cardinality - alternative 2 | 272.46299999999997 | 284.881 | 0.96 |
| cardinality - alternative 3 | 103.939 | 92.691 | 1.12 |
| cardinality - alternative 4 | 274.2915 | 286.605 | 0.96 |
| cardinality - alternative 5 | 276.2255 | 286.668 | 0.96 |
| count-filter | 211.2425 | 214.6745 | 0.98 |
| count-filter - alternative 1 | 125.467 | 126.6285 | 0.99 |
| count-filter - alternative 2 | 88.71100000000001 | 86.721 | 1.02 |
| count-filter - alternative 3 | 125.9745 | 128.86950000000002 | 0.98 |
| count-filter - alternative 4 | 125.016 | 126.41149999999999 | 0.99 |
| count-nofilter | 746.6859999999999 | 712.9960000000001 | 1.05 |
| count-nofilter - alternative 1 | 280.2415 | 300.068 | 0.93 |
| count-nofilter - alternative 2 | 155.8965 | 151.713 | 1.03 |
| count-nofilter - alternative 3 | 280.07849999999996 | 302.578 | 0.93 |
| count-nofilter - alternative 4 | 280.8695 | 302.63300000000004 | 0.93 |
| filtered-highcard | 5.867 | 5.919499999999999 | 0.99 |
| filtered-lowcard | 5.8095 | 5.777 | 1.01 |
| filtered_json-range | 7.4495000000000005 | 7.3085 | 1.02 |
| filtered_json | 5.8405000000000005 | 5.955500000000001 | 0.98 |
| highlighting | 8.508 | 8.3375 | 1.02 |
| regex-and-heap | 6009.5064999999995 | 6083.242 | 0.99 |
| top_n-agg-avg | 426.5395 | 431.987 | 0.99 |
| top_n-agg-bucket-string | 388.071 | 392.05949999999996 | 0.99 |
| top_n-agg-count | 431.08349999999996 | 433.543 | 0.99 |
| top_n-compound | 76.6485 | 72.4365 | 1.06 |
| top_n-numeric-highcard | 59.083 | 55.4505 | 1.07 |
| top_n-numeric-lowcard | 43.952 | 40.815 | 1.08 |
| top_n-score-asc | 93.7445 | 91.1795 | 1.03 |
| top_n-score-desc | 90.79599999999999 | 79.4495 | 1.14 |
| top_n-string | 44.8765 | 43.634 | 1.03 |

This comment was automatically generated by workflow using github-action-benchmark.

Contributor

@github-actions github-actions bot left a comment

pg_search 'docs' Query Performance

| Benchmark suite | Current: a7dcac7 (median ms) | Previous: c0c03ea (median ms) | Ratio |
|---|---|---|---|
| aggregate_sort | 13753.4915 | 13782.232499999998 | 1.00 |
| aggregate_sort - alternative 1 | 13774.423999999999 | 13900.938999999998 | 0.99 |
| disjunctive_search | 562.0955 | 559.8895 | 1.00 |
| disjunctive_search - alternative 1 | 565.6645 | 594.2165 | 0.95 |
| distinct_parent_sort | 3421.5685000000003 | 3622.2825000000003 | 0.94 |
| distinct_parent_sort - alternative 1 | 3457.191 | 3614.009 | 0.96 |
| foreign_filter_local_sort | 147.711 | 142.542 | 1.04 |
| foreign_filter_local_sort - alternative 1 | 4591.700999999999 | 7086.736000000001 | 0.65 |
| hierarchical_content-no-scores-large | 1194.5214999999998 | 1179.8355 | 1.01 |
| hierarchical_content-no-scores-small | 647.4580000000001 | 660.4815 | 0.98 |
| hierarchical_content-scores-large | 1470.1855 | 1458.6545 | 1.01 |
| hierarchical_content-scores-large - alternative 1 | 713.669 | 722.751 | 0.99 |
| hierarchical_content-scores-small | 683.6655000000001 | 689.8810000000001 | 0.99 |
| paging-string-max | 20.837 | 19.355 | 1.08 |
| paging-string-median | 42.73950000000001 | 42.199 | 1.01 |
| paging-string-min | 53.372 | 53.232 | 1.00 |
| permissioned_search | 704.25 | 719.5085 | 0.98 |
| permissioned_search - alternative 1 | 27483.391 | 1560.5149999999999 | 17.61 |
| semi_join_filter | 593.473 | 588.9515 | 1.01 |
| semi_join_filter - alternative 1 | 50969.813500000004 | 31780.788999999997 | 1.60 |

This comment was automatically generated by workflow using github-action-benchmark.

@stuhood force-pushed the stuhood.optimized-plan branch from b0043b7 to b7ae6b8 on January 28, 2026 05:22
@stuhood added the `benchmark-queries` label Jan 28, 2026
@github-actions bot removed the `benchmark-queries` label Jan 28, 2026
@stuhood force-pushed the stuhood.optimized-plan branch from b7ae6b8 to 72d412b on January 28, 2026 05:48
@stuhood added the `benchmark-queries` label Jan 28, 2026
@github-actions bot removed the `benchmark-queries` label Jan 28, 2026
Contributor

@github-actions github-actions bot left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'pg_search 'docs' Query Performance'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.10.

| Benchmark suite | Current: a7dcac7 (median ms) | Previous: c0c03ea (median ms) | Ratio |
|---|---|---|---|
| permissioned_search - alternative 1 | 27483.391 | 1560.5149999999999 | 17.61 |
| semi_join_filter - alternative 1 | 50969.813500000004 | 31780.788999999997 | 1.60 |
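
The Ratio values in this alert are consistent with dividing the current median by the previous median, so a value above 1 means the current commit is slower. The following is a minimal sketch of that arithmetic and of a threshold check like the 1.10 one quoted above. The function names are hypothetical and this is not github-action-benchmark's own code.

```rust
/// Ratio as it appears in the table: current median divided by previous median.
/// Timings are in milliseconds, so a ratio above 1.0 means "slower".
fn ratio(current_ms: f64, previous_ms: f64) -> f64 {
    current_ms / previous_ms
}

/// Assumed alert rule: flag a benchmark when its ratio exceeds the threshold.
fn is_regression(current_ms: f64, previous_ms: f64, threshold: f64) -> bool {
    ratio(current_ms, previous_ms) > threshold
}
```

For example, `ratio(27483.391, 1560.5149999999999)` rounds to 17.61 and `ratio(50969.813500000004, 31780.788999999997)` rounds to 1.60, matching the two flagged rows.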

This comment was automatically generated by workflow using github-action-benchmark.

CC: @stuhood

@stuhood force-pushed the stuhood.optimized-plan branch from 72d412b to a7dcac7 on January 28, 2026 21:38
@stuhood added the `benchmark-queries` label Jan 28, 2026
@github-actions bot removed the `benchmark-queries` label Jan 28, 2026
github-actions[bot]

This comment was marked as outdated.

@stuhood force-pushed the stuhood.optimized-plan branch from a7dcac7 to 9ba22cd on January 28, 2026 23:06
@stuhood marked this pull request as ready for review January 28, 2026 23:06
@stuhood changed the title from "perf: Use DataFusion's optimizer in the join scan" to "chore: Use DataFusion's optimizer in the join scan" Jan 29, 2026
Comment thread pg_search/src/postgres/customscan/joinscan/planning.rs
Comment thread pg_search/src/scan/table_provider.rs
Comment thread pg_search/src/scan/table_provider.rs
Comment thread pg_search/src/scan/table_provider.rs
Contributor

@mdashti mdashti left a comment

@stuhood Thanks for the PR

Comment thread pg_search/src/scan/table_provider.rs
Comment thread pg_search/src/postgres/customscan/joinscan/scan_state.rs
Comment thread pg_search/src/postgres/customscan/joinscan/planning.rs
Comment thread pg_search/src/scan/table_provider.rs
Comment thread pg_search/src/postgres/customscan/joinscan/scan_state.rs
Comment thread pg_search/src/postgres/customscan/joinscan/planning.rs
Comment thread pg_search/src/postgres/customscan/joinscan/scan_state.rs
Comment thread pg_search/src/postgres/customscan/joinscan/mod.rs
Comment thread pg_search/src/postgres/customscan/joinscan/planning.rs
Comment thread pg_search/src/postgres/customscan/joinscan/mod.rs
Comment thread pg_search/src/scan/table_provider.rs
Collaborator Author

@stuhood stuhood left a comment

Thank you for the review feedback!

I've applied most of it to #4039, and will break out issues for the rest.

Comment thread pg_search/src/postgres/customscan/joinscan/scan_state.rs
Comment thread pg_search/src/scan/table_provider.rs
Comment thread pg_search/src/scan/table_provider.rs
Comment thread pg_search/src/postgres/customscan/joinscan/planning.rs
Comment thread pg_search/src/scan/table_provider.rs
Comment thread pg_search/src/postgres/customscan/joinscan/mod.rs
Comment thread pg_search/src/postgres/customscan/joinscan/planning.rs
Comment thread pg_search/src/postgres/customscan/joinscan/planning.rs
Comment thread pg_search/src/postgres/customscan/joinscan/scan_state.rs
Comment thread pg_search/src/postgres/customscan/joinscan/planning.rs
@stuhood stuhood merged commit a93a008 into main Feb 2, 2026
28 of 46 checks passed
@stuhood stuhood deleted the stuhood.optimized-plan branch February 2, 2026 20:45
stuhood added a commit that referenced this pull request Feb 3, 2026
…4039)

## What

Refactored the join scan to support pushing down nested joins (e.g., `(A
JOIN B) JOIN C`).

## Why

To enable multi-table joins to be executed in a columnar fashion by
DataFusion, and to avoid materializing tuples until after a LIMIT can be
safely applied (for TopN).

## How

* Replaced (mostly) explicit use of binary outer/inner sides with
collections of `JoinSource`s, to support arbitrary numbers of joined
relations.
* Extracted fast field pullup into a new `customscan/pullup.rs` module.
* Applied leftover review feedback from #4009
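
To illustrate the first bullet, here is a minimal sketch of turning a nested binary join such as `(A JOIN B) JOIN C` into a flat collection of sources. `JoinTree` and `collect_sources` are hypothetical names for this sketch, and the source names are plain strings; the actual `JoinSource` type is not shown in this thread and likely carries much more information.

```rust
// Hypothetical types for illustration; not the PR's actual `JoinSource`.
#[derive(Debug, Clone, PartialEq)]
enum JoinTree {
    /// A base relation, identified here only by name.
    Relation(String),
    /// A binary join of two subtrees, e.g. `(A JOIN B) JOIN C`.
    Join(Box<JoinTree>, Box<JoinTree>),
}

/// Walk the tree left to right and append every base relation to `out`,
/// so that any number of joined relations ends up in one flat list.
fn collect_sources(tree: &JoinTree, out: &mut Vec<String>) {
    match tree {
        JoinTree::Relation(name) => out.push(name.clone()),
        JoinTree::Join(left, right) => {
            collect_sources(left, out);
            collect_sources(right, out);
        }
    }
}
```

For `(A JOIN B) JOIN C`, this yields the sources `A`, `B`, `C` in that order, regardless of how deeply the joins are nested.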

## Tests

Expanded tests to cover multi-table scenarios, and added "alternatives"
which use the join scan to our existing three-table benchmarks.
Labels: Do Not Cherry Pick (PR should not be cherry-picked to other branches)

Issues that may be closed by this pull request: JOINs: M1: Implement execution of the joinscan

3 participants