ARROW-10885: [Rust][DataFusion] Optimize hash join build vs probe order based on number of rows by Dandandan · Pull Request #8961 · apache/arrow

Dandandan · 2020-12-18T13:37:37Z

This PR uses the num_rows statistics to implement a common optimization: using the smallest table for the build phase.
This is a good heuristic, because using the smallest table in the build phase means fewer items are inserted into the hash table, particularly when the table sizes are very imbalanced.
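To make the build/probe distinction concrete, here is a minimal sketch of an in-memory hash join. It is not DataFusion's implementation; the function name `hash_join` and the `(u32, &str)` row type are assumptions made for illustration. The point is that every row of the build side costs a hash-table insertion, while the probe side only performs lookups.

```rust
use std::collections::HashMap;

/// Joins two lists of (key, value) rows on equal keys.
/// `build` is inserted into the hash table; `probe` is streamed against it.
/// Hypothetical helper for illustration, not DataFusion's API.
fn hash_join(build: &[(u32, &str)], probe: &[(u32, &str)]) -> Vec<(u32, String, String)> {
    // Build phase: one hash-table insertion per build-side row.
    let mut table: HashMap<u32, Vec<&str>> = HashMap::new();
    for &(key, value) in build {
        table.entry(key).or_default().push(value);
    }

    // Probe phase: one lookup per probe-side row, no insertions.
    let mut out = Vec::new();
    for &(key, probe_value) in probe {
        if let Some(matches) = table.get(&key) {
            for build_value in matches {
                out.push((key, build_value.to_string(), probe_value.to_string()));
            }
        }
    }
    out
}
```

Passing the smaller input as `build` keeps the hash table small, which is the effect the optimization aims for.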

Some notes:

The optimization works on the LogicalPlan by swapping left and right, the join type and the key order. This currently seems the easiest place to add it, as there is no cost-based optimizer and no optimizers on the physical plan yet. The optimization rule assumes that the left side of the join will be used for the build phase and the right side for the probe phase.
It requires the number of rows to be known exactly, so it will not work whenever there is a transformation that changes the number of rows, except for limit. The idea here is that in other cases it is very hard to estimate the number of resulting rows.
The impact is currently measurable on queries with a bigger left side of an (inner) join.
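The notes above can be sketched roughly as follows. The `Plan` enum is a simplified, hypothetical stand-in for DataFusion's `LogicalPlan`, and the handling of `Limit` (cap a known input count at `n`) is an assumption about what "except for limit" means; only the names `get_num_rows` and `should_swap_join_order` come from the PR.

```rust
/// Simplified stand-in for DataFusion's `LogicalPlan` (hypothetical, for illustration).
enum Plan {
    /// A table scan whose source may report an exact row count.
    TableScan { num_rows: Option<usize> },
    /// Row-preserving nodes: the input's count passes through unchanged.
    Projection { input: Box<Plan> },
    Sort { input: Box<Plan> },
    /// A limit caps the row count at `n`.
    Limit { n: usize, input: Box<Plan> },
    /// A filter changes the row count in a way we cannot know exactly.
    Filter { input: Box<Plan> },
}

/// Returns the exact number of rows, if it can be derived from statistics.
fn get_num_rows(plan: &Plan) -> Option<usize> {
    match plan {
        Plan::TableScan { num_rows } => *num_rows,
        Plan::Projection { input } | Plan::Sort { input } => get_num_rows(input),
        // Assumption: a limit over a known count yields min(count, n).
        Plan::Limit { n, input } => get_num_rows(input).map(|rows| rows.min(*n)),
        Plan::Filter { .. } => None,
    }
}

/// The left side is the build side, so swap only when it is known to be larger.
fn should_swap_join_order(left: &Plan, right: &Plan) -> bool {
    match (get_num_rows(left), get_num_rows(right)) {
        (Some(l), Some(r)) => l > r,
        // If either count is unknown, keep the order the user wrote.
        _ => false,
    }
}
```

Returning `false` when either side is unknown keeps the rule conservative: it never reorders on a guess.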

FYI @andygrove @jorgecarleitao

github-actions · 2020-12-18T13:51:21Z

https://issues.apache.org/jira/browse/ARROW-10885

codecov-io · 2020-12-18T14:15:21Z

Codecov Report

Merging #8961 (1430e0d) into master (a054c78) will decrease coverage by 0.03%.
The diff coverage is 61.70%.

@@            Coverage Diff             @@
##           master    #8961      +/-   ##
==========================================
- Coverage   83.20%   83.16%   -0.04%     
==========================================
  Files         199      200       +1     
  Lines       48857    48946      +89     
==========================================
+ Hits        40651    40708      +57     
- Misses       8206     8238      +32

Impacted Files	Coverage Δ
rust/datafusion/src/logical_plan/plan.rs	`88.12% <ø> (ø)`
rust/datafusion/src/physical_plan/hash_utils.rs	`97.10% <ø> (ø)`
rust/datafusion/src/physical_plan/planner.rs	`80.45% <ø> (ø)`
rust/datafusion/src/sql/parser.rs	`86.87% <ø> (ø)`
...datafusion/src/optimizer/hash_build_probe_order.rs	`59.09% <59.09%> (ø)`
rust/datafusion/src/execution/context.rs	`90.00% <100.00%> (+0.01%)`	⬆️
...t/datafusion/src/optimizer/projection_push_down.rs	`97.70% <100.00%> (ø)`
rust/datafusion/src/optimizer/utils.rs	`61.75% <100.00%> (ø)`
rust/datafusion/src/physical_plan/hash_join.rs	`92.77% <100.00%> (ø)`
rust/datafusion/src/sql/planner.rs	`84.79% <100.00%> (ø)`
... and 4 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a054c78...1430e0d.

Dandandan · 2020-12-18T14:42:03Z

rust/datafusion/src/optimizer/hash_build_probe_order.rs

+        LogicalPlan::Projection { input, .. } => get_num_rows(input),
+        LogicalPlan::Sort { input, .. } => get_num_rows(input),
+        LogicalPlan::TableScan { source, .. } => source.statistics().num_rows,
+        LogicalPlan::EmptyRelation {


Not sure if this is relevant. Where is this used?

This looks good. This is used for projections without an input, such as SELECT 1.

andygrove · 2020-12-18T15:16:12Z

rust/datafusion/src/optimizer/hash_build_probe_order.rs

+                        right: left.clone(),
+                        on: on
+                            .iter()
+                            .map(|(l, r)| (r.to_string(), l.to_string()))


In theory, this is unnecessary, since SQL places no restrictions on the order of join conditions. However, it is possible that we do make some assumptions, so if you run into that it would be good to file an issue for it.

Makes sense, I was surprised that I had to. Currently it fails when you change the join order (e.g. in the query itself) without also changing the key order.

Filed here https://issues.apache.org/jira/browse/ARROW-10965
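As a rough illustration of the swap discussed in this thread, the sketch below swaps the two inputs, mirrors the join type, and flips each key pair. The `Join` struct and `JoinType` enum are simplified, hypothetical stand-ins (table names stand in for sub-plans); they are not DataFusion's actual types.

```rust
/// Hypothetical join types; a left join becomes a right join when inputs are swapped.
#[derive(Debug, Clone, Copy, PartialEq)]
enum JoinType {
    Inner,
    Left,
    Right,
}

/// Hypothetical, simplified join node.
#[derive(Debug, Clone, PartialEq)]
struct Join {
    left: String,
    right: String,
    /// Equi-join keys as (left column, right column) pairs.
    on: Vec<(String, String)>,
    join_type: JoinType,
}

fn swap_join_type(join_type: JoinType) -> JoinType {
    match join_type {
        JoinType::Inner => JoinType::Inner,
        JoinType::Left => JoinType::Right,
        JoinType::Right => JoinType::Left,
    }
}

/// Swaps left and right, the join type, and the order within each key pair.
fn swap_join(join: &Join) -> Join {
    Join {
        left: join.right.clone(),
        right: join.left.clone(),
        on: join
            .on
            .iter()
            .map(|(l, r)| (r.clone(), l.clone()))
            .collect(),
        join_type: swap_join_type(join.join_type),
    }
}
```

Flipping each `(l, r)` pair keeps the first element referring to a column of the new left input, which is the assumption the thread suspects the planner makes.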

andygrove · 2020-12-18T15:19:24Z

rust/datafusion/src/optimizer/hash_build_probe_order.rs

+            } => {
+                if should_swap_join_order(left, right) {
+                    // Swap left and right, change join type and (equi-)join key order
+                    Ok(LogicalPlan::Join {


I'm not sure if it is relevant, but when I have done this in other projects, I have wrapped the swapped join in a projection to preserve the column ordering of the output. This would be less surprising to a user if for some reason there is no final projection.

Isn't the order explicit in the schema (which is the same) or will it be changed based on swapping left and right?

We should add a unit test for this so that we know the answer to that question, I think.
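A small sketch of the concern raised here, under the assumption that a join's output schema is the left columns followed by the right columns (the thread leaves open whether that holds in DataFusion). The helpers `join_schema` and `restoring_projection` are hypothetical; they show how a projection over the swapped join could restore the original column order.

```rust
/// Output columns of a join, assuming left columns come first, then right columns.
fn join_schema(left: &[&str], right: &[&str]) -> Vec<String> {
    left.iter()
        .chain(right.iter())
        .map(|c| c.to_string())
        .collect()
}

/// For each column of `original`, finds its index in `swapped`.
/// Returns `None` if a column is missing from the swapped schema.
fn restoring_projection(original: &[String], swapped: &[String]) -> Option<Vec<usize>> {
    original
        .iter()
        .map(|col| swapped.iter().position(|c| c == col))
        .collect()
}
```

A unit test along these lines would compare the schema before and after the optimizer pass, and apply such a projection only if the two differ.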

andygrove · 2020-12-18T15:20:17Z

Thanks @Dandandan this is looking great 🚀

andygrove · 2020-12-18T16:33:49Z

rust/datafusion/src/optimizer/hash_build_probe_order.rs

+pub struct HashBuildProbeOrder {}
+
+// Gets exact number of rows, if known by the statistics of the underlying
+fn get_num_rows(logical_plan: &LogicalPlan) -> Option<usize> {


I have been thinking about what we can do to estimate the number of rows coming out of joins so that we can extend this optimization to nested joins. We can't do anything accurate with the current statistics in this case but I feel that we should try and do something rather than just pick the left side as the build side.

One idea is to assume that all joins produce a cartesian product (left row count * right row count). This would at least help in the case where two small tables are joined, and then joined with a huge table, or the other way around.
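The cartesian-product idea could look roughly like the sketch below. It is only an upper bound, not an accurate estimate, and the `Plan` enum and `estimate_rows` function are hypothetical names for illustration rather than anything in this PR.

```rust
/// Simplified plan with nested joins (hypothetical, for illustration).
enum Plan {
    TableScan { num_rows: Option<usize> },
    Join { left: Box<Plan>, right: Box<Plan> },
}

/// Estimates an upper bound on the row count by treating every join
/// as a cartesian product (left row count * right row count).
fn estimate_rows(plan: &Plan) -> Option<usize> {
    match plan {
        Plan::TableScan { num_rows } => *num_rows,
        Plan::Join { left, right } => {
            let l = estimate_rows(left)?;
            let r = estimate_rows(right)?;
            // Saturate instead of overflowing for very large inputs.
            Some(l.saturating_mul(r))
        }
    }
}
```

With two small tables of 10 and 20 rows, the nested join is estimated at 200 rows, which is still far below a 6001214-row table, so the nested join would be picked as the build side.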

Indeed I think that could be very beneficial but estimating it before executing might be really hard / impossible?

Also, if using the left as the build side is wrong, at this moment the user can change the order by changing the query itself. You lose that with a heuristic that can be wrong (or you have to provide some other mechanism, e.g. a query hint).

I think ideally you should be able to know more about the table size while the query is executing (à la Spark 3 adaptive query execution) so you don't do the wrong thing. BigQuery also has a nice strategy / explanation for this https://cloud.google.com/bigquery/query-plan-explanation This probably requires quite a few changes on the execution / planning side, but it would make many more statistics available to each step during execution, so the plan can be optimized further.

Some databases use the STRAIGHT_JOIN modifier to force joins to happen in the user-specified order. This is from Impala docs:

If statistics are not available for all the tables in the join query, or if Impala chooses a join order that is not the most efficient, you can override the automatic join order optimization by specifying the STRAIGHT_JOIN keyword immediately after the SELECT and any DISTINCT or ALL keywords. In this case, Impala uses the order the tables appear in the query to guide how the joins are processed.

I think we can merge this PR as is and continue this discussion. Spark's AQE approach would mean that we have the statistics, but only if we load both sides into memory first (or scan them first for row counts) which would possibly defeat the point of this optimization. It would also mean that the next operator in the query plan wouldn't be able to start streaming until the join has completed? This is a tricky area.

I filed https://issues.apache.org/jira/browse/ARROW-10964 for "Optimize nested joins" and referenced this discussion.

andygrove · 2020-12-18T17:28:21Z

I do think that we should start looking at optimizations on the physical plan and eventually move this optimization there. I also do think that an adaptive execution approach makes sense, especially in a distributed context. I think it might not work so well for the current single node / in-process execution approach though.

seddonm1 · 2020-12-18T23:00:38Z

This is awesome work @Dandandan 👍

Dandandan · 2020-12-18T23:07:58Z

Also related to #8965 which stops generating/using an index for the probe side.

Dandandan · 2020-12-19T07:53:17Z

I tested this with the other PR, #8965, merged; that PR improves the join implementation.

Besides the join being much faster regardless of this PR, reordering gives a further ~15% reduction in time on the following query (6001214 rows on the left vs 1499999 rows on the right):

select
    l_shipmode,
    sum(case
        when o_orderpriority = '1-URGENT'
            or o_orderpriority = '2-HIGH'
            then 1
        else 0
    end) as high_line_count,
    sum(case
        when o_orderpriority <> '1-URGENT'
            and o_orderpriority <> '2-HIGH'
            then 1
        else 0
    end) as low_line_count
from
    lineitem
join
    orders
on
    l_orderkey = o_orderkey
group by
    l_shipmode
order by
    l_shipmode;

Dandandan · 2020-12-19T16:02:39Z

I wrote some details of the PRs for a planned blog post.

https://docs.google.com/document/d/1Urxm34rl8DZ5D0vyhlrrBoZK6IHW7WFRN3hsaTfPujg/edit?usp=drivesdk

andygrove · 2020-12-20T17:56:28Z

@Dandandan This needs rebasing - I tried merging into master locally before merging this and got some compilation errors.

error[E0050]: method `scan` has 3 parameters but the declaration in trait `datasource::datasource::TableProvider::scan` has 4
   --> datafusion/src/optimizer/hash_build_probe_order.rs:179:13
    |
179 | /             &self,
180 | |             _projection: &Option<Vec<usize>>,
181 | |             _batch_size: usize,
    | |______________________________^ expected 4 parameters, found 3
    | 
   ::: datafusion/src/datasource/datasource.rs:66:9
    |
66  | /         &self,
67  | |         projection: &Option<Vec<usize>>,
68  | |         batch_size: usize,
69  | |         filters: &[Expr],
    | |________________________- trait requires 4 parameters

error: cannot construct `plan::LogicalPlan` with struct literal syntax due to inaccessible fields
   --> datafusion/src/optimizer/hash_build_probe_order.rs:204:23
    |
204 |         let lp_left = LogicalPlan::TableScan {
    |                       ^^^^^^^^^^^^^^^^^^^^^^

error: cannot construct `plan::LogicalPlan` with struct literal syntax due to inaccessible fields
   --> datafusion/src/optimizer/hash_build_probe_order.rs:211:24
    |
211 |         let lp_right = LogicalPlan::TableScan {
    |                        ^^^^^^^^^^^^^^^^^^^^^^

Dandandan · 2020-12-20T18:02:57Z

@andygrove thanks, will do. Will also enable it now that #8965 is merged.

Dandandan · 2020-12-20T18:50:09Z

@andygrove updated & enabled the optimization now.

andygrove · 2020-12-20T18:57:56Z

I merged with master locally and tested it and see a speedup. I will merge when CI is green. Thanks @Dandandan

Dandandan added 3 commits December 17, 2020 09:11

Add some pseudo code for hash order optimization based on rows

88fea63

WIP

8e5cbe5

Convert to optimizer pass, make it compile

1ab3574

github-actions bot added Component: Rust - DataFusion Component: Rust labels Dec 18, 2020

Dandandan added 3 commits December 18, 2020 14:38

Clippy

956b882

Remove printlns

9de63b6

Add unit test for num rows

26852aa

Test import fix

7b7e1b8

Dandandan added 3 commits December 18, 2020 15:36

Clippy, add extra test for swapping order

92f3043

Cleanup, add some comments

c871217

Disable optimization in code

f55ba74

andygrove reviewed Dec 18, 2020

andygrove reviewed Dec 18, 2020

Dandandan added 3 commits December 18, 2020 16:23

Complete comment

35c393c

Missing derived Copy

51f700c

Add import to tests

680e626

andygrove reviewed Dec 18, 2020

Dandandan added 2 commits December 18, 2020 17:34

Test fix

6aa357a

Unneeded clone

da7a858

andygrove approved these changes Dec 18, 2020

Dandandan marked this pull request as ready for review December 18, 2020 18:19

Apply optimization recursively in join too

d9ffa00

Dandandan added 2 commits December 20, 2020 19:44

Merge remote-tracking branch 'upstream/master' into rows_hash

ad62963

Update & enable

959ce86

Remove outdated comment

1430e0d

andygrove closed this in c019e79 Dec 20, 2020

alamb mentioned this pull request Apr 26, 2021

Optimize nested joins apache/datafusion#128

Closed

asfimport mentioned this pull request Dec 20, 2020

[Rust][DataFusion] Optimize join build vs probe based on statistics on row number #26819

Closed
