ARROW-10885: [Rust][DataFusion] Optimize hash join build vs probe order based on number of rows#8961
ARROW-10885: [Rust][DataFusion] Optimize hash join build vs probe order based on number of rows#8961Dandandan wants to merge 19 commits intoapache:masterfrom
Conversation
Codecov Report
@@ Coverage Diff @@
## master #8961 +/- ##
==========================================
- Coverage 83.20% 83.16% -0.04%
==========================================
Files 199 200 +1
Lines 48857 48946 +89
==========================================
+ Hits 40651 40708 +57
- Misses 8206 8238 +32
Continue to review full report at Codecov.
|
| LogicalPlan::Projection { input, .. } => get_num_rows(input), | ||
| LogicalPlan::Sort { input, .. } => get_num_rows(input), | ||
| LogicalPlan::TableScan { source, .. } => source.statistics().num_rows, | ||
| LogicalPlan::EmptyRelation { |
There was a problem hiding this comment.
Not sure if this is relevant, where this is used?
There was a problem hiding this comment.
This looks good. This is used for projections without an input, such as SELECT 1.
| right: left.clone(), | ||
| on: on | ||
| .iter() | ||
| .map(|(l, r)| (r.to_string(), l.to_string())) |
There was a problem hiding this comment.
in theory, this is unnecessary since there are no restrictions in SQL on order of join conditions. However, it is possible we do make some assumptions so if you ean into that it would be good to file an issue for it.
There was a problem hiding this comment.
Makes sense, was surprised by it that I had to. Currently it fails when you change the order (e.g. in a query already) without changing the key order.
There was a problem hiding this comment.
| } => { | ||
| if should_swap_join_order(left, right) { | ||
| // Swap left and right, change join type and (equi-)join key order | ||
| Ok(LogicalPlan::Join { |
There was a problem hiding this comment.
I'm not sure if it is relevant, but when I have done this in other projects, I have wrapped the swapped join in a projection to preserve the column ordering of the output. This would be less surprising to a user if for some reason there is no final projection.
There was a problem hiding this comment.
Isn't the order explicit in the schema (which is the same) or will it be changed based on swapping left and right?
There was a problem hiding this comment.
We should add a unit test for this so that we know the answer to that question, I think.
|
Thanks @Dandandan this is looking great 🚀 |
| pub struct HashBuildProbeOrder {} | ||
|
|
||
| // Gets exact number of rows, if known by the statistics of the underlying | ||
| fn get_num_rows(logical_plan: &LogicalPlan) -> Option<usize> { |
There was a problem hiding this comment.
I have been thinking about what we can do to estimate the number of rows coming out of joins so that we can extend this optimization to nested joins. We can't do anything accurate with the current statistics in this case but I feel that we should try and do something rather than just pick the left side as the build side.
One idea is to assume that all joins produce a cartesian product (left row count * right row count). This would at least help in the case where two small tables are joined, and then joined with a huge table, or the other way around.
There was a problem hiding this comment.
Indeed I think that could be very beneficial but estimating it before executing might be really hard / impossible?
Also, if using the left as build side wrong, at this moment, the order could be changed by the user by changing the query itself, which you lose by having a heuristic that can be wrong (or you have to provide some other mechanism , e.g. providing a query hint)
I think ideally you should be able to know more about the table size when the query is executing (a la Spark 3 adaptive query execution) so you don't do the wrong thing. BigQuery also has a nice strategy / explanation for this https://cloud.google.com/bigquery/query-plan-explanation This probably requires quite a bit of changes on the execution / planning side, but this would bring much more available statistics to each step during execution to be able to change optimize the plan further.
There was a problem hiding this comment.
Some databases use the STRAIGHT_JOIN modifier to force joins to happen in the user-specified order. This is from Impala docs:
If statistics are not available for all the tables in the join query, or if Impala chooses a join order that is not the most efficient, you can override the automatic join order optimization by specifying the STRAIGHT_JOIN keyword immediately after the SELECT and any DISTINCT or ALL keywords. In this case, Impala uses the order the tables appear in the query to guide how the joins are processed.
I think we can merge this PR as is and continue this discussion. Spark's AQE approach would mean that we have the statistics, but only if we load both sides into memory first (or scan them first for row counts) which would possibly defeat the point of this optimization. It would also mean that the next operator in the query plan wouldn't be able to start streaming until the join has completed? This is a tricky area.
There was a problem hiding this comment.
I filed https://issues.apache.org/jira/browse/ARROW-10964 for "Optimize nested joins" and referenced this discussion.
|
I do think that we should start looking at optimizations on the physical plan and eventually move this optimization there. I also do think that an adaptive execution approach makes sense, especially in a distributed context. I think it might not work so well for the current single node / in-process execution approach though. |
|
This is awesome work @Dandandan 👍 |
|
Also related to #8965 which stops generating/using an index for the probe side. |
|
I checked merging the other PR #8965 which improves the join implementation. Besides being much faster regardless of this PR, reordering gives a further ~15% reduction in time when reordering the following query (6001214 left vs 1499999 rows on the right) |
|
I wrote some details of the PRs for a planned blog post. |
|
@Dandandan This needs rebasing - I tried merging into master locally before merging this and got some compilation errors. |
|
@andygrove thanks, will do. Will also enable it now that #8965 is merged. |
|
@andygrove updated & enabled the optimization now. |
|
I merged with master locally and tested it and see a speedup. I will merge when CI is green. Thanks @Dandandan |
This PR uses the
num_rowsstatistics to implement a common optimization to use the smallest table for the build phase.This is a good heuristic, as to have the smallest table used in the build phase leads to less items to be inserted to the hash table, in particular if the size of tables is very imbalanced.
Some notes:
LogicalPlanby swapping left and right, the join type and the key order. This seems currently the easiest place to add it, as there is no cost based optimizer and/or optimizers on the physical plan yet. The optimization rule assumes that the left part of the join will be used for the build phase and the right part for the probe phase.limit. The idea here is that in other cases, it is very hard to estimate the number of resulting rows.FYI @andygrove @jorgecarleitao