ARROW-10968: [Rust][DataFusion] Don't build hash table for right side of join by Dandandan · Pull Request #8965 · apache/arrow

Dandandan · 2020-12-18T22:54:21Z

This PR stops building an index for the probe side of the join. While writing the PR that adds an optimization pass for choosing the build and probe sides of joins, I observed that it currently takes more time to have the biggest table on the probe side, which is not what is expected.
The current implementation also creates a hash set for both the left and right sides for each new batch in inner joins.
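The idea can be illustrated with a minimal sketch. This is not the code in `hash_join.rs`; the function names `build_left_index` and `probe_batch` and the use of plain `i64` keys are assumptions made for illustration. The hash table is built once for the left (build) side, and each right (probe) batch is only scanned and looked up, with no per-batch hash table or hash set.

```rust
use std::collections::HashMap;

/// Build phase (hypothetical helper): map each join key on the build (left)
/// side to the indices of the left rows that hold it.
fn build_left_index(left_keys: &[i64]) -> HashMap<i64, Vec<usize>> {
    let mut index: HashMap<i64, Vec<usize>> = HashMap::new();
    for (row, key) in left_keys.iter().enumerate() {
        index.entry(*key).or_default().push(row);
    }
    index
}

/// Probe phase (hypothetical helper): scan one right batch and look each key
/// up in the left index. Nothing is hashed into a table for the right batch.
/// Returns (left_row, right_row) pairs, where right_row is local to the batch.
fn probe_batch(
    left_index: &HashMap<i64, Vec<usize>>,
    right_keys: &[i64],
) -> Vec<(usize, usize)> {
    let mut pairs = Vec::new();
    for (right_row, key) in right_keys.iter().enumerate() {
        if let Some(left_rows) = left_index.get(key) {
            for &left_row in left_rows {
                pairs.push((left_row, right_row));
            }
        }
    }
    pairs
}
```

With left keys `[1, 2, 2, 3]`, the index is built once and reused for every right batch: probing the batch `[2, 4]` yields `[(1, 0), (2, 0)]`, and probing `[3, 1]` yields `[(3, 0), (0, 1)]`.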

This change has a big impact on join performance: TPC-H query 12 gets a >4x speedup and query 5 a ~16x speedup.

Query 12 (locally, in memory).

Master

Query 12 iteration 0 took 1102 ms
Query 12 iteration 1 took 1084 ms
Query 12 iteration 2 took 1099 ms
Query 12 iteration 3 took 1077 ms
Query 12 iteration 4 took 1082 ms
Query 12 iteration 5 took 1098 ms
Query 12 iteration 6 took 1081 ms
Query 12 iteration 7 took 1101 ms
Query 12 iteration 8 took 1138 ms
Query 12 iteration 9 took 1084 ms

PR

Query 12 iteration 0 took 257 ms
Query 12 iteration 1 took 255 ms
Query 12 iteration 2 took 255 ms
Query 12 iteration 3 took 254 ms
Query 12 iteration 4 took 260 ms
Query 12 iteration 5 took 261 ms
Query 12 iteration 6 took 266 ms
Query 12 iteration 7 took 259 ms
Query 12 iteration 8 took 256 ms
Query 12 iteration 9 took 255 ms

Query 5: ~16x speedup

Master:

Query 5 iteration 0 took 15857 ms
Query 5 iteration 1 took 15428 ms
Query 5 iteration 2 took 15234 ms
Query 5 iteration 3 took 15024 ms
Query 5 iteration 4 took 14942 ms
Query 5 iteration 5 took 14926 ms
Query 5 iteration 6 took 14900 ms
Query 5 iteration 7 took 15073 ms
Query 5 iteration 8 took 15176 ms
Query 5 iteration 9 took 15076 ms

PR

Query 5 iteration 0 took 1282 ms
Query 5 iteration 1 took 930 ms
Query 5 iteration 2 took 940 ms
Query 5 iteration 3 took 882 ms
Query 5 iteration 4 took 891 ms
Query 5 iteration 5 took 903 ms
Query 5 iteration 6 took 903 ms
Query 5 iteration 7 took 900 ms
Query 5 iteration 8 took 905 ms
Query 5 iteration 9 took 905 ms
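
As a check on the quoted figures, the speedups can be recomputed from the iteration times listed above. The sketch below averages the ten runs for each configuration and divides master by PR; the helper names `mean_ms` and `speedup` are illustrative only. It gives roughly 4.2x for query 12 and roughly 16.1x for query 5.

```rust
/// Mean of per-iteration timings in milliseconds.
fn mean_ms(samples: &[u32]) -> f64 {
    samples.iter().map(|&ms| f64::from(ms)).sum::<f64>() / samples.len() as f64
}

/// Speedup of the PR over master, as a ratio of mean run times.
fn speedup(master: &[u32], pr: &[u32]) -> f64 {
    mean_ms(master) / mean_ms(pr)
}

// Timings copied from the benchmark output in this PR description.
const Q12_MASTER: [u32; 10] = [1102, 1084, 1099, 1077, 1082, 1098, 1081, 1101, 1138, 1084];
const Q12_PR: [u32; 10] = [257, 255, 255, 254, 260, 261, 266, 259, 256, 255];
const Q5_MASTER: [u32; 10] = [15857, 15428, 15234, 15024, 14942, 14926, 14900, 15073, 15176, 15076];
const Q5_PR: [u32; 10] = [1282, 930, 940, 882, 891, 903, 903, 900, 905, 905];
```

Calling `speedup(&Q12_MASTER, &Q12_PR)` and `speedup(&Q5_MASTER, &Q5_PR)` reproduces the two ratios. The first PR iteration of query 5 (1282 ms) is included in the mean, so the steady-state speedup is slightly higher than the computed value.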

FYI @andygrove @jorgecarleitao

github-actions · 2020-12-18T22:57:07Z

https://issues.apache.org/jira/browse/ARROW-10968

codecov-io · 2020-12-18T23:11:58Z

Codecov Report

Merging #8965 (9ed27d5) into master (d65ba4e) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #8965   +/-   ##
=======================================
  Coverage   83.25%   83.25%           
=======================================
  Files         196      196           
  Lines       48116    48195   +79     
=======================================
+ Hits        40059    40127   +68     
- Misses       8057     8068   +11

Impacted Files	Coverage Δ
rust/datafusion/src/physical_plan/hash_join.rs	`92.16% <100.00%> (+0.07%)`	⬆️
rust/parquet/src/arrow/array_reader.rs	`77.00% <0.00%> (-0.56%)`	⬇️
rust/parquet/src/arrow/schema.rs	`91.31% <0.00%> (-0.50%)`	⬇️
rust/parquet/src/encodings/encoding.rs	`95.24% <0.00%> (-0.20%)`	⬇️
rust/parquet/src/file/statistics.rs	`93.80% <0.00%> (ø)`
rust/arrow/src/array/array_binary.rs	`90.73% <0.00%> (+0.21%)`	⬆️
rust/parquet/src/schema/types.rs	`90.19% <0.00%> (+0.26%)`	⬆️
rust/datafusion/src/datasource/parquet.rs	`95.62% <0.00%> (+0.30%)`	⬆️
rust/parquet/src/arrow/arrow_reader.rs	`91.25% <0.00%> (+0.66%)`	⬆️
rust/parquet/src/file/metadata.rs	`91.82% <0.00%> (+0.77%)`	⬆️
... and 2 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Dandandan · 2020-12-18T23:17:30Z

rust/datafusion/src/physical_plan/hash_join.rs

-                    })
-                } else {
-                    // key not on the right => push Nones
-                    left_indexes.iter().for_each(|x| {


Isn't this already wrong? Shouldn't it visit all right batches before adding nulls for the left rows that had no matches at all?

But I think this should be resolved in another PR. I think the best approach would be to create and keep a bitmap that marks each left index during the join.

Opened https://issues.apache.org/jira/browse/ARROW-10971 for this.
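
A minimal sketch of the bitmap idea, under stated assumptions: the `LeftJoinState` type and its methods are hypothetical and are not taken from DataFusion, keys are plain `i64`, and a `Vec<bool>` stands in for a packed bitmap. Matched left rows are marked while probing each right batch, and the unmatched left rows are only read out after the last right batch has been visited.

```rust
use std::collections::HashMap;

/// Hypothetical left-join state: the left index plus one "visited" flag per left row.
struct LeftJoinState {
    left_index: HashMap<i64, Vec<usize>>,
    visited: Vec<bool>, // a packed bitmap would serve the same purpose
}

impl LeftJoinState {
    fn new(left_keys: &[i64]) -> Self {
        let mut left_index: HashMap<i64, Vec<usize>> = HashMap::new();
        for (row, key) in left_keys.iter().enumerate() {
            left_index.entry(*key).or_default().push(row);
        }
        LeftJoinState {
            left_index,
            visited: vec![false; left_keys.len()],
        }
    }

    /// Probe one right batch: return matched (left_row, right_row) pairs and
    /// mark every matched left row as visited.
    fn probe(&mut self, right_keys: &[i64]) -> Vec<(usize, usize)> {
        let mut pairs = Vec::new();
        for (right_row, key) in right_keys.iter().enumerate() {
            if let Some(left_rows) = self.left_index.get(key) {
                for &left_row in left_rows {
                    self.visited[left_row] = true;
                    pairs.push((left_row, right_row));
                }
            }
        }
        pairs
    }

    /// Left rows with no match so far. For a correct left join this should be
    /// called only after the last right batch; these rows get nulls on the right.
    fn unmatched_left_rows(&self) -> Vec<usize> {
        self.visited
            .iter()
            .enumerate()
            .filter(|(_, seen)| !**seen)
            .map(|(row, _)| row)
            .collect()
    }
}
```

For left keys `[1, 2, 3]` and right batches `[2]` then `[1]`, emitting nulls after the first batch would wrongly pad left rows 0 and 2, while waiting until both batches are processed leaves only row 2 unmatched.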

jorgecarleitao

Really good idea and impressive performance improvement. Thanks a lot @Dandandan !

andygrove

I tested this locally. Very nice speedup 🚀

Dandandan added 2 commits December 18, 2020 23:47

Don't build hash table for right side of join

f8ca860

Remove code

7c03853

github-actions bot added Component: Rust - DataFusion Component: Rust labels Dec 18, 2020

Update comments

5159ef2

Dandandan mentioned this pull request Dec 18, 2020

ARROW-10885: [Rust][DataFusion] Optimize hash join build vs probe order based on number of rows #8961

Closed

Dandandan added 3 commits December 19, 2020 08:34

Clippy, don't store zeros for batches

1985647

Remove old comment

dff61d1

Renaming

9ed27d5

jorgecarleitao approved these changes Dec 20, 2020

andygrove approved these changes Dec 20, 2020

andygrove closed this in a054c78 Dec 20, 2020

asfimport mentioned this pull request Dec 20, 2020

[Rust][DataFusion] Don't build hash table for right side of the join #26891

Closed

Dandandan wants to merge 6 commits into apache:master from Dandandan:right_hash
