Skip to content

ARROW-10968: [Rust][DataFusion] Don't build hash table for right side of join#8965

Closed
Dandandan wants to merge 6 commits intoapache:masterfrom
Dandandan:right_hash
Closed

ARROW-10968: [Rust][DataFusion] Don't build hash table for right side of join#8965
Dandandan wants to merge 6 commits intoapache:masterfrom
Dandandan:right_hash

Conversation

@Dandandan
Copy link
Contributor

@Dandandan Dandandan commented Dec 18, 2020

This PR changes to not build an index for the probe side of the join. As I observed while writing the PR for adding an optimization pass for the build/probe side of joins, currently it takes more time to have the biggest table on the probe side, which is not what's expected.
The current implementation also creates a hashset for both the left and right side for each new batch for inner joins.

This change has big impact on join performance, e.g. TCP-H query 12 has a >4x speedup and query 5 a 16x speed up.

Query 12 (locally, in memory).

Master

Query 12 iteration 0 took 1102 ms
Query 12 iteration 1 took 1084 ms
Query 12 iteration 2 took 1099 ms
Query 12 iteration 3 took 1077 ms
Query 12 iteration 4 took 1082 ms
Query 12 iteration 5 took 1098 ms
Query 12 iteration 6 took 1081 ms
Query 12 iteration 7 took 1101 ms
Query 12 iteration 8 took 1138 ms
Query 12 iteration 9 took 1084 ms

PR

Query 12 iteration 0 took 257 ms
Query 12 iteration 1 took 255 ms
Query 12 iteration 2 took 255 ms
Query 12 iteration 3 took 254 ms
Query 12 iteration 4 took 260 ms
Query 12 iteration 5 took 261 ms
Query 12 iteration 6 took 266 ms
Query 12 iteration 7 took 259 ms
Query 12 iteration 8 took 256 ms
Query 12 iteration 9 took 255 ms

Query 5: ~16x speedup

Master:

Query 5 iteration 0 took 15857 ms
Query 5 iteration 1 took 15428 ms
Query 5 iteration 2 took 15234 ms
Query 5 iteration 3 took 15024 ms
Query 5 iteration 4 took 14942 ms
Query 5 iteration 5 took 14926 ms
Query 5 iteration 6 took 14900 ms
Query 5 iteration 7 took 15073 ms
Query 5 iteration 8 took 15176 ms
Query 5 iteration 9 took 15076 ms

PR

Query 5 iteration 0 took 1282 ms
Query 5 iteration 1 took 930 ms
Query 5 iteration 2 took 940 ms
Query 5 iteration 3 took 882 ms
Query 5 iteration 4 took 891 ms
Query 5 iteration 5 took 903 ms
Query 5 iteration 6 took 903 ms
Query 5 iteration 7 took 900 ms
Query 5 iteration 8 took 905 ms
Query 5 iteration 9 took 905 ms

FYI @andygrove @jorgecarleitao

@github-actions
Copy link

@codecov-io
Copy link

codecov-io commented Dec 18, 2020

Codecov Report

Merging #8965 (9ed27d5) into master (d65ba4e) will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #8965   +/-   ##
=======================================
  Coverage   83.25%   83.25%           
=======================================
  Files         196      196           
  Lines       48116    48195   +79     
=======================================
+ Hits        40059    40127   +68     
- Misses       8057     8068   +11     
Impacted Files Coverage Δ
rust/datafusion/src/physical_plan/hash_join.rs 92.16% <100.00%> (+0.07%) ⬆️
rust/parquet/src/arrow/array_reader.rs 77.00% <0.00%> (-0.56%) ⬇️
rust/parquet/src/arrow/schema.rs 91.31% <0.00%> (-0.50%) ⬇️
rust/parquet/src/encodings/encoding.rs 95.24% <0.00%> (-0.20%) ⬇️
rust/parquet/src/file/statistics.rs 93.80% <0.00%> (ø)
rust/arrow/src/array/array_binary.rs 90.73% <0.00%> (+0.21%) ⬆️
rust/parquet/src/schema/types.rs 90.19% <0.00%> (+0.26%) ⬆️
rust/datafusion/src/datasource/parquet.rs 95.62% <0.00%> (+0.30%) ⬆️
rust/parquet/src/arrow/arrow_reader.rs 91.25% <0.00%> (+0.66%) ⬆️
rust/parquet/src/file/metadata.rs 91.82% <0.00%> (+0.77%) ⬆️
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d65ba4e...9ed27d5. Read the comment docs.

})
} else {
// key not on the right => push Nones
left_indexes.iter().for_each(|x| {
Copy link
Contributor Author

@Dandandan Dandandan Dec 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this wrong already? Shouldn't it visit all right batches before adding nulls for the left side that had no matches at all?

Copy link
Contributor Author

@Dandandan Dandandan Dec 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I think this should be resolved in another PR. I think best would to create/keep a bitmap for each index on the left during the join.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

Really good idea and impressive performance improvement. Thanks a lot @Dandandan !

Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested this locally. Very nice speedup 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants