ARROW-10968: [Rust][DataFusion] Don't build hash table for right side of join #8965
Dandandan wants to merge 6 commits into apache:master
Codecov Report
@@           Coverage Diff           @@
##           master    #8965   +/-   ##
=======================================
  Coverage   83.25%   83.25%
=======================================
  Files         196      196
  Lines       48116    48195   +79
=======================================
+ Hits        40059    40127   +68
- Misses       8057     8068   +11
    })
} else {
    // key not on the right => push Nones
    left_indexes.iter().for_each(|x| {
Isn't this wrong already? Shouldn't it visit all right batches before adding nulls for the left side that had no matches at all?
But I think this should be resolved in another PR. I think it would be best to create/keep a bitmap for each index on the left during the join.
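To illustrate the bitmap idea, here is a minimal sketch, not the DataFusion implementation. It assumes plain `i64` join keys, uses a `Vec<bool>` as a stand-in for a real bitmap, and the name `left_join` and the output tuple layout are hypothetical. Unmatched left rows are emitted with `None` only after every right batch has been visited.

```rust
use std::collections::HashMap;

/// Left-join sketch over plain i64 keys (hypothetical helper, for illustration only).
/// Each output row is (left_row, Some((right_batch, right_row))) for a match,
/// or (left_row, None) for a left row that matched nothing in any right batch.
fn left_join(
    left_keys: &[i64],
    right_batches: &[Vec<i64>],
) -> Vec<(usize, Option<(usize, usize)>)> {
    // Build side: key -> indices of the left rows holding that key.
    let mut index: HashMap<i64, Vec<usize>> = HashMap::new();
    for (row, key) in left_keys.iter().enumerate() {
        index.entry(*key).or_default().push(row);
    }

    // One flag per left row, kept for the whole join (the "bitmap").
    let mut matched = vec![false; left_keys.len()];
    let mut out = Vec::new();

    // Probe side: visit every right batch before deciding which left rows are unmatched.
    for (batch_idx, batch) in right_batches.iter().enumerate() {
        for (right_row, key) in batch.iter().enumerate() {
            if let Some(left_rows) = index.get(key) {
                for &left_row in left_rows {
                    matched[left_row] = true;
                    out.push((left_row, Some((batch_idx, right_row))));
                }
            }
        }
    }

    // Only now emit the null-padded rows for left rows that never matched.
    for (left_row, seen) in matched.iter().enumerate() {
        if !*seen {
            out.push((left_row, None));
        }
    }
    out
}
```

With left keys `[1, 2, 3]` and right batches `[2]` and `[1]`, only left row 2 is emitted with `None`, and only after both batches have been probed.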
jorgecarleitao left a comment
Really good idea and impressive performance improvement. Thanks a lot @Dandandan !
andygrove left a comment
I tested this locally. Very nice speedup 🚀
This PR stops building an index for the probe side of the join. As I observed while writing the PR that adds an optimization pass for choosing the build/probe side of joins, it currently takes more time to have the biggest table on the probe side, which is not what's expected.
The current implementation also creates a hash set for both the left and the right side for each new batch for inner joins.
This change has a big impact on join performance, e.g. TPC-H query 12 has a >4x speedup and query 5 a 16x speedup.
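As a rough sketch of the idea (not the code in this PR), assuming plain `i64` join keys and hypothetical helper names `build_index` and `probe_batch`: the hash table is built once over the build (left) side, and each probe (right) batch is only scanned and looked up, without building any index of its own.

```rust
use std::collections::HashMap;

/// Build the hash table once over the build (left) side:
/// join key -> indices of the left rows holding that key.
/// (Hypothetical helper, for illustration only.)
fn build_index(left_keys: &[i64]) -> HashMap<i64, Vec<usize>> {
    let mut index: HashMap<i64, Vec<usize>> = HashMap::new();
    for (row, key) in left_keys.iter().enumerate() {
        index.entry(*key).or_default().push(row);
    }
    index
}

/// Probe a single right batch against the left index.
/// No hash table or hash set is built for the right side; each right row
/// is looked up directly. Returns (left_row, right_row) pairs for an inner join.
fn probe_batch(
    index: &HashMap<i64, Vec<usize>>,
    right_keys: &[i64],
) -> Vec<(usize, usize)> {
    let mut pairs = Vec::new();
    for (right_row, key) in right_keys.iter().enumerate() {
        if let Some(left_rows) = index.get(key) {
            for &left_row in left_rows {
                pairs.push((left_row, right_row));
            }
        }
    }
    pairs
}
```

The per-batch cost on the probe side is then one lookup per row, which is why the larger table is the better choice for the probe side in this scheme.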
Query 12 (locally, in memory).
Master
PR
Query 5: ~16x speedup
Master:
PR
FYI @andygrove @jorgecarleitao