Hash Join Vectorized collision checking#6724
Conversation
| indices_right: UInt32Array, | ||
| left_arrays: &[ArrayRef], | ||
| right_arrays: &[ArrayRef], | ||
| null_equals_null: bool, |
There was a problem hiding this comment.
AFAIK null_equals_null is not supported in the arrow kernel.
There was a problem hiding this comment.
Indeed - working on implementing this now
There was a problem hiding this comment.
How about folding?
pub fn equal_rows_arr(
indices_left: UInt64Array,
indices_right: UInt32Array,
left_arrays: &[ArrayRef],
right_arrays: &[ArrayRef],
null_equals_null: bool,
) -> Result<(UInt64Array, UInt32Array)> {
let mut iter = left_arrays.iter().zip(right_arrays.iter());
let (first_left, first_right) = iter.next().ok_or_else(|| {
DataFusionError::Internal("At least one array should be provided for both left and right".to_string())
})?;
let arr_left = take(first_left.as_ref(), &indices_left, None)?;
let arr_right = take(first_right.as_ref(), &indices_right, None)?;
let mut equal = eq_dyn_null(&arr_left, &arr_right, null_equals_null)?;
// Use map and try_fold to iterate over the remaining pairs of arrays.
// In each iteration, take is used on the pair of arrays and their equality is determined.
// The results are then folded (combined) using the and function to get a final equality result.
equal = iter
.map(|(left, right)| {
let arr_left = take(left.as_ref(), &indices_left, None)?;
let arr_right = take(right.as_ref(), &indices_right, None)?;
eq_dyn_null(arr_left.as_ref(), arr_right.as_ref(), null_equals_null)
})
.try_fold(equal, |acc, res| {
let equal2 = res?;
and(&acc, &equal2)
})?;
// Continue
}There was a problem hiding this comment.
In addition, we do not need to take ownership of indices_left and indices_right.
| err.unwrap_or(Ok(res)) | ||
| } | ||
|
|
||
| fn eq_dyn_null( |
There was a problem hiding this comment.
A null element can occur in an Int32Array as well. Maybe we should insert an option to arrow kernel, in arrow-rs.
There was a problem hiding this comment.
I think we can open an issue on arrow-rs for this functionality.
There was a problem hiding this comment.
|
There is a significant improvement in the equality check, nice work. Maybe then the |
|
LGTM, nice work @Dandandan thanks for the effort. |
Thanks for the feedback! |
|
Btw I think this can be applied to the symmetric hash join as well without changes. |
Which issue does this PR close?
Closes #50
Rationale for this change
Speed up for hash joins:
What changes are included in this PR?
Changes the collision checking to be vectorized.
Are these changes tested?
Are there any user-facing changes?