Add distinct kernels (#960) (#4438)#4716
Conversation
arrow-ord/src/cmp.rs
Outdated
There was a problem hiding this comment.
This is technically stricter than the previous logic, as this hasn't been released yet I think this is fine
2c13ff3 to
a2c1cfe
Compare
|
I've integrated this into apache/datafusion#7282 for additional test verification |
| let a = l.value(l_s); | ||
| let b = r.value(r_s); | ||
| std::iter::once(op(a, b)).collect() | ||
| std::iter::once(op(a, b) ^ neg).collect() |
There was a problem hiding this comment.
This was a pre-existing bug, that resulted in incorrect results when comparing scalars
There was a problem hiding this comment.
Can we also please add a documentation comment explaining how to interpret the neg argument?
| } | ||
|
|
||
| impl From<BooleanBuffer> for BooleanArray { | ||
| fn from(values: BooleanBuffer) -> Self { |
There was a problem hiding this comment.
this is a nice UX improvement -- I had been looking for that when working on group by in DataFusion
| compare_op(Op::GreaterEqual, lhs, rhs) | ||
| } | ||
|
|
||
| /// Perform `left IS DISTINCT FROM right` operation on two [`Datum`] |
There was a problem hiding this comment.
| /// Perform `left IS DISTINCT FROM right` operation on two [`Datum`] | |
| /// Perform `left IS DISTINCT FROM right` operation on two [`Datum`]. `IS DISTINCT` | |
| /// similar to `NotEq`, differing in null handling. Two operands are considered DISTINCT | |
| /// if they have a different value or if one of them is NULL and the other isn't. | |
| /// The result of `IS DISTINCT FROM` is never NULL. |
| compare_op(Op::Distinct, lhs, rhs) | ||
| } | ||
|
|
||
| /// Perform `left IS NOT DISTINCT FROM right` operation on two [`Datum`] |
There was a problem hiding this comment.
| /// Perform `left IS NOT DISTINCT FROM right` operation on two [`Datum`] | |
| /// Perform `left IS NOT DISTINCT FROM right` operation on two [`Datum`]. `IS NOT DISTINCT` | |
| /// similar to `Eq`, differing in null handling. Two operands are considered NOT DISTINCT | |
| /// if they have the same value or if both of them are NULL. | |
| /// The result of `IS NOT DISTINCT FROM` is never NULL. |
|
|
||
| let c = |((l, r), n)| ((l ^ r) | (l & r & n)); | ||
| let buffer = l.zip(r).zip(ne).map(c).collect(); | ||
| BooleanBuffer::new(buffer, 0, len).into() |
There was a problem hiding this comment.
💯 for no null buffer -- not sure if that is worth calling out in a comment or not
| let l = nulls.inner().bit_chunks().iter_padded(); | ||
| let ne = values.bit_chunks().iter_padded(); | ||
| let c = |(l, n)| u64::not(l) | n; | ||
| let buffer = l.zip(ne).map(c).collect(); |
There was a problem hiding this comment.
It took me a while to understand why this clause didn't mirror the Op::NotDistinct clause
NotDistinct can simply use https://docs.rs/arrow/latest/arrow/buffer/struct.BooleanBuffer.html#impl-BitOr%3C%26BooleanBuffer%3E-for-%26BooleanBuffer
but It seems it is because there is no equivalent of https://doc.rust-lang.org/nightly/core/ops/trait.Neg.html for BooleanBuffer and even if there were that would result in allocating a temporary buffer that might wasteful
Though now that I write this I wonder if performance could be improved here by deferring / reusing the buffer allocated by values(). Maybe as some future optimization.
There was a problem hiding this comment.
BooleanBuffer does implement https://doc.rust-lang.org/std/ops/trait.Not.html
However, as you surmised, computing the mask as is done here is marginally better for performance.
Deferring the buffer allocated by values would result in non-trivial additional codegen, and would likely confuse LLVMs already temperamental vectorisation,
Reusing the buffer is potentially worth exploring. I suspect that in most cases the performance gain would be marginal at best, but I haven't profiled this
| let a = l.value(l_s); | ||
| let b = r.value(r_s); | ||
| std::iter::once(op(a, b)).collect() | ||
| std::iter::once(op(a, b) ^ neg).collect() |
There was a problem hiding this comment.
Can we also please add a documentation comment explaining how to interpret the neg argument?
| } | ||
| None => values_ne, | ||
| }) | ||
| Ok(distinct(&v1, &v2)?.values().clone()) |
Which issue does this PR close?
Closes #960
Closes #4438
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?