Skip to content

Add distinct kernels (#960) (#4438)#4716

Merged
tustvold merged 7 commits intoapache:masterfrom
tustvold:distinct-experiments
Aug 21, 2023
Merged

Add distinct kernels (#960) (#4438)#4716
tustvold merged 7 commits intoapache:masterfrom
tustvold:distinct-experiments

Conversation

@tustvold
Copy link
Contributor

@tustvold tustvold commented Aug 18, 2023

Which issue does this PR close?

Closes #960
Closes #4438

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Aug 18, 2023
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is technically stricter than the previous logic, as this hasn't been released yet I think this is fine

@tustvold tustvold force-pushed the distinct-experiments branch from 2c13ff3 to a2c1cfe Compare August 18, 2023 17:35
@tustvold tustvold marked this pull request as ready for review August 18, 2023 19:18
@tustvold
Copy link
Contributor Author

I've integrated this into apache/datafusion#7282 for additional test verification

let a = l.value(l_s);
let b = r.value(r_s);
std::iter::once(op(a, b)).collect()
std::iter::once(op(a, b) ^ neg).collect()
Copy link
Contributor Author

@tustvold tustvold Aug 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a pre-existing bug, that resulted in incorrect results when comparing scalars

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also please add a documentation comment explaining how to interpret the neg argument?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me -- thank you @tustvold

I went over the logic and tests carefully. There are a few comments I suggested that would add clarity to what IS DISTINCT means but I don't think they are required

}

impl From<BooleanBuffer> for BooleanArray {
fn from(values: BooleanBuffer) -> Self {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a nice UX improvement -- I had been looking for that when working on group by in DataFusion

compare_op(Op::GreaterEqual, lhs, rhs)
}

/// Perform `left IS DISTINCT FROM right` operation on two [`Datum`]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Perform `left IS DISTINCT FROM right` operation on two [`Datum`]
/// Perform `left IS DISTINCT FROM right` operation on two [`Datum`]. `IS DISTINCT`
/// similar to `NotEq`, differing in null handling. Two operands are considered DISTINCT
/// if they have a different value or if one of them is NULL and the other isn't.
/// The result of `IS DISTINCT FROM` is never NULL.

compare_op(Op::Distinct, lhs, rhs)
}

/// Perform `left IS NOT DISTINCT FROM right` operation on two [`Datum`]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Perform `left IS NOT DISTINCT FROM right` operation on two [`Datum`]
/// Perform `left IS NOT DISTINCT FROM right` operation on two [`Datum`]. `IS NOT DISTINCT`
/// similar to `Eq`, differing in null handling. Two operands are considered NOT DISTINCT
/// if they have the same value or if both of them are NULL.
/// The result of `IS NOT DISTINCT FROM` is never NULL.


let c = |((l, r), n)| ((l ^ r) | (l & r & n));
let buffer = l.zip(r).zip(ne).map(c).collect();
BooleanBuffer::new(buffer, 0, len).into()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯 for no null buffer -- not sure if that is worth calling out in a comment or not

Comment on lines +271 to +274
let l = nulls.inner().bit_chunks().iter_padded();
let ne = values.bit_chunks().iter_padded();
let c = |(l, n)| u64::not(l) | n;
let buffer = l.zip(ne).map(c).collect();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me a while to understand why this clause didn't mirror the Op::NotDistinct clause

NotDistinct can simply use https://docs.rs/arrow/latest/arrow/buffer/struct.BooleanBuffer.html#impl-BitOr%3C%26BooleanBuffer%3E-for-%26BooleanBuffer

but It seems it is because there is no equivalent of https://doc.rust-lang.org/nightly/core/ops/trait.Neg.html for BooleanBuffer and even if there were that would result in allocating a temporary buffer that might wasteful

Though now that I write this I wonder if performance could be improved here by deferring / reusing the buffer allocated by values(). Maybe as some future optimization.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BooleanBuffer does implement https://doc.rust-lang.org/std/ops/trait.Not.html

However, as you surmised, computing the mask as is done here is marginally better for performance.

Deferring the buffer allocated by values would result in non-trivial additional codegen, and would likely confuse LLVMs already temperamental vectorisation,

Reusing the buffer is potentially worth exploring. I suspect that in most cases the performance gain would be marginal at best, but I haven't profiled this

let a = l.value(l_s);
let b = r.value(r_s);
std::iter::once(op(a, b)).collect()
std::iter::once(op(a, b) ^ neg).collect()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also please add a documentation comment explaining how to interpret the neg argument?

}
None => values_ne,
})
Ok(distinct(&v1, &v2)?.values().clone())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice

@tustvold tustvold merged commit bce0b41 into apache:master Aug 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Equality kernel where null==null gives true Consider adding is_distinct_from kernels

2 participants