ARROW-1569: [C++] Kernel functions for determining monotonicity (ascending or descending) for well-ordered types#11937
ARROW-1569: [C++] Kernel functions for determining monotonicity (ascending or descending) for well-ordered types#11937mbrobbel wants to merge 9 commits intoapache:mainfrom
Conversation
Co-authored-by: Benjamin Kietzman <[email protected]>
| // Approximately equal within some error bound (epsilon). | ||
| (options.floating_approximate && | ||
| (fabs(current - next) <= | ||
| static_cast<typename DataType::c_type>(options.epsilon))) || |
There was a problem hiding this comment.
There exists support for floating-point comparisons, maybe you can reuse this here.
Also, this check is not considering sign bit for special cases such as: signed zeros and signed Inf.
Ex. { -0.0, 0.0 } != { 0.0, -0.0 }.
There was a problem hiding this comment.
@pitrou Should we consider signed zero/Inf here? Not sure if sorting function does it. In any case, consistency is desired and can be resolved in a follow-up JIRA.
There was a problem hiding this comment.
Sorting doesn't, AFAIR. Signed zeros are considered equal, I'm not sure there's any particular reason to deviate from that (what are the use cases for this kernel?).
| template <typename DataType> | ||
| enable_if_not_floating_point<DataType, bool> isnan( | ||
| const util::optional<typename DataType::c_type>& opt) { | ||
| return false; |
There was a problem hiding this comment.
Ideally, isnan() should only be used in floating-point-enabled functions, so maybe you can function overload (using enable-if magic) the code blocks that use isnan() in a more generic manner.
There was a problem hiding this comment.
My comment below suggests a change that would allow you to get rid of the enable_if_not_floating_point<...> isnan() variant.
| auto options = IsMonotonicState::Get(ctx); | ||
|
|
||
| // Check batch size | ||
| if (batch.values.size() != 1) { |
There was a problem hiding this comment.
AFAIK, the number of arguments to a function are validated in the compute layer mechanism. When the IsMonotonic function is registered, it specifies a single input argument, so this should already be guaranteed.
There was a problem hiding this comment.
Directly invocation of this kernel is not possible through the public API, however internally this function could be invoked directly and skip those checks. @bkietz what do you suggest?
There was a problem hiding this comment.
IMHO, this check is trivial, it doesn't hurt having it.
| // Return early if there are NaNs, zero elements or one element in the array. | ||
| // And return early if there are only nulls. | ||
| if (array.length() <= 1 || array.null_count() == array.length()) { | ||
| if (std::any_of(array.begin(), array.end(), isnan<DataType>)) { |
There was a problem hiding this comment.
Everything in this functions seems general enough to handle most data types, except for isnan. Maybe specialize this code block.
There was a problem hiding this comment.
What would that look like?
There was a problem hiding this comment.
After more careful thought, what we want is for only floating-point types to do the std::any() check. You can use TypeTraits for those that have a c_type defined to get a type_id variable which can be checked during runtime.
if (array.length() <= 1 || array.null_count() == array.length()) {
auto type_id = TypeTraits<DataType>::type_singleton();
if (!is_floating(type_id) || std::any_of(array.begin(), array.end(), std::isnan)) {
return IsMonotonicOutput(false, false, false, false, out);
} else {
...
}
}is_floating(type_id) is defined here.
P.S. I did not ran this code, but something along these lines should work.
| /// Use max value of element type as the value of nulls. | ||
| /// Inf for floating point numbers. | ||
| USE_MAX_VALUE | ||
| }; |
There was a problem hiding this comment.
Ordering of nulls and NaNs were also discussed in the sorting function. I would expect that IsMonotonic and sorting function are consistent. That is, if a sort operation is performed first, then the corresponding IsMonotonic should result in true.
There was a problem hiding this comment.
Yes, I agree that a sort before invoking this kernel should result in true for the corresponding check. However I feel the null handling variants are a bit confusing: AtStart defines NaN > null and AtEnd defines NaN < null. Also, the sorting kernel can ignore equality, but this kernels considers it to check if values are unique (strictly increasing/decreasing).
I think if we want to allow users to define order of unordered values (both for sorting and this kernel) we need something like this:
bool compare_nulls = false; // default: any null results in false outputs (or error in case of sort)
bool compare_nans = false; // default: any nan results in false outputs (or error in case of sort)
// these are not needed when sorting
bool nulls_equal = false; // when nulls are compared, are they considered equal?
bool nans_equal = false; // when nans are compared, are they considered equal?
// when both nulls and nans are compared
enum Ordering { Less, Equal, Greater }
Ordering nan_compared_with_null; // when comparing nulls and nans, what ordering should be used?|
Some general comments:
|
I agree.
I initially set it up like that but @bkietz suggested to output a struct scalar instead (like the min/max kernel). |
| public: | ||
| enum NullHandling { | ||
| /// Ignore nulls. | ||
| IGNORE_NULLS, |
There was a problem hiding this comment.
Based on the other enum names, use only IGNORE, since enum NullHandling already specifies this is for nulls.
There was a problem hiding this comment.
I had that initially, but IGNORE caused compilation issues on Windows.
|
|
||
| // Safety: | ||
| // - Made sure that the input datum is an array. | ||
| const std::shared_ptr<ArrayData>& array_data = input.array(); |
There was a problem hiding this comment.
You can use auto array_data = input.array().
|
|
||
| template <typename DataType> | ||
| enable_if_not_floating_point<DataType> IsMonotonicCheck( | ||
| const typename DataType::c_type& current, const typename DataType::c_type& next, |
There was a problem hiding this comment.
After reviewing this PR, I think implementations of IsMonotonicCheck can be categorized as follows based on DataType:
- have
c_type(e.g., primitive numeric, datetime, timestamp)c_typeis floating-pointc_typeis not floating-point
- do not have
c_type(binary, string, intervals)- custom implementation for
binaryandstring - custom implementation for intervals
- custom implementation for
There would be at least 4 type-specific implementations of IsMonotonicCheck, where the
the enable_if for this case would be of the form (pseudocode) enable_if_has_c_type and enable_if_not_floating_point
| return Status::OK(); | ||
| } | ||
|
|
||
| template <typename DataType> |
There was a problem hiding this comment.
Nit: ArrowType would be more appropriate than DataType.
| } | ||
|
|
||
| template <typename DataType> | ||
| Status IsMonotonic(KernelContext* ctx, const ExecBatch& batch, Datum* out) { |
There was a problem hiding this comment.
Since this version of IsMonotonic requires DataType having a c_type, this needs to be guarded with enable_if_has_c_type. You would need to make other versions for binary/string and interval types.
| // Return early if there are NaNs, zero elements or one element in the array. | ||
| // And return early if there are only nulls. | ||
| if (array.length() <= 1 || array.null_count() == array.length()) { | ||
| if (std::any_of(array.begin(), array.end(), isnan<DataType>)) { |
There was a problem hiding this comment.
After more careful thought, what we want is for only floating-point types to do the std::any() check. You can use TypeTraits for those that have a c_type defined to get a type_id variable which can be checked during runtime.
if (array.length() <= 1 || array.null_count() == array.length()) {
auto type_id = TypeTraits<DataType>::type_singleton();
if (!is_floating(type_id) || std::any_of(array.begin(), array.end(), std::isnan)) {
return IsMonotonicOutput(false, false, false, false, out);
} else {
...
}
}is_floating(type_id) is defined here.
P.S. I did not ran this code, but something along these lines should work.
|
Closing because it has been untouched for a while, in case it's still relevant feel free to reopen and move it forward 👍 |
Initially I tried to implement this as a
ScalarAggregateFunction(as suggested in the issue), however given that there is no way to express sensitivity to order, it's currently not possible to correctly implement theScalarAggregator::MergeFromfunction. This is now implemented as aVectorFunction.I'm still working on supporting more types:
Decimal typesTodo:
I'll create follow-up JIRA issues to:
INTERVAL_DAY_TIMEandINTERVAL_MONTH_DAY_NANO) and decimal types.cc @bkietz