ARROW-1569: [C++] Kernel functions for determining monotonicity (ascending or descending) for well-ordered types by mbrobbel · Pull Request #11937 · apache/arrow

mbrobbel · 2021-12-13T13:29:19Z

Initially I tried to implement this as a ScalarAggregateFunction (as suggested in the issue), however given that there is no way to express sensitivity to order, it's currently not possible to correctly implement the ScalarAggregator::MergeFrom function. This is now implemented as a VectorFunction.

I'm still working on supporting more types:

Todo:

Documentation

I'll create follow-up JIRA issues to:

Add support for half floats, missing temporal types (INTERVAL_DAY_TIME and INTERVAL_MONTH_DAY_NANO) and decimal types.
Add support for String arrays (via lexicographical order):

Input: utf8
[
  "a",
  "b",
  "c"
]

Output: struct<increasing:boolean, strictly_increasing:boolean, decreasing:boolean, strictly_decreasing:boolean>
{increasing: true, strictly_increasing: true, decreasing: false, strictly_decreasing: false}

Implement this function for list arrays with well-ordered element types:

Input: list<uint8>
[
  [1, 2, 3],
  [3, 3, 2],
  [1, 2, 1, 2]
]

Output: list<struct<increasing:boolean, strictly_increasing:boolean, decreasing:boolean, strictly_decreasing:boolean>>
[
  {increasing: true, strictly_increasing: true, decreasing: false, strictly_decreasing: false},
  {increasing: false, strictly_increasing: false, decreasing: true, strictly_decreasing: false},
  {increasing: false, strictly_increasing: false, decreasing: false, strictly_decreasing: false}
]
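
For illustration, the four flags in the outputs above can be derived in a single pass over adjacent pairs. This is a minimal sketch on a plain std::vector, not the kernel from this PR: the names Monotonicity and CheckMonotonic are hypothetical, nulls and NaNs are assumed absent, and reporting all four flags as true for inputs with fewer than two elements is a choice of the sketch, not something the PR specifies.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical result type mirroring the four fields of the struct output.
struct Monotonicity {
  bool increasing = true;
  bool strictly_increasing = true;
  bool decreasing = true;
  bool strictly_decreasing = true;
};

// Single pass over adjacent pairs; each pair can only clear flags.
// Assumes T is totally ordered by operator< (no nulls, no NaNs).
template <typename T>
Monotonicity CheckMonotonic(const std::vector<T>& values) {
  Monotonicity m;
  for (std::size_t i = 1; i < values.size(); ++i) {
    const T& prev = values[i - 1];
    const T& next = values[i];
    if (prev < next) {
      // Rising step: rules out both decreasing variants.
      m.decreasing = false;
      m.strictly_decreasing = false;
    } else if (next < prev) {
      // Falling step: rules out both increasing variants.
      m.increasing = false;
      m.strictly_increasing = false;
    } else {
      // Equal neighbours: only the strict variants are ruled out.
      m.strictly_increasing = false;
      m.strictly_decreasing = false;
    }
  }
  return m;
}
```

Because operator< on std::string compares lexicographically, the same template would also cover the utf8 example.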

cc @bkietz

github-actions · 2021-12-13T13:40:08Z

https://issues.apache.org/jira/browse/ARROW-1569

bkietz

Just a few comments for now

cpp/src/arrow/compute/api_vector.h

Co-authored-by: Benjamin Kietzman <[email protected]>

edponce · 2021-12-16T21:29:49Z

cpp/src/arrow/compute/kernels/vector_is_monotonic.cc

+        // Approximately equal within some error bound (epsilon).
+        (options.floating_approximate &&
+         (fabs(current - next) <=
+          static_cast<typename DataType::c_type>(options.epsilon))) ||


Support for floating-point comparisons already exists; maybe you can reuse it here.
Also, this check does not consider the sign bit for special cases such as signed zeros and signed Inf.
Ex. { -0.0, 0.0 } != { 0.0, -0.0 }.

@pitrou Should we consider signed zero/Inf here? Not sure if the sorting function does. In any case, consistency is desired and can be resolved in a follow-up JIRA.

Sorting doesn't, AFAIR. Signed zeros are considered equal, I'm not sure there's any particular reason to deviate from that (what are the use cases for this kernel?).
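
To make the two behaviours in this thread concrete, here is a small sketch in standard C++. The helper names ApproxEqual and SameValueAndSign are hypothetical and are not Arrow functions. The first helper treats values within epsilon as equal (so signed zeros compare equal, as in sorting); the second additionally distinguishes the sign bit.

```cpp
#include <cmath>

// Hypothetical helper: true when a and b are equal or differ by at most epsilon.
// The a == b shortcut matters for infinities: Inf - Inf is NaN, so the
// subtraction-based test alone would report Inf as unequal to itself.
// Any NaN operand yields false because every comparison with NaN is false.
inline bool ApproxEqual(double a, double b, double epsilon) {
  return a == b || std::fabs(a - b) <= epsilon;
}

// Hypothetical helper: like ==, but also requires matching sign bits,
// so -0.0 and 0.0 are reported as different.
inline bool SameValueAndSign(double a, double b) {
  return a == b && std::signbit(a) == std::signbit(b);
}
```

With these definitions, ApproxEqual(-0.0, 0.0, 0.0) holds while SameValueAndSign(-0.0, 0.0) does not, which is the distinction raised above.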

edponce · 2021-12-16T21:43:01Z

cpp/src/arrow/compute/kernels/vector_is_monotonic.cc

+template <typename DataType>
+enable_if_not_floating_point<DataType, bool> isnan(
+    const util::optional<typename DataType::c_type>& opt) {
+  return false;


Ideally, isnan() should only be used in floating-point-enabled functions, so maybe you can function overload (using enable-if magic) the code blocks that use isnan() in a more generic manner.

My comment below suggests a change that would allow you to get rid of the enable_if_not_floating_point<...> isnan() variant.
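
One way to read this suggestion, sketched with standard-library traits standing in for Arrow's enable_if helpers: overload the whole NaN scan instead of isnan itself, so that non-floating-point types never call isnan at all. The name AnyNan is hypothetical and std::vector stands in for an Arrow array.

```cpp
#include <algorithm>
#include <cmath>
#include <type_traits>
#include <vector>

// Floating-point element types: scan for NaN.
// A lambda is used because std::isnan is overloaded and cannot be passed
// to std::any_of by name without disambiguation.
template <typename T>
std::enable_if_t<std::is_floating_point<T>::value, bool> AnyNan(
    const std::vector<T>& values) {
  return std::any_of(values.begin(), values.end(),
                     [](T v) { return std::isnan(v); });
}

// All other element types cannot hold NaN, so no scan is performed.
template <typename T>
std::enable_if_t<!std::is_floating_point<T>::value, bool> AnyNan(
    const std::vector<T>&) {
  return false;
}
```

Callers can then write AnyNan(values) for any element type, and the non-floating-point overload compiles down to a constant.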

edponce · 2021-12-16T21:45:31Z

cpp/src/arrow/compute/kernels/vector_is_monotonic.cc

+  auto options = IsMonotonicState::Get(ctx);
+
+  // Check batch size
+  if (batch.values.size() != 1) {


AFAIK, the number of arguments to a function is validated in the compute layer mechanism. When the IsMonotonic function is registered, it specifies a single input argument, so this should already be guaranteed.

Direct invocation of this kernel is not possible through the public API; however, internally this function could be invoked directly and skip those checks. @bkietz what do you suggest?

IMHO, this check is trivial, it doesn't hurt having it.

cpp/src/arrow/compute/kernels/vector_is_monotonic.cc

edponce · 2021-12-16T21:59:44Z

cpp/src/arrow/compute/kernels/vector_is_monotonic.cc

+  // Return early if there are NaNs, zero elements or one element in the array.
+  // And return early if there are only nulls.
+  if (array.length() <= 1 || array.null_count() == array.length()) {
+    if (std::any_of(array.begin(), array.end(), isnan<DataType>)) {


Everything in this function seems general enough to handle most data types, except for isnan. Maybe specialize this code block.

What would that look like?

After more careful thought, what we want is for only floating-point types to do the std::any_of() check. You can use TypeTraits for those that have a c_type defined to get a type_id variable which can be checked at runtime.

if (array.length() <= 1 || array.null_count() == array.length()) {
  auto type_id = TypeTraits<DataType>::type_singleton();
  if (!is_floating(type_id) || std::any_of(array.begin(), array.end(), std::isnan)) {
    return IsMonotonicOutput(false, false, false, false, out);
  } else {
    ...
  }
}

is_floating(type_id) is defined here.

P.S. I did not run this code, but something along these lines should work.

edponce · 2021-12-16T22:23:48Z

cpp/src/arrow/compute/api_vector.h

+    /// Use max value of element type as the value of nulls.
+    /// Inf for floating point numbers.
+    USE_MAX_VALUE
+  };


Ordering of nulls and NaNs was also discussed for the sorting function. I would expect IsMonotonic and the sorting function to be consistent. That is, if a sort operation is performed first, then the corresponding IsMonotonic should result in true.

Yes, I agree that a sort before invoking this kernel should result in true for the corresponding check. However, I feel the null handling variants are a bit confusing: AtStart defines NaN > null and AtEnd defines NaN < null. Also, the sorting kernel can ignore equality, but this kernel considers it to check if values are unique (strictly increasing/decreasing).

I think if we want to allow users to define order of unordered values (both for sorting and this kernel) we need something like this:

bool compare_nulls = false;  // default: any null results in false outputs (or error in case of sort)
bool compare_nans = false;   // default: any nan results in false outputs (or error in case of sort)

// these are not needed when sorting
bool nulls_equal = false;  // when nulls are compared, are they considered equal?
bool nans_equal = false;   // when nans are compared, are they considered equal?

// when both nulls and nans are compared
enum Ordering { Less, Equal, Greater };
Ordering nan_compared_with_null;  // when comparing nulls and nans, what ordering should be used?

cpp/src/arrow/compute/kernels/vector_is_monotonic.cc

edponce · 2021-12-16T22:41:45Z

Some general comments:

IsMonotonic needs to be consistent with corresponding sorting functions, such that IsMonotonic(input) == (Sort(input) == input).
Currently IsMonotonic outputs a struct describing the monotonic properties of the data. What are your thoughts on having a convenience wrapper function that receives FunctionOptions with a single requested monotonic behavior? For example, IsMonotonic(input, MonotonicOptions.StrictlyIncreasing). This would provide a more readable API for client code and bypass the need to unpack the StructScalar output to check the corresponding monotonic behavior of interest.
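
The consistency requirement in the first comment can be sketched on a plain std::vector. This is only an illustration with hypothetical function names, assuming no nulls or NaNs (std::sort requires a strict weak ordering, which NaN breaks): the sort-based definition and a direct scan should always agree for the non-strict increasing case.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical reference definition: the input is (non-strictly) increasing
// exactly when an ascending sort leaves it unchanged.
template <typename T>
bool IncreasingBySort(std::vector<T> values) {  // taken by value: we sort a copy
  const std::vector<T> original = values;
  std::sort(values.begin(), values.end());
  return values == original;
}

// Direct linear scan; expected to agree with IncreasingBySort on every input.
template <typename T>
bool IncreasingByScan(const std::vector<T>& values) {
  return std::is_sorted(values.begin(), values.end());
}
```

Note that this invariant only covers the non-strict flag; a sorted input with repeated values is increasing but not strictly increasing.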

mbrobbel · 2021-12-17T12:24:34Z

* `IsMonotonic` needs to be consistent with corresponding [sorting functions](https://arrow.apache.org/docs/cpp/compute.html#sorts-and-partitions), such that `IsMonotonic(input) == (Sort(input) == input)`.

I agree.

* Currently `IsMonotonic` outputs a struct describing the monotonic properties of the data. What are your thoughts on having a convenience wrapper function that receives `FunctionOptions` with a single requested monotonic behavior? For example, `IsMonotonic(input, MonotonicOptions.StrictlyIncreasing)`. This would provide a more readable API for client code and bypass the need to unpack the `StructScalar` output to check the corresponding monotonic behavior of interest.

I initially set it up like that, but @bkietz suggested outputting a struct scalar instead (like the min/max kernel).

edponce · 2021-12-17T18:26:29Z

cpp/src/arrow/compute/api_vector.h

+ public:
+  enum NullHandling {
+    /// Ignore nulls.
+    IGNORE_NULLS,


Based on the other enum names, use only IGNORE, since enum NullHandling already specifies this is for nulls.

I had that initially, but IGNORE caused compilation issues on Windows.

Well, nevermind.

edponce · 2021-12-20T15:00:39Z

cpp/src/arrow/compute/kernels/vector_is_monotonic.cc

+
+  // Safety:
+  // - Made sure that the input datum is an array.
+  const std::shared_ptr<ArrayData>& array_data = input.array();


You can use auto array_data = input.array().

edponce · 2021-12-20T15:22:41Z

cpp/src/arrow/compute/kernels/vector_is_monotonic.cc

+
+template <typename DataType>
+enable_if_not_floating_point<DataType> IsMonotonicCheck(
+    const typename DataType::c_type& current, const typename DataType::c_type& next,


After reviewing this PR, I think implementations of IsMonotonicCheck can be categorized as follows based on DataType:

* have c_type (e.g., primitive numeric, datetime, timestamp)
  * c_type is floating-point
  * c_type is not floating-point
* do not have c_type (binary, string, intervals)
  * custom implementation for binary and string
  * custom implementation for intervals

There would be at least 4 type-specific implementations of IsMonotonicCheck, where the enable_if for this case would be of the form (pseudocode) enable_if_has_c_type and enable_if_not_floating_point.
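
The categorization above can be sketched as an overload set. This is an assumption-laden illustration, not Arrow code: standard-library traits stand in for Arrow's enable_if helpers, double, int and std::string stand in for the Arrow type categories, the name IsOrderedPair is hypothetical, and interval types are left out.

```cpp
#include <cmath>
#include <string>
#include <type_traits>

// Has a C type, floating-point: a NaN on either side means "not ordered".
// The explicit NaN test documents intent; <= alone is already false for NaN.
template <typename T>
std::enable_if_t<std::is_floating_point<T>::value, bool> IsOrderedPair(
    T current, T next) {
  return !std::isnan(current) && !std::isnan(next) && current <= next;
}

// Has a C type, not floating-point: plain comparison is enough.
template <typename T>
std::enable_if_t<std::is_integral<T>::value, bool> IsOrderedPair(T current,
                                                                 T next) {
  return current <= next;
}

// No C type (string-like): custom implementation using byte-wise
// lexicographical comparison.
inline bool IsOrderedPair(const std::string& current, const std::string& next) {
  return current.compare(next) <= 0;
}
```

Each category gets exactly one viable overload, so adding a new category (for example intervals) means adding one more overload rather than touching the existing ones.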

edponce · 2021-12-20T15:25:00Z

cpp/src/arrow/compute/kernels/vector_is_monotonic.cc

+  return Status::OK();
+}
+
+template <typename DataType>


Nit: ArrowType would be more appropriate than DataType.

edponce · 2021-12-20T15:29:08Z

cpp/src/arrow/compute/kernels/vector_is_monotonic.cc

+}
+
+template <typename DataType>
+Status IsMonotonic(KernelContext* ctx, const ExecBatch& batch, Datum* out) {


Since this version of IsMonotonic requires DataType having a c_type, this needs to be guarded with enable_if_has_c_type. You would need to make other versions for binary/string and interval types.

edponce · 2021-12-20T15:54:05Z

cpp/src/arrow/compute/kernels/vector_is_monotonic.cc

+  // Return early if there are NaNs, zero elements or one element in the array.
+  // And return early if there are only nulls.
+  if (array.length() <= 1 || array.null_count() == array.length()) {
+    if (std::any_of(array.begin(), array.end(), isnan<DataType>)) {


After more careful thought, what we want is for only floating-point types to do the std::any_of() check. You can use TypeTraits for those that have a c_type defined to get a type_id variable which can be checked at runtime.

if (array.length() <= 1 || array.null_count() == array.length()) {
  auto type_id = TypeTraits<DataType>::type_singleton();
  if (!is_floating(type_id) || std::any_of(array.begin(), array.end(), std::isnan)) {
    return IsMonotonicOutput(false, false, false, false, out);
  } else {
    ...
  }
}

is_floating(type_id) is defined here.

P.S. I did not run this code, but something along these lines should work.

amol- · 2023-03-30T17:19:12Z

Closing because it has been untouched for a while. In case it's still relevant, feel free to reopen and move it forward 👍

mbrobbel added 3 commits December 9, 2021 10:31

Add IsMonotonic vector function

23380f6

Return StructScalar from IsMonotonic function

52ddd12

Fix null handling and avoid object slicing

a0ae495

github-actions bot added the Component: C++ label Dec 13, 2021

bkietz self-requested a review December 13, 2021 19:56

bkietz requested changes Dec 13, 2021

mbrobbel and others added 2 commits December 14, 2021 10:18

Update comment to allow doxygen to detect docstrings

dd66eb3

Co-authored-by: Benjamin Kietzman <[email protected]>

Modify function options to handle floating point numbers

347e102

mbrobbel force-pushed the arrow-1569 branch from ca0d0eb to ffb49a2 Compare December 15, 2021 10:07

Add tests for floating point numbers

841829c

mbrobbel force-pushed the arrow-1569 branch from ffb49a2 to 841829c Compare December 15, 2021 12:08

Rename NullHandling variants

275dfe8

mbrobbel force-pushed the arrow-1569 branch from 78b65ca to 275dfe8 Compare December 15, 2021 13:03

mbrobbel requested a review from bkietz December 15, 2021 14:15

Add tests for supported temporal types

6816819

mbrobbel marked this pull request as ready for review December 16, 2021 09:37

edponce suggested changes Dec 16, 2021

edponce reviewed Dec 16, 2021

Fix early exit comment and improve null handling

4284fda

edponce reviewed Dec 17, 2021

edponce suggested changes Dec 20, 2021

asfimport mentioned this pull request Jul 12, 2022

[C++] Kernel functions for determining monotonicity (ascending or descending) for well-ordered types #17582

Closed

amol- closed this Mar 30, 2023
