ARROW-12959: [C++][R] Option for is_null(NaN) to evaluate to true#10896
ARROW-12959: [C++][R] Option for is_null(NaN) to evaluate to true#10896Christian8491 wants to merge 73 commits intoapache:masterfrom Christian8491:ARROW-12959-Option-for-is-nullNaN-to-evaluate-to-tru
Conversation
There was a problem hiding this comment.
This can probably made much simpler, e.g.:
bool* out_value = &checked_cast<BooleanScalar*>(out)->value;
if (in.type->id() == Type::FLOAT) {
*out_value = !options.nan_is_null || !std::isnan(internal::UnboxScalar<FloatType>::Unbox(in));
}
else if (in.type->id() == Type::DOUBLE) {
*out_value = !options.nan_is_null || !std::isnan(internal::UnboxScalar<DoubleType>::Unbox(in));
}
else {
*out_value = true;
}
return Status::OK();There was a problem hiding this comment.
I would add a check for validity prior to checking float type to prevent unnecessary checks when null bit is enabled. Nit: Also, I would change logic to to use && instead of ||, feels a bit more readable (and less characters).
bool* out_value = &checked_cast<BooleanScalar*>(out)->value;
if (in.is_valid) {
switch (in.type->id()) {
case Type::FLOAT:
*out_value = options.nan_is_null && std::isnan(internal::UnboxScalar<FloatType>::Unbox(in));
case Type::DOUBLE:
*out_value = options.nan_is_null && std::isnan(internal::UnboxScalar<DoubleType>::Unbox(in));
default:
*out_value = false;
}
} else {
*out_value = true;
}There was a problem hiding this comment.
@pitrou I would be curious if it is worth it to have a specialized version for floating point types (similar to how arithmetic kernels are implemented), as the options.nan_is_null check does not applies to other data types.
There was a problem hiding this comment.
This is the version for Scalar inputs. Micro-optimizing it is futile. The version for Array inputs may be more interesting ;-)
There was a problem hiding this comment.
You can put this in the anonymous namespace above.
There was a problem hiding this comment.
It was addressed on 2e772c1
There was a problem hiding this comment.
This is comparing pointer values, which is incorrect, even though it will do what you expect most of the time.
(one could e.g. instantiate a separate instance using std::make_shared<FloatType>())
There was a problem hiding this comment.
Remove the break statement for the default case.
There was a problem hiding this comment.
Removed on 9a69e3e
There was a problem hiding this comment.
Note the check for NaN values only applies to floating-point types and when nan_is_null is true, other types/cases can use the logic as it was.
In the case of ArrayData there are 3 scenarios:
a) arr.GetNullCount() == arr.length // All data is null
b) arr.GetNullCount() == 0 // No data is null
c) arr.MayHaveNulls() == true // Some data is null
IIUC, the null bitmap of the input ArrayData is not guaranteed to be consistent with the data (an ArrayData can be malformed bc buffer values can be modified directly). Scenarios (a)-(b) invoke arr.GetNullCount() which iterates through all the arr values and update the null count.
Given that scenarios (b)-(c) are the common case and the array data has to be traversed to identify the NaN values, (as an optimization) I suggest to not check the null count at all. Nevertheless, only check for NaNs in non-null indices.
There was a problem hiding this comment.
I have pushed f404910 which contains the ArrayData validity for nulls (not sure if it's the best way to do).
There is a way to test those changes in R ? As the ticket is also related to R language.
There was a problem hiding this comment.
No need to test kernel implementation in R (or Python), first bindings have to be put in place with the desired default option. The C++ kernels are invoked via the language bindings.
There was a problem hiding this comment.
@Christian8491 when this PR is ready, the C++ code is all reviewed and approved, and the CI is all green, please let me know and I can push a commit that changes the R bindings to use this new option.
There was a problem hiding this comment.
@pitrou The is_null compute function most of the time will iterate through the bitmap of the input ArrayData and when nan_is_null option is set, it can increase the null count. Would it be a good idea to update the null count of the input array (arr.SetNullCount(...)) as a side-effect of invoking this compute function?
There was a problem hiding this comment.
Why would you update the null count of the input array? I'm not sure I understand what you're proposing.
There was a problem hiding this comment.
I see your point now, thanks! Modifying the input array does not makes sense because nan_is_null property applies only to the intermediate/output of the compute function. Please ignore these comments.
…N-to-evaluate-to-t ARROW-12959: [C++] Option for is_null(NaN) to evaluate to true
docs/source/cpp/compute.rst
Outdated
|
|
||
| * \(9) Output is true iff the corresponding input element is null. | ||
| * \(9) Output is true if the corresponding input element is null or if NaN | ||
| values are treated as null via the :struct:`NanNullOptions`. |
There was a problem hiding this comment.
Rephrase to Output is true if the corresponding input element is null. NaN values can be considered as null via ...
|
Add Python tests that make use of the arr = pa.array([1, 2, 3, None, np.nan])
arr.is_null() # expected = pa.array([False, False, False, True, False])
arr.is_null(nan_is_null=True) # expected = pa.array([False, False, False, True, True]) |
|
@ianmcook could you provide me some help for R bindings in order to use this new option. |
|
@Christian8491 could you please confirm that it is safe to set |
Yes @ianmcook as the logic demands a first check for floating type cc @edponce |
|
I pushed 9a4f39a to update the R bindings to use this option. Thanks again @Christian8491 for doing this! |
pitrou
left a comment
There was a problem hiding this comment.
Here are some more comments. I'm going to push the required changes myself.
cpp/src/arrow/compute/api_scalar.h
Outdated
| int64_t start, stop, step; | ||
| }; | ||
|
|
||
| class ARROW_EXPORT NanNullOptions : public FunctionOptions { |
There was a problem hiding this comment.
I suggest renaming this NullOptions, in case other values than NaN need to be considered null at some point.
| {"values"}); | ||
| const FunctionDoc is_null_doc( | ||
| "Return true if null, NaN values can be considered as null", | ||
| ("For each input value, emit true if the value is null. Default behavior is to emit " |
There was a problem hiding this comment.
By convention, function descriptions are explicitly line-wrapped with \n characters. I suggest doing the same here.
| CheckScalarUnary("is_null", MakeScalar(std::numeric_limits<double>::infinity()), | ||
| MakeScalar(false), &options); | ||
| CheckScalarUnary("is_null", MakeNullScalar(float64()), MakeScalar(true), &options); | ||
| } |
There was a problem hiding this comment.
I think the repetition can easily be avoided here, for example by making this a templated test.
| return Expression._call("is_valid", [self]) | ||
|
|
||
| def is_null(self): | ||
| def is_null(self, bint nan_is_null=False): |
There was a problem hiding this comment.
I would make this argument keyword-only.
python/pyarrow/array.pxi
Outdated
| return 0 | ||
|
|
||
| def is_null(self): | ||
| def is_null(self, nan_is_null=False): |
| StrftimeOptions, | ||
| DayOfWeekOptions, | ||
| NanNullOptions, | ||
| TakeOptions, |
There was a problem hiding this comment.
We should try to maintain this alphabetically-ordered (though DayOfWeekOptions already violates it :-/).
|
|
||
| TEST_F(TestFloatValidityKernels, FloatScalarIsNull) { | ||
| CheckScalarUnary("is_null", MakeScalar(42.0f), MakeScalar(false)); | ||
| CheckScalarUnary("is_null", MakeScalar(std::nanf("")), MakeScalar(false)); |
There was a problem hiding this comment.
These separate tests shouldn't be necessary, as CheckScalarUnary already tests Scalar values implicitly when Array values are given.
|
Thanks @pitrou for the feedback! Will be more carefully next time with details like |
No description provided.