ARROW-13627: [C++] Fully support ScalarAggregateOptions in (hash) any/all/sum/product/mean#10942
ARROW-13627: [C++] Fully support ScalarAggregateOptions in (hash) any/all/sum/product/mean#10942lidavidm wants to merge 6 commits intoapache:masterfrom
Conversation
|
D'oh, I forgot to check the R tests that motivated this in the first place. |
|
Blocked on ARROW-13638. |
r/src/compute.cpp
Outdated
There was a problem hiding this comment.
FWIW, this appeared unused to me.
nealrichardson
left a comment
There was a problem hiding this comment.
R changes look great; one question about the C++
There was a problem hiding this comment.
Is there a possible optimization: if options.skip_nulls, either check the bitmask up front for missings and exit early if any, or exit after the first one is found? It looks like as it is, we still go through and count/sum/etc. all non-null values always.
There was a problem hiding this comment.
Ah yes, we can short-circuit as soon as nulls_observed if we have !skip_nulls. Updated, thanks for pointing this out.
There was a problem hiding this comment.
I'm curious about this condition: if there are nulls and options.skip_nulls is false, this kernel can still return true (when this->any is true)?
There was a problem hiding this comment.
Yes, it looked weird to me as well, but that is how Kleene logic works, and you can observe this in R:
> any(c(NA), na.rm = FALSE)
[1] NA
> any(c(NA, TRUE), na.rm = FALSE)
[1] TRUE
> any(c(NA, FALSE), na.rm = FALSE)
[1] NA
> any(c(), na.rm = FALSE)
[1] FALSEThere was a problem hiding this comment.
Same question here: if the input is [false, true, null] and skip_nulls is false, then the result is false rather than null?
There was a problem hiding this comment.
Yes, and this matches base R/dplyr's behavior:
> all(c(FALSE, TRUE, NA), na.rm = FALSE)
[1] FALSE
pitrou
left a comment
There was a problem hiding this comment.
Apart from the two (probably silly) questions I asked about the any/all semantics, here are some comments.
There was a problem hiding this comment.
Why remove the non-zero min_count tests here?
There was a problem hiding this comment.
The cases as written were pointless, so I removed them. I've added some new cases instead.
There was a problem hiding this comment.
Sidenote: the Sum, Product and Mean aggregators probably have a lot of code in common. Do you think it can be factored out in some kind of mixin or base class?
Or, conversely, that the operation-specific code can be moved into a separate class on which the main aggregator implementation would be templated?
There was a problem hiding this comment.
I've made all 5 kernels use CRTP. GroupedMeanImpl is kind of iffy under this pattern but the other 4 kernels consolidate nicely.
There was a problem hiding this comment.
Can probably shortcut here:
if (!BitUtil::GetBit(seen, *g) && BitUtil::GetBit(bitmap, position)) {
BitUtil::SetBitTo(seen, *g);
}There was a problem hiding this comment.
Similarly to the remark above about Sum / Mean / Product, I also wonder if Any and All can be reconciled.
There was a problem hiding this comment.
Call this SumMeanProductKeepNulls?
| ScalarAggregateOptions(/*skip_nulls=*/false, /*min_count=*/4)); | ||
| ValidateMean<TypeParam>(json, null_result, | ||
| ScalarAggregateOptions(/*skip_nulls=*/false, /*min_count=*/15)); | ||
| ScalarAggregateOptions(/*skip_nulls=*/false, /*min_count=*/0)); |
There was a problem hiding this comment.
It seems like there should be some tests with skip_nulls=false and a non-zero min_count?
This should let R easily support na.RM = TRUE / FALSE by setting skip_nulls = false / true (respectively).