ARROW-3121: [C++] Mean aggregate kernel#3708
ARROW-3121: [C++] Mean aggregate kernel#3708fsaintjacques wants to merge 7 commits intoapache:masterfrom
Conversation
7f31a91 to
91d9b63
Compare
Codecov Report
@@ Coverage Diff @@
## master #3708 +/- ##
==========================================
+ Coverage 87.65% 87.68% +0.02%
==========================================
Files 701 703 +2
Lines 85747 85824 +77
Branches 1200 1200
==========================================
+ Hits 75162 75254 +92
+ Misses 10466 10456 -10
+ Partials 119 114 -5
Continue to review full report at Codecov.
|
02a0bcf to
2d4c2ba
Compare
|
I think the nodejs failure is unrelated and this is ready to merge. |
|
@bkietz @pitrou @emkornfield This is ready to review. |
cpp/src/arrow/compute/test-util.h
Outdated
There was a problem hiding this comment.
Should assert false or abort on other datum kinds, no?
There was a problem hiding this comment.
There was a problem hiding this comment.
I'll also try to include this while implementing ARROW-3123.
cpp/src/arrow/compute/test-util.h
Outdated
There was a problem hiding this comment.
😄
I think you want typename std::enable_if<!IsFloatingPoint<Type>::value>::type, though
There was a problem hiding this comment.
ugh, shame on me! Clearly something is wrong then.
There was a problem hiding this comment.
This (and ConsumeSparse, ConsumeTiny) looks pretty handy, maybe later we could refactor it into BitUtil::VisitBitmap() or so
There was a problem hiding this comment.
Soon the complexity of future needs will outgrow the current implementation, e.g. we'll need to mix with selection vectors/bitmap, groups, etc. I intend to open a discussion in the mailing list about providing iterator facilities with many options (batch, masked, groups).
There was a problem hiding this comment.
crazy edge case: if we're summing floating point and one of the null masked values happens to be NaN then multiplying by zero will not prevent the sum from also being NaN
There was a problem hiding this comment.
Good catch, I'll see what I can do without too much change.
There was a problem hiding this comment.
What is the performance diff of doing ternary operator?
There was a problem hiding this comment.
Also would it pay to have a constant expression (or just static array) in bit utils that precomputes all the shifts similar to the popcount array below
2d4c2ba to
7929140
Compare
emkornfield
left a comment
There was a problem hiding this comment.
I don't think I saw any blocking issues mostly style stuff and documentation suggestions (please do fix the NaN issue raised).
There was a problem hiding this comment.
What is the performance diff of doing ternary operator?
|
Le 26/02/2019 à 14:27, François Saint-Jacques a écrit :
Soon the complexity of future needs will outgrow the current
implementation, e.g. we'll need to mix with selection vectors/bitmap,
groups, etc. I intend to open a discussion in the mailing list about
providing iterator facilities with many options (batch, masked, groups).
Also take a look at existing facilities (ArrayDataVisitor) in
visitor-inline.h
Regards
Antoine.
|
|
@pitrou, the initial (sum) implementation used the visitor facility. |
|
This is good to go, the failure is from python https://issues.apache.org/jira/browse/ARROW-4684 |
| Datum(std::make_shared<ScalarType>(0.0))); | ||
|
|
||
| ValidateMean<TypeParam>(&this->ctx_, "[1, 1, 1, 1, 1, 1, 1, 1]", | ||
| Datum(std::make_shared<ScalarType>(1.0))); |
There was a problem hiding this comment.
For self-documenting code you could do:
auto null = std::make_shared<ScalarType>(0.0, false);
Also keep in mind that Datum has implicit ctors, so these Datum(...) are unneeded
| if (data_type == nullptr) | ||
| return Status::Invalid("Datum must be array-like"); | ||
| else if (!is_integer(data_type->id()) && !is_floating(data_type->id())) | ||
| return Status::Invalid("Datum must contain a NumericType"); |
Implements the mean (average) kernel aggregates on numeric columns. The final type is always a double. Refactored the Sum kernel implementation to share common parts, notably the consume part is identical. Only the Finalize and output type differ.