ARROW-6784: [C++][R] Move filter and take for ChunkedArray, RecordBatch, and Table from Rcpp to C++ library #5686
nealrichardson wants to merge 18 commits into apache:master from
Conversation
bkietz
left a comment
This is a good start. I think you can reduce the number of public-facing overloads in favor of the Datum overloads.
Instead of using Concatenate here, I think it'd be better to use std::vector<ArrayVector> RechunkArraysConsistently(const std::vector<ArrayVector>&); (defined in array.h). After that the chunks will be of equal length, suitable for filtering.
Thanks, I'll look into that.
This can be addressed later, but: there's an unfortunate missed optimization here. Since we're reusing the same filter for each column, we don't need to recount the set bits in each chunk of the filter for every column (see the RecordBatch overload, which takes advantage of this optimization).
Another missed (deferred) optimization would be to parallelize the filtering across the columns rather than iterating in serial.
I'll take a look at the one you reference and see if I can translate it for tables; otherwise I'll make follow-up JIRAs.
cpp/src/arrow/compute/kernels/take.h
Outdated
I don't think it's necessary to pre-emptively add every possible permutation here. I added the RecordBatch overload because I specifically planned to use it in arrow::dataset::. Although it saves a few lines of code which would otherwise be necessary to box/unbox from compute::Datum, I think most consumers should rely on the Datum overload.
It's not preemptive, there are different implementations for each signature. Maybe you can show me what you mean; I haven't worked with Datum objects.
Instead of concatenating here, you could flatten the chunks from current_chunk directly into new_chunks:

  ArrayVector new_chunks;
  for (const auto& indices_chunk : indices.chunks()) {
    std::shared_ptr<ChunkedArray> taken;
    RETURN_NOT_OK(Take(ctx, values, *indices_chunk, options, &taken));
    std::move(taken->chunks().begin(), taken->chunks().end(),
              std::back_inserter(new_chunks));
  }

(of course it's a moot point until Take(ChunkedArray values, Array indices) can also avoid concatenation)
The point of this concatenate was that the resulting chunks should correspond to the chunks defined by indices, as @jorisvandenbossche suggested. This, of course, gets us back to the original discussion of what chunks are for, whether they are purely an internal implementation detail or something that users should govern, what "optimal" chunking is, etc.
I think optimal chunking for output from a kernel is whatever allows the consumer greatest control over allocation and other performance overhead. Based on this: concatenation should be kept to a minimum since that generates new allocations instead of cheaply slicing existing ones. As a secondary consideration, the chunked array should have as few chunks as possible since large contiguous chunks can be processed more efficiently than lots of short chunks. Based on that: an output chunked array should not contain empty chunks.
One other thing: it's not necessary to use the
nealrichardson
left a comment
Thanks @bkietz!
Force-pushed 683b73d to ba72253
@bkietz I've deferred the remaining refactoring to these follow-up issues: ARROW-6959, ARROW-7009, ARROW-7012
bkietz
left a comment
There was a problem hiding this comment.
This looks good. You've got a few flaky CI failures and a conversion warning:
https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/28439412/job/u8d468vtqmqn62wo#L994
Please fix those declarations, then I think this is ready to land.
Force-pushed 6c17594 to 6b62e62
CI is green except for the macOS Travis job that's broken on master.
Reviewing this now.
wesm
left a comment
Some minor comments. Let me know if you want to make more changes, but at minimum I think there's some follow-up refactoring to do re: the Datum-based APIs.
We should create a function that accepts a lambda for chunked evaluation so this logic can be reused in other places. This does not have to happen in this patch.
Same with this logic. I'm not sure about the concatenation part; it seems like you would want to split larger chunks into smaller pieces, yielding an output that has more chunks than the input.
e.g.
array chunks [10, 10, 10, 3]
filter chunks [5, 5, 5, 15, 3]
output chunks [5, 5, 5, 5, 10, 3]
nit: not fond of condensed variable names like schm
If I don't do something like this, I collide with the function named schema:
/Users/enpiar/Documents/ursa/arrow/cpp/src/arrow/compute/kernels/filter_test.cc:492:17: error: variable 'schema' declared with deduced type 'auto' cannot appear in its own initializer
auto schema = schema(fields);
^
This is wasteful because values is going to be concatenated over and over for each chunk in indices. Can you add a note here that this is suboptimal and open a follow-up JIRA?
I'll add the note. I think this falls under https://issues.apache.org/jira/browse/ARROW-7012.
cpp/src/arrow/compute/kernels/take.h
Outdated
Frankly I would prefer a Take(FunctionContext*, const Datum&, const Datum&, ...) based API to this combinatorial explosion of functions. If this is not done in this PR, can you mark these functions as experimental so that we can change things without the need for a deprecation cycle?
Will mark as experimental. See also https://issues.apache.org/jira/browse/ARROW-6959 about Datum policy.
Per the comments on Take below, I think that using Datum and having fewer public APIs would be better. There are implicit ctors for Datum to make the usage easier.
I briefly looked into refactoring to use Datum but it didn't seem like a good use of my time right now to figure it out. https://issues.apache.org/jira/browse/ARROW-7009 is for someone else to pick that up.
Force-pushed 748140a to 9f37a69
…thout concatenation
Co-Authored-By: Benjamin Kietzman <[email protected]>
Force-pushed 9f37a69 to 21dbd26
Rebased, will merge when green unless there's objection.
Codecov Report
@@ Coverage Diff @@
## master #5686 +/- ##
==========================================
+ Coverage 88.99% 89.56% +0.56%
==========================================
Files 1006 814 -192
Lines 137246 121983 -15263
Branches 1501 0 -1501
==========================================
- Hits 122142 109252 -12890
+ Misses 14739 12731 -2008
+ Partials 365 0 -365
Continue to review full report at Codecov.