ARROW-12712: [C++] String repeat kernel#11023
ARROW-12712: [C++] String repeat kernel#11023edponce wants to merge 84 commits intoapache:masterfrom
Conversation
|
Related to a replicate operation, there was a previous discussion in Zulip chat of having a general replicate functionality where string repeat is a particular case. Arrow already has |
|
You may instead be interested in two things I added recently: ArrayBuilder::AppendScalar(const Scalar&, int64_t) and ArrayBuilder::AppendArraySlice. This would let you implement a generalized repeat without allocating and concatenating lots of intermediate arrays, and would let you preallocate the final array as well. |
|
@lidavidm Those Also, note that the current |
|
Sure, I'm talking about more general repeat methods, though I guess now I question what you might want to repeat other than binary-like types and I suppose lists. |
|
The current str_repeat(['a', 'b', 'c'], repeats=[1,2,3]) # ['a', 'bb', 'ccc']Possible solution: Override the |
To me, this means that the kernel is simply a binary kernel. |
|
I agree that this is a binary kernel because the number of repeats is required. |
6942fa8 to
23d7516
Compare
|
This PR depends on #11082 (ARROW-13898) which adds supports for string binary compute functions. |
7e0babd to
f487172
Compare
|
@edponce Please ping when this is ready for review. Thanks! |
|
Ready for review cc @pitrou |
|
Based on the semantics of the scalar binary kernels, I am adding a kernel exec generator for binary string transforms. This includes an output adapter and array iterator. |
|
Feel free to undraft when this is ready @edponce . |
d4ed4f8 to
60c73c7
Compare
60c73c7 to
b1d4cb1
Compare
|
I have the following questions which I am not sure how to resolve:
Warning: Expression strrep(x, 3) not supported in Arrow; pulling data into R |
|
For 1: why support floating point or boolean arguments in the first place? It seems quite odd. I think those should be explicit casts if the user wants them, and they can choose safe/unsafe cast as appropriate. For 2: I don't think there's a clear argument either way. If the difference in behavior is critical in a particular application, the data could always be checked/massaged either way beforehand. Otherwise I would lean towards explicitly erroring. (You could argue that in Python, you're explicitly calling into Arrow and hence it's clear there may be a difference, while in R the user is likely using dplyr and so a difference may not be top-of-mind.) |
|
I agree with David about erroring: it seems a bit odd to me that negative values are silently assumed to be 0. In the first message, you have the following error: That seems odd, I would expect |
d5d98f9 to
6aabb4f
Compare
|
Renamed R internal function |
|
Benchmark runs are scheduled for baseline = 5897217 and contender = 0ead7c9. 0ead7c9 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
This PR adds the string repeat compute function named "string_repeat". String repeat is a binary function that accepts Binary/StringType(s) and the repetition value(s). Repetition values can be:
To support inputs of different shapes for this kernel, kernel exec generators and base classes for binary string transforms are also included.