ARROW-12712: [C++] String repeat kernel by edponce · Pull Request #11023 · apache/arrow

edponce · 2021-08-27T22:38:19Z

This PR adds the string repeat compute function named "string_repeat". String repeat is a binary function that accepts Binary/StringType(s) and the repetition value(s). Repetition values can be:

a single value applied to all strings
an array of values where each repeat count corresponds to the string in the same position

To support inputs of different shapes for this kernel, kernel exec generators and base classes for binary string transforms are also included.
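As a rough illustration of the two input shapes described above, here is a pure-Python sketch of the intended semantics. `repeat_strings` is a hypothetical helper, not part of Arrow's API, and the null handling (a `None` in either input yields `None`) is an assumption rather than something stated in this PR.

```python
from typing import List, Optional, Union


def repeat_strings(
    strings: List[Optional[str]],
    repeats: Union[int, List[Optional[int]]],
) -> List[Optional[str]]:
    """Hypothetical helper mimicking the described semantics (not Arrow code)."""
    if isinstance(repeats, int):
        # Scalar case: the single count is applied to every string.
        repeats = [repeats] * len(strings)
    if len(repeats) != len(strings):
        raise ValueError("strings and repeats must have the same length")
    out: List[Optional[str]] = []
    for s, n in zip(strings, repeats):
        # Assumption: a null in either input produces a null output.
        out.append(None if s is None or n is None else s * n)
    return out


scalar_result = repeat_strings(["a", "b", "c"], 2)
array_result = repeat_strings(["a", "b", "c"], [1, 2, 3])
```

The array case pairs each string with the count at the same position, which is what makes this a binary function rather than a unary function with options.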

github-actions · 2021-08-27T22:38:37Z

https://issues.apache.org/jira/browse/ARROW-12712

edponce · 2021-08-28T04:53:35Z

Related to a replicate operation, there was a previous discussion in Zulip chat of having a general replicate functionality where string repeat is a particular case.

Arrow already has MakeArrayFromScalar and RepeatedArrayFactory, which use the concatenate implementation internally. Could these be used in this PR? They operate specifically on Array types, whereas the kernel's transform method works with raw pointers.

lidavidm · 2021-08-28T13:13:50Z

You may instead be interested in two things I added recently: ArrayBuilder::AppendScalar(const Scalar&, int64_t) and ArrayBuilder::AppendArraySlice. This would let you implement a generalized repeat without allocating and concatenating lots of intermediate arrays, and would let you preallocate the final array as well.

edponce · 2021-08-28T23:14:12Z

@lidavidm Those ArrayBuilder methods do work to perform this operation but will require not following the common approach used for string kernels based on the already provided StringTransformXXX infrastructure. Specifically, it would require overriding ExecArray() (while duplicating most of it). For how things currently are, I think using the ArrayBuilder/MakeScalar methods for StrRepeat is not preferable.

Also, note that the current StrRepeat implementation allocates the entire output array for all repeated strings only once, via ExecArray(). StrRepeat overrides MaxCodeunits() to return input_ncodeunits * n_repeats.
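The single-allocation idea can be sketched in plain Python as follows. This is only an analogy for the scalar-count case: the function name, the offsets list, and the use of `bytearray` are illustrative assumptions, and the repeat count is assumed to be non-negative.

```python
from typing import List, Tuple


def repeat_into_single_buffer(
    values: List[bytes], n_repeats: int
) -> Tuple[bytes, List[int]]:
    """Illustrative sketch only; assumes n_repeats >= 0."""
    # Size bound computed up front, analogous to input_ncodeunits * n_repeats.
    input_ncodeunits = sum(len(v) for v in values)
    max_codeunits = input_ncodeunits * n_repeats
    data = bytearray(max_codeunits)  # the only data allocation
    offsets = [0]
    pos = 0
    for v in values:
        for _ in range(n_repeats):
            # Copy the value into its slot; the buffer never grows.
            data[pos:pos + len(v)] = v
            pos += len(v)
        offsets.append(pos)
    return bytes(data[:pos]), offsets


buffer, offsets = repeat_into_single_buffer([b"a", b"bc"], 2)
```

With a single scalar count the bound is exact, so nothing is wasted and no intermediate arrays need to be concatenated.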

lidavidm · 2021-08-28T23:15:40Z

Sure, I'm talking about more general repeat methods, though I guess now I question what you might want to repeat other than binary-like types and I suppose lists.

edponce · 2021-08-30T11:05:41Z

The current StringTransformXXX classes do not easily support non-scalar options. In this PR, we want to be able to do the following:

str_repeat(['a', 'b', 'c'], repeats=[1,2,3])  # ['a', 'bb', 'ccc']

Possible solution: Override the ExecArray of StringTransformExecBase and specialize for kernels that require the current index of the input string. This is done by passing the string index to the transform->Transform(..., i) call. We need to keep in mind that these indexes are relative to the current ExecBatch so we need to offset accordingly.
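A simplified Python sketch of this idea is shown below. The names `exec_batch` and `make_repeat_transform` are hypothetical, and a batch is modeled as just an offset and a length into the full input, which is a deliberate simplification of ExecBatch.

```python
from typing import Callable, List


def make_repeat_transform(repeats: List[int]) -> Callable[[str, int], str]:
    """Build a transform that needs the index of the string it is given."""
    def transform(s: str, index: int) -> str:
        return s * repeats[index]
    return transform


def exec_batch(
    strings: List[str],
    batch_offset: int,
    batch_length: int,
    transform: Callable[[str, int], str],
) -> List[str]:
    """Process one batch; i is batch-relative, so the offset is added back."""
    out = []
    for i in range(batch_length):
        absolute = batch_offset + i
        out.append(transform(strings[absolute], absolute))
    return out


strings = ["a", "b", "c", "d"]
transform = make_repeat_transform([1, 2, 3, 4])
# Two batches of two elements each, processed independently.
result = exec_batch(strings, 0, 2, transform) + exec_batch(strings, 2, 2, transform)
```

Without adding `batch_offset`, the second batch would read the counts for indexes 0 and 1 again and produce the wrong output.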

cc @pitrou @lidavidm

pitrou · 2021-08-30T11:07:17Z

The current StringTransformXXX classes do not easily support non-scalar options. In this PR, we want to be able to do the following:
str_repeat(['a', 'b', 'c'], repeats=[1,2,3])  # ['a', 'bb', 'ccc']

To me, this means that the kernel is simply a binary kernel.

edponce · 2021-08-30T11:09:49Z

I agree that this is a binary kernel because the number of repeats is required.

edponce · 2021-09-04T04:25:16Z

This PR depends on #11082 (ARROW-13898), which adds support for string binary compute functions.

pitrou · 2021-09-09T08:51:52Z

@edponce Please ping when this is ready for review. Thanks!

edponce · 2021-09-09T08:54:39Z

Ready for review cc @pitrou

cpp/src/arrow/compute/kernels/scalar_string.cc

cpp/src/arrow/compute/kernels/scalar_string_test.cc

docs/source/cpp/compute.rst

edponce · 2021-09-14T17:45:39Z

Based on the semantics of the scalar binary kernels, I am adding a kernel exec generator for binary string transforms. This includes an output adapter and array iterator.

pitrou · 2021-09-16T14:56:04Z

Feel free to undraft when this is ready @edponce .

edponce · 2021-10-19T20:26:58Z

I have the following questions which I am not sure how to resolve:

1. I tried allowing integer, floating-point, and boolean values for the num_repeats argument. These are cast to Int64Type via DispatchBest, but casting floating-point values to integers triggers a truncation error. How can this be achieved?
2. Should an error be returned if the num_repeats argument is negative? Currently, a negative value is treated as zero to match Python behavior, but base R strrep raises an error.
3. I added an R binding named strrep, but the base R version is used instead. cc @jonkeane

Warning: Expression strrep(x, 3) not supported in Arrow; pulling data into R

cc @lidavidm @bkietz

lidavidm · 2021-10-19T20:56:02Z

For 1: why support floating point or boolean arguments in the first place? It seems quite odd. I think those should be explicit casts if the user wants them, and they can choose safe/unsafe cast as appropriate.

For 2: I don't think there's a clear argument either way. If the difference in behavior is critical in a particular application, the data could always be checked/massaged either way beforehand. Otherwise I would lean towards explicitly erroring. (You could argue that in Python, you're explicitly calling into Arrow and hence it's clear there may be a difference, while in R the user is likely using dplyr and so a difference may not be top-of-mind.)
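To make the two options for negative counts concrete, here is a small Python sketch of both policies. The function names are hypothetical, and neither is presented as what the kernel ended up doing.

```python
def repeat_clamped(s: str, n: int) -> str:
    """Python-like policy: a negative count behaves like zero."""
    return s * max(n, 0)


def repeat_strict(s: str, n: int) -> str:
    """Erroring policy: a negative count is rejected explicitly."""
    if n < 0:
        raise ValueError(f"repeat count must be non-negative, got {n}")
    return s * n


clamped = repeat_clamped("ab", -1)
strict_ok = repeat_strict("ab", 2)
try:
    repeat_strict("ab", -1)
    strict_raised = False
except ValueError:
    strict_raised = True
```

Under the strict policy, a caller who wants the clamped behavior can still clamp the counts before calling, which matches the point that data can be checked or massaged beforehand.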

jonkeane · 2021-10-19T21:12:28Z

I agree with David about erroring: it seems a bit odd to me that negative values are silently assumed to be 0.

In the first message, you have the following error:

Warning: Expression strrep(x, 3) not supported in Arrow; pulling data into R

That seems odd, I would expect 3 to be fine, is there a missing - somewhere?


edponce · 2021-11-03T21:03:42Z

Renamed the internal R function str_dup to duplicate_string because it was shadowing both stringr's str_dup and the kernel binding for binary_repeat.
Thanks to @thisisnic for identifying this subtle issue!

ursabot · 2021-11-04T14:44:08Z

Benchmark runs are scheduled for baseline = 5897217 and contender = 0ead7c9. 0ead7c9 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️1.54% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️1.25% ⬆️0.89%] ursa-thinkcentre-m75q
Supported benchmarks:
ursa-i9-9960x: langs = Python, R, JavaScript
ursa-thinkcentre-m75q: langs = C++, Java
ec2-t3-xlarge-us-east-2: cloud = True

github-actions bot added the Component: C++ label Aug 27, 2021

github-actions bot added Component: Python Component: R labels Aug 30, 2021

edponce force-pushed the ARROW-12712-String-repeat-kernel branch 2 times, most recently from 6942fa8 to 23d7516 Compare September 4, 2021 04:21

edponce force-pushed the ARROW-12712-String-repeat-kernel branch 3 times, most recently from 7e0babd to f487172 Compare September 9, 2021 04:48

pitrou reviewed Sep 9, 2021


edponce marked this pull request as draft September 13, 2021 01:43

edponce force-pushed the ARROW-12712-String-repeat-kernel branch from d4ed4f8 to 60c73c7 Compare September 20, 2021 09:24

edponce force-pushed the ARROW-12712-String-repeat-kernel branch from 60c73c7 to b1d4cb1 Compare October 19, 2021 01:18

edponce marked this pull request as ready for review October 19, 2021 09:05

edponce requested a review from pitrou October 19, 2021 09:12

edponce added 18 commits November 3, 2021 11:52

improve comments, add Status as a parameter, and minor consistency changes

156ae79

wrap long lines and rename tests

454bd49

add Result return types and minor changes

72ec626

add StringRepeat benchmark

df17787

remove std::function indirection

eade790

add static_cast to Transform return value

8de20dc

update R test

9e28894

fix lint error

7c0673d

revert include statement

f441b15

change xor to subtraction and rename var in doubling approach

225cc68

rename function to binary_repeat

62969b3

update function name in benchmark

6ec68ae

fix R func name error

9dbc421

R changes: input -> .input, transmute -> mutate

73a6cfc

update function name in R expressions

a28f25d

add R str_dup binding and test

9ea2d4d

use different vars in R tests

eb72d63

add test with num_repeat=null

6aabb4f

edponce force-pushed the ARROW-12712-String-repeat-kernel branch from d5d98f9 to 6aabb4f Compare November 3, 2021 15:55

edponce added 6 commits November 3, 2021 13:00

use transmute instead of mutate?

cc47c65

add str_dup to test title and revert to mutate()

6bce7c2

add L to number range

c555bd8

split tests

29df265

use different var names

c5e83a9

rename R func str_dup

97a15ad

lidavidm approved these changes Nov 4, 2021


lidavidm closed this in 0ead7c9 Nov 4, 2021

asfimport mentioned this pull request Dec 23, 2021

[C++] String repeat kernel #18661

Closed


8 participants
