ARROW-10959: [C++] Add scalar string join kernel#8990
ARROW-10959: [C++] Add scalar string join kernel#8990maartenbreddels wants to merge 5 commits intoapache:masterfrom
Conversation
8f6e26e to
609a388
Compare
d4608a9 to
356c300
Compare
609a388 to
d42058a
Compare
d42058a to
9c0836d
Compare
|
Failures seem unrelated. |
pitrou
left a comment
There was a problem hiding this comment.
Thanks @maartenbreddels . Here are some comments.
|
Regarding "join" in the name: Regarding "binary" in the name: It would also be good for the name to reflect that this operates on a list array. How about renaming it |
|
I'll just point out that "join" is not a python-ism. There is a string join in Java, Rust, C#, JavaScript, etc. and it is consistently called join. I think R is the only language I can find that doesn't call it "join". |
SQL—which probably has more users than all those languages combined 😁 |
Snark aside, that's a good point, and since users who are thinking in the SQL way will always be at least one abstraction layer removed from the C++ library, that makes me think "join" is fine 👍 |
|
In SQL it's called |
In SQL, |
|
@maartenbreddels Do you want to update this PR? |
|
I'm going to take this up. |
f06549e to
9353e4e
Compare
|
I adressed review comments, added a benchmark, and made |
|
Still a TODO to address (also support large_list inputs). Edit: done |
There was a problem hiding this comment.
To keep the public overloads to a minimum, these should probably just take Datums. Added https://issues.apache.org/jira/browse/ARROW-12953
There was a problem hiding this comment.
We don't lose anything for string kernels which always do their own allocation, but it's worth noting that *out is always a correctly shaped ArrayData. We could also safely write
| std::shared_ptr<Array> string_array; | |
| RETURN_NOT_OK(builder->Finish(&string_array)); | |
| *out = *string_array->data(); | |
| RETURN_NOT_OK(builder->FinishInternal(const_cast<std::shared_ptr<ArrayData>*>(&out->array()))); |
@jorisvandenbossche I've implemented this kernel as a binary (arity) kernel, so the input list array and the separator input string array can both be an array (see python test).
I did not implement the case where the input list is a scalar, and the separator an array, since I don't think that's very common.
And note that the kernel is named
binary_joinbecause it takes string-like and binary-like inputs.