ARROW-12713 [C++] String reverse kernel#10317
ARROW-12713 [C++] String reverse kernel#10317nirandaperera wants to merge 15 commits intoapache:masterfrom
Conversation
|
Learned from @cyb70289 that new kernels need to extend both C++ and Python documentation:
|
6b4695f to
24a6088
Compare
lidavidm
left a comment
There was a problem hiding this comment.
Thanks. I left some comments about wording - I don't think we need to explain the Unicode details here, just mention that we operate on codepoints/code units, and not grapheme clusters.
| return utf8_char_found == 0; | ||
| } | ||
|
|
||
| static Status InvalidStatus() { return Status::Invalid("Non-ascii sequence in input"); } |
There was a problem hiding this comment.
Nit: can we be consistent with ASCII?
There was a problem hiding this comment.
I do not think you should add InvalidStatus methods with specific messages that do not necessarily generalize for all string errors. Why not directly invoke Status::Invalid("Specific message to specific code block")?
There was a problem hiding this comment.
I am reusing the StringTransform class for both ascii and utf8 kernels. In that, it always throws a Status::Invalid("Invalid UTF8 sequence in input") which is not quite right for an ascii kernel. That's why it was generalized and offloaded to Derived::InvalidStatus() 🙂
There was a problem hiding this comment.
Just to clarify here, the structure of the kernel means the 'implementation' doesn't directly return an error, only true/false. Would you prefer to change it so that it returns a status instead?
|
Remember to update Python documentation as well, https://github.com/apache/arrow/blob/master/docs/source/python/api/compute.rst |
lidavidm
left a comment
There was a problem hiding this comment.
One more nit from me (tests are failing due to a small capitalization issue)
Co-authored-by: David Li <[email protected]>
|
The formatting needs to be fixed here (see the logs for the Dev action), the other CI failures look unrelated. You could try |
|
@lidavidm I think this looks okay now. |
| // would produce arrays with same lengths. Hence checking offset buffer equality | ||
| auto malformed_input = ArrayFromJSON(this->type(), "[\"ɑ\xFFɑa\", \"ɽ\xe1\xbdɽa\"]"); | ||
| const Result<Datum>& res = CallFunction("utf8_reverse", {malformed_input}); | ||
| ASSERT_TRUE(res->array()->buffers[1]->Equals(*malformed_input->data()->buffers[1])); |
There was a problem hiding this comment.
Just a nit, but FTR we probably have a AssertBufferEquals (or perhaps AssertBuffersEqual :-)).
This PR adds ascii and utf8 reverse kernels.