ARROW-11693: [C++] Add string length kernel#9786
ARROW-11693: [C++] Add string length kernel#9786edponce wants to merge 8 commits intoapache:masterfrom
Conversation
|
Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? Then could you also rename pull request title in the following format? See also: |
| // Single UTF8 character straddles two entries | ||
| auto st3 = ValidateFull(2, {0, 1, 2}, "\xc3\xa9"); | ||
| // Null characters in the string | ||
| auto st4 = ValidateFull(1, {0, 4}, "\0\0\0\0"); |
There was a problem hiding this comment.
Can you explain why this is invalid? Unicode character 0 is a valid unicode character.
There was a problem hiding this comment.
Also, it works using PyArrow:
>>> arr = pa.array(["\0"])
>>> arr
<pyarrow.lib.StringArray object at 0x7f6befb354b0>
[
""
]
>>> arr.to_pylist()
['\x00']
>>> arr.validate(full=True)
>>> There was a problem hiding this comment.
You are right, this case is valid. The UTF-8 standard (https://tools.ietf.org/html/rfc3629) states that characters consisting of 1-byte range from 0-7F (Section 3). But a multibyte UTF-8 character will not contain NUL as a continuation byte, the encoding does not allows it.
Valid: '\x00'
Invalid: '\xe10080'
There was a problem hiding this comment.
Also, note that UTF8 validation is tested more thoroughly in https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/utf8_util_test.cc
pitrou
left a comment
There was a problem hiding this comment.
Thanks for the updates! Here are two more comments.
This PR adds the utf8_length compute kernel to the string scalar functions to support calculating the string length (as number of characters) for UTF-8 encoded STRINGs and LARGE STRINGs. The implementation makes use of utf8proc (utf8proc_iterate) to perform the calculation.