Improve doc string for substring kernel#1529
Conversation
Signed-off-by: remzi <[email protected]>
Signed-off-by: remzi <[email protected]>
Codecov Report
@@ Coverage Diff @@
## master #1529 +/- ##
==========================================
- Coverage 82.78% 82.77% -0.01%
==========================================
Files 190 190
Lines 54827 54827
==========================================
- Hits 45386 45384 -2
- Misses 9441 9443 +2
Continue to review full report at Codecov.
|
alamb
left a comment
There was a problem hiding this comment.
❤️ thank you so much @HaoYang670
Looks very nice
| /// let array = StringArray::from(vec![Some("E=mc²")]); | ||
| /// let result = substring(&array, -1, &None).unwrap(); | ||
| /// let result = result.as_any().downcast_ref::<StringArray>().unwrap(); | ||
| /// assert_eq!(result.value(0).as_bytes(), &[0x00B2]); // invalid utf-8 format |
There was a problem hiding this comment.
🤔 Is it worth a ticket to add utf8 validation to this kernel? It should be straightforward, but will likely slow it down.
There was a problem hiding this comment.
I am afraid the time complexity will be changed. Currently, the time complexity is O(length of offset buffer). Adding validation might increase the time complexity to O(length of offset buffer + length of value buffer) because we have to read every byte in the value buffer in the worst case:
https://github.com/apache/arrow-rs/blob/master/arrow/src/array/data.rs#L1090-L1112
There was a problem hiding this comment.
By the way, is it necessary to mark substring unsafe?
Besides, we could implement a safe version of substring and also substring_by_char:
| name | Safety | Performance |
|---|---|---|
| unsafe substring_unckecked | user should make sure valid utf8 substrings can be gotten with start and length |
fast |
| substring_checked | None | slow |
| substring_by_char | None | slow |
Users should always make sure all values of the input array are in valid utf-8 format.
This will improve the safety of the substring kernel, but break the public API.
There was a problem hiding this comment.
Also, Arrow2 has implemented substring for binary array. I am not sure whether we need to do it.
There was a problem hiding this comment.
We could also add substring support for ListArray, FixedSizeListArray and FixedSizeBinaryArray
There was a problem hiding this comment.
@HaoYang670 I think these are all good ideas (making safe / unsafe versions, and adding substring for binary etc). I would be happy to create tickets for them if that would help
If we are going to change the substring kernel, I suggest we do the following to follow Rust's safe-by-default mantra:
| name | Safety | Performance | Notes |
|---|---|---|---|
| unsafe substring_bytes_unchecked | unsafe | fast | what is currently called substring - user should make sure valid utf8 substrings can be gotten with start and length |
| substring_bytes | safe | slow | Implement substring by bytes - throws error of invalid utf-8 sequence is created |
| substring | safe | slow | Implement substring by char, ensuring all substrings are utf-8 |
Co-authored-by: Andrew Lamb <[email protected]>
substringsubstring kernel
Signed-off-by: remzi [email protected]
Which issue does this PR close?
Closes #1478.
Rationale for this change
Tell users the
pitfallofsubstringfunction.Are there any user-facing changes?
Doc is updated.
Follow on work:
#1531