Description
Describe the bug
The substring kernel only works correctly on characters that are encoded as 1 byte in UTF-8. If the string contains a character that requires more than 1 byte, the function can panic when a byte offset falls inside that character.
To Reproduce
Steps to reproduce the behavior:
Take the string "E=mc²" with start = -1 and length = None.
The expected result is "²".
However, I got:
thread 'compute::kernels::substring::tests::without_nulls_string' panicked at 'byte index 2 is out of bounds of `�`', library/core/src/fmt/mod.rs:2160:30
The reason is that the character ² is encoded as 0xC2 0xB2 in UTF-8. When we try to take the last character of the string, what we actually get is the byte sequence [0xB2], which is not valid UTF-8.
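The byte layout can be checked with the standard library alone. The following is a minimal sketch; `last_bytes` and `is_valid_utf8` are hypothetical helpers written for illustration and are not part of the substring kernel.

```rust
/// Hypothetical helper: returns the last `n` bytes of `s`,
/// mimicking a byte-based "start = -n" slice.
fn last_bytes(s: &str, n: usize) -> &[u8] {
    let bytes = s.as_bytes();
    &bytes[bytes.len().saturating_sub(n)..]
}

/// Hypothetical helper: true if `bytes` form a valid UTF-8 sequence.
fn is_valid_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}
```

For "E=mc²", `last_bytes(s, 1)` yields only the continuation byte 0xB2, which `is_valid_utf8` rejects, while `last_bytes(s, 2)` yields the full 0xC2 0xB2 pair. `str::is_char_boundary` reports the same thing: byte offset 5 is not a character boundary, but offset 4 is.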
Expected behavior
I think there are three ways to fix the bug:
1. (easy) Update the docs of the substring function to explain that we only support 1-byte UTF-8 characters, and that start and length are counted in bytes.
2. (a little difficult) Check in the substring function that the string array only contains 1-byte UTF-8 characters (the highest-order bit of every byte is 0).
3. (difficult, and the API will be changed) Slice based on characters, not bytes.
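Options 2 and 3 could look roughly like the sketch below. It works on plain `&str` values rather than Arrow arrays, and both `check_all_ascii` and `substring_by_char` are hypothetical names chosen for illustration. The `start: i64, length: Option<u64>` parameters are an assumption modelled on the reproduction above, not the kernel's confirmed signature.

```rust
/// Option 2 (sketch): reject input that contains any non-ASCII byte,
/// i.e. any byte whose highest-order bit is set.
fn check_all_ascii(values: &[&str]) -> Result<(), String> {
    match values.iter().find(|s| !s.is_ascii()) {
        Some(s) => Err(format!("non-ASCII string found: {s:?}")),
        None => Ok(()),
    }
}

/// Option 3 (sketch): interpret `start` and `length` as character counts.
/// A negative `start` counts back from the end of the string.
fn substring_by_char(s: &str, start: i64, length: Option<u64>) -> &str {
    let char_count = s.chars().count() as i64;
    // Resolve `start` to a character index clamped to [0, char_count].
    let resolved: i64 = if start >= 0 {
        start.min(char_count)
    } else {
        (char_count + start).max(0)
    };
    let start_char = resolved as usize;
    // Translate the character index into a byte offset on a char boundary.
    let start_byte = s
        .char_indices()
        .nth(start_char)
        .map_or(s.len(), |(i, _)| i);
    let rest = &s[start_byte..];
    match length {
        Some(len) => {
            // Byte offset just past `len` characters, or the end of the string.
            let end = rest
                .char_indices()
                .nth(len as usize)
                .map_or(rest.len(), |(i, _)| i);
            &rest[..end]
        }
        None => rest,
    }
}
```

With this sketch, `substring_by_char("E=mc²", -1, None)` returns "²", matching the expected result above. The trade-off is cost: counting characters is linear in the string length, whereas byte offsets can be computed in constant time.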