The substring kernel panics on chars above U+007F #1478

@HaoYang670

Describe the bug
The substring kernel only works on chars that are encoded as a single byte in UTF-8. If the string contains a char that requires more than one byte, the function can panic, because start and length are counted in bytes and a cut may land inside a multi-byte sequence.

To Reproduce
Steps to reproduce the behavior:
Take the string "E=mc²" with start index = -1 and length = None.
The expected result is "²".
However, I got:

thread 'compute::kernels::substring::tests::without_nulls_string' panicked at 'byte index 2 is out of bounds of `�`', library/core/src/fmt/mod.rs:2160:30

The reason is that the char ² is encoded as the two bytes 0xC2 0xB2 in UTF-8. When we try to take the last char of the string, what we actually get is the single byte [0xB2], which is not valid UTF-8 on its own.
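As a small illustration of the byte/char mismatch, here is a sketch in plain Rust. It does not use the arrow crate, and `last_bytes` is a hypothetical helper written only for this example. It uses `str::get`, which returns `None` when a range does not fall on a char boundary instead of panicking.

```rust
/// Hypothetical helper (not part of arrow): returns the last `n` bytes of `s`
/// as a `&str`, or `None` if `n` is larger than the string or the cut would
/// land inside a multi-byte UTF-8 sequence.
fn last_bytes(s: &str, n: usize) -> Option<&str> {
    // Byte offset where the requested suffix would start.
    let start = s.len().checked_sub(n)?;
    // `str::get` checks char boundaries and yields `None` rather than panicking.
    s.get(start..)
}
```

With "E=mc²" (6 bytes, 5 chars), `last_bytes("E=mc²", 1)` is `None` because byte offset 5 sits between 0xC2 and 0xB2, while `last_bytes("E=mc²", 2)` is `Some("²")`.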

Expected behavior
I think there are three ways to fix the bug:
1. (easy) Update the doc of the substring function to explain that we only support 1-byte UTF-8 chars, and that start and length are counted in bytes.
2. (a little difficult) Check in the substring function that the string array only contains 1-byte UTF-8 chars (the highest-order bit of every byte is 0).
3. (difficult, and the API will be changed) Take the substring based on characters, not bytes.
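Options 2 and 3 could look roughly like the following sketch on a single `&str`. This is not the arrow kernel or its API: `all_ascii` and `substring_by_char` are hypothetical names, and the handling of a negative `start` (counting from the end, clamped to the string) is an assumption made for this example.

```rust
/// Option 2 (sketch): true if every value is pure ASCII, i.e. each byte has
/// its highest-order bit set to 0, so byte offsets equal char offsets.
fn all_ascii(values: &[&str]) -> bool {
    values.iter().all(|v| v.is_ascii())
}

/// Option 3 (sketch): substring counted in chars, not bytes.
/// A negative `start` counts from the end of the string (assumption).
fn substring_by_char(s: &str, start: i64, length: Option<usize>) -> &str {
    let char_count = s.chars().count() as i64;
    // Clamp the starting char index into [0, char_count].
    let start_char = if start >= 0 {
        start.min(char_count)
    } else {
        (char_count + start).max(0)
    };
    let start_char = start_char as usize;
    // Translate the char index into a byte offset on a char boundary.
    let byte_start = s
        .char_indices()
        .nth(start_char)
        .map(|(i, _)| i)
        .unwrap_or(s.len());
    let rest = &s[byte_start..];
    // Translate the char length into a byte length, or take everything.
    let byte_len = match length {
        Some(n) => rest
            .char_indices()
            .nth(n)
            .map(|(i, _)| i)
            .unwrap_or(rest.len()),
        None => rest.len(),
    };
    &rest[..byte_len]
}
```

For the reproduction case, `substring_by_char("E=mc²", -1, None)` gives "²". The trade-off is that each call walks the string to find char boundaries, so it costs O(n) per value instead of the constant-time offset arithmetic of the byte-based version.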



Labels: arrow, bug
