The `substring` kernel panics on chars above U+007F

**Describe the bug**
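The `substring` kernel only works on strings whose chars are each encoded as 1 byte in UTF-8 (that is, ASCII). If the string contains a char that requires more than 1 byte, the function can panic, because `start` and `length` are applied as byte offsets and may land in the middle of a char.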

**To Reproduce**
Steps to reproduce the behavior:
Take the string `"E=mc²"` with start index = `-1` and length = `None`.
The expected result is `"²"`.
However, I got:
```
thread 'compute::kernels::substring::tests::without_nulls_string' panicked at 'byte index 2 is out of bounds of `�`', library/core/src/fmt/mod.rs:2160:30
```

The reason is that the char `²` is encoded as the two bytes `0xC2 0xB2` in UTF-8. When we try to take the last char of the string, what we actually get is the single byte `[0xB2]`, which is not valid UTF-8 on its own.
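The byte arithmetic can be checked with plain `&str` values. This is only a sketch: `resolve_start` and `byte_suffix` are hypothetical helpers written for this illustration, not part of the arrow API, and they assume a negative `start` counts back from the end in bytes.

```rust
/// Resolves a possibly negative, byte-based `start` against `s`.
/// Hypothetical helper: assumes negative values count back from the end.
fn resolve_start(s: &str, start: i64) -> usize {
    let len = s.len() as i64; // length in bytes, not chars
    let resolved = if start >= 0 {
        start.min(len)
    } else {
        (len + start).max(0)
    };
    resolved as usize
}

/// Byte-based suffix that yields `None` instead of panicking when the
/// resolved offset falls inside a multi-byte char.
fn byte_suffix(s: &str, start: i64) -> Option<&str> {
    // `str::get` checks char boundaries and returns `None` on a bad one.
    s.get(resolve_start(s, start)..)
}
```

For `"E=mc²"` the string is 6 bytes long, so `start = -1` resolves to byte offset 5, which sits between `0xC2` and `0xB2`. `byte_suffix` returns `None` there, while `start = -2` lands on a char boundary and returns `"²"`.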

**Expected behavior**
I think there are three ways to fix the bug:
1. (easy) Update the docs of the `substring` function to explain that we only support 1-byte UTF-8 chars, and that `start` and `length` are counted in bytes.
2. (a little difficult) Check in the `substring` function that the string array only contains 1-byte UTF-8 chars (the highest-order bit of every byte is 0).
3. (difficult, and the API will change) Slice based on chars, not bytes.
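Options 2 and 3 could look roughly like the following on plain `&str` values. This is a minimal sketch under stated assumptions, not the kernel's implementation: the real kernel works on string arrays with offset buffers, and `check_ascii` and `substring_by_char` are hypothetical names introduced here.

```rust
/// Option 2 (sketch): reject input that is not pure ASCII, so that
/// byte offsets and char offsets are guaranteed to coincide.
fn check_ascii(values: &[&str]) -> Result<(), String> {
    match values.iter().find(|v| !v.is_ascii()) {
        Some(v) => Err(format!("non-ASCII value {:?} is not supported", v)),
        None => Ok(()),
    }
}

/// Option 3 (sketch): interpret `start` and `length` as char counts.
/// Assumes a negative `start` counts back from the end of the string.
fn substring_by_char(s: &str, start: i64, length: Option<u64>) -> &str {
    let char_count = s.chars().count() as i64;
    // Clamp the starting char index into [0, char_count].
    let clamped = if start >= 0 {
        start.min(char_count)
    } else {
        (char_count + start).max(0)
    };
    let start_char = clamped as usize;
    // Byte offset of the n-th char, or the end of the string if n is too large.
    let byte_at = |n: usize| s.char_indices().nth(n).map_or(s.len(), |(i, _)| i);
    let begin = byte_at(start_char);
    let end = match length {
        Some(len) => {
            let len = usize::try_from(len).unwrap_or(usize::MAX);
            byte_at(start_char.saturating_add(len))
        }
        None => s.len(),
    };
    // Both offsets are char boundaries and begin <= end, so this cannot panic.
    &s[begin..end]
}
```

With this sketch, `substring_by_char("E=mc²", -1, None)` returns `"²"`, matching the expected result above. The trade-off is that finding a char boundary is a linear scan per string, whereas byte offsets are constant-time, which is part of why option 3 is the most invasive.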


Issue #1478