Add utf-8 validation checking for substring#1577
Conversation
update doc add a test for invalid array type Signed-off-by: remzi <[email protected]>
clean up Signed-off-by: remzi <[email protected]>
Signed-off-by: remzi <[email protected]>
Signed-off-by: remzi <[email protected]>
Codecov Report
@@ Coverage Diff @@
## master #1577 +/- ##
==========================================
+ Coverage 82.93% 82.95% +0.01%
==========================================
Files 193 193
Lines 55350 55373 +23
==========================================
+ Hits 45907 45934 +27
+ Misses 9443 9439 -4
Continue to review full report at Codecov.
|
tustvold
left a comment
There was a problem hiding this comment.
Looking good, left some comments
| Box::new(|old_start: OffsetSize, end: OffsetSize| (end + start).max(old_start)) | ||
| // check the `idx` is a valid char boundary or not | ||
| // If yes, return `idx`, else return error | ||
| let check_char_boundary = |idx: OffsetSize| { |
There was a problem hiding this comment.
I wonder if we could instead interpret data as &str with std::str::from_utf8_unchecked, which should be valid if this is a valid string array, and then use the standard library is_char_boundary instead of implementing our own
| // If yes, return `idx`, else return error | ||
| let check_char_boundary = |idx: OffsetSize| { | ||
| let idx_usize = idx.to_usize().unwrap(); | ||
| if idx_usize == data.len() || data[idx_usize] >> 6 != 0b10_u8 { |
There was a problem hiding this comment.
| if idx_usize == data.len() || data[idx_usize] >> 6 != 0b10_u8 { | |
| // A valid code-point iff it does not start with 0b10xxxxxx | |
| // Bit-magic taken from `std::str::is_char_boundary` | |
| if idx_usize == data.len() || (data[idx_usize] as i8) < -0x40 { |
There is a bit-magic trick to save an operator, although it is possible LLVM is smart enough to notice this. I also think it would be better to just use the standard library version directly if possible.
| #[test] | ||
| fn check_invalid_array_type() { | ||
| let array = Int32Array::from(vec![Some(1), Some(2), Some(3)]); | ||
| assert_eq!(substring(&array, 0, None).is_err(), true); |
There was a problem hiding this comment.
I think it would make these tests easier to read, and also potentially less fragile if they checked the error message, e.g. something like
let err = substring(&array, 0, None).unwrap_err().to_string();
assert!(err.contains("foo"), "{}", err);
| Box::new(|start: OffsetSize, end: OffsetSize| (*length).min(end - start)) | ||
| } else { | ||
| Box::new(|start: OffsetSize, end: OffsetSize| end - start) | ||
| // function for calculating the start index of each string |
There was a problem hiding this comment.
| // function for calculating the start index of each string | |
| // Function for calculating the start index of each string | |
| // | |
| // Takes the start and end offsets of the original string, and returns the byte offset | |
| // to start copying data from the source values array | |
| // | |
| // Returns an error if this offset is not at a valid UTF-8 char boundary |
Or something to that effect, it wasn't clear to me whether this was the offset in the new array or the old
| Ordering::Greater => Box::new(|old_start, end| { | ||
| check_char_boundary((old_start + start).min(end)) | ||
| }), | ||
| Ordering::Equal => Box::new(|old_start, _| Ok(old_start)), |
| let new_length = cal_new_length(new_start, pair[1]); | ||
| len_so_far += new_length; | ||
| new_starts_ends.push((new_start, new_start + new_length)); | ||
| offsets.windows(2).try_for_each(|pair| -> Result<()> { |
There was a problem hiding this comment.
I'm curious if the dyn dispatch functions perform better than letting LLVM perform loop unswitching, I would have thought eliminating the dyn-dispatch would be both clearer and faster
There was a problem hiding this comment.
The reasons that I use dyn Fn are:
- We should do the comparison for
startand the destruction forlengthoutside the for loop. And using closure is a convenient way, otherwise there would be lots of duplicate code. - Closures have different sizes because they capture different environment.
- The dynamic dispatching doesn't impact the performance in this function because they are not hotspots.
There was a problem hiding this comment.
My point was that modern optimising compilers will automatically lift conditionals out of loops, in a process called loop unswitching, and that letting it do this can be faster not just from eliminating dynamic dispatch, but also letting it further optimise the code.
I don't feel strongly about it, but thought I'd mention that you might be able to make the code clearer and faster by letting the compiler handle this
There was a problem hiding this comment.
Thank you for your explanation. Let me have a try!
| }), | ||
| }; | ||
|
|
||
| // function for calculating the end index of each string |
There was a problem hiding this comment.
Some documentation of this functions inputs and output would be helpful I think
arrow/benches/string_kernels.rs
Outdated
|
|
||
| let arr_string = create_string_array_with_len::<i32>(size, 0.0, str_len); | ||
| let start = 0; | ||
| let start = 1; |
There was a problem hiding this comment.
Perhaps we could benchmark the various permutations of no start offset and no length?
There was a problem hiding this comment.
Done! And I have updated the benchmark result.
There was a problem hiding this comment.
The validation checking is only enabled when start !=0 or length != None. So there is no performance cost for the first micro-bench
| /// # Error | ||
| /// this function errors when the passed array is not a \[Large\]String array. | ||
| /// - The function errors when the passed array is not a \[Large\]String array. | ||
| /// - The function errors when you try to create a substring in the middle of a multibyte character. |
There was a problem hiding this comment.
Hm, multibyte character sounds a bit vague, maybe just say invalid utf-8 format?
| if idx_usize == data.len() || data[idx_usize] >> 6 != 0b10_u8 { | ||
| Ok(idx) | ||
| } else { | ||
| Err(ArrowError::ComputeError(format!("The index {} is invalid because {:#010b} is not the head byte of a char.", idx_usize, data[idx_usize]))) |
There was a problem hiding this comment.
The error message may not be understood by end-users. Maybe just say "The substring begins at index {} is an invalid utf8 char because it is not the first byte in a utf8 code point sequence"?
Signed-off-by: remzi <[email protected]>
Signed-off-by: remzi <[email protected]>
update doc Signed-off-by: remzi <[email protected]>
Signed-off-by: remzi <[email protected]>
Signed-off-by: remzi <[email protected]>
Signed-off-by: remzi <[email protected]>
tustvold
left a comment
There was a problem hiding this comment.
This looks good to me - thank you 👍
I'll leave this open for a bit longer to give others the chance to review, otherwise I'll merge this in this afternoon (GMT)
Co-authored-by: Raphael Taylor-Davies <[email protected]>
Which issue does this PR close?
Closes #1575
What changes are included in this PR?
Are there any user-facing changes?
No.
Benchmark
We only check the char boundary validation when
start != 0orlength != None.