speed up substring_by_char by about 2.5x #1832
Conversation
Signed-off-by: remzi <[email protected]>
Codecov Report
@@ Coverage Diff @@
## master #1832 +/- ##
==========================================
- Coverage 83.53% 83.49% -0.05%
==========================================
Files 200 201 +1
Lines 56798 56903 +105
==========================================
+ Hits 47449 47510 +61
- Misses 9349 9393 +44
Continue to review full report at Codecov.
    let length = length.map_or(char_count - start, |length| {
        length.to_usize().unwrap().min(char_count - start)
    });
    let mut vals = BufferBuilder::<u8>::new(array.value_data().len());
This will always allocate more memory than we need, but it's no big deal.
If you wanted to improve this estimate you could compute the difference between the first and last offset. It would still be an over-estimate, but would over-estimate sliced arrays less
Did you want to do this, or should I just get this in?
I think this is good enough. It has O(n) space complexity (where n is the byte length of the input array's value buffer) and allocates at most n extra bytes beyond what we need (when length = 0).
Sorry I wasn't clear, the problem here is that the values buffer may be substantially larger than necessary as a result of slicing.
Say I have an array of 1000 strings, and I take a slice of length two, the values array is completely unchanged, but the offsets array is now length of 3, instead of 1001.
By taking the delta between the first and last offset in the source array, you ensure you are not allocating capacity that you definitely don't need; technically that capacity isn't even needed by the source array.
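The estimate discussed above can be sketched as follows. This is a minimal, self-contained illustration (plain slices stand in for Arrow's offsets buffer, and `sliced_capacity` is a hypothetical helper name): the bytes a sliced string array can actually reference is the span between its first and last offset, which is never larger than the full values buffer.

```rust
// Sketch: the capacity actually reachable through a (possibly sliced)
// string array's offsets is last_offset - first_offset.
fn sliced_capacity(offsets: &[i32]) -> usize {
    match (offsets.first(), offsets.last()) {
        (Some(&first), Some(&last)) => (last - first) as usize,
        _ => 0,
    }
}

fn main() {
    // 1000 one-byte strings flattened: offsets 0..=1000
    let offsets: Vec<i32> = (0..=1000).collect();
    // a slice of length 2 starting at index 10 keeps offsets [10, 11, 12]
    let slice = &offsets[10..=12];
    // only 2 bytes are reachable, vs. 1000 in the full values buffer
    assert_eq!(sliced_capacity(slice), 2);
    println!("{}", sliced_capacity(slice));
}
```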
Thank you for your explanation, which makes perfect sense.
I forgot about offsets and slicing. 😅 I will fix this.
Updated! No impact on the benchmark results.
        vec![],
    )
    };
    Ok(GenericStringArray::<OffsetSize>::from(data))
The Ok() here is just to keep this function consistent with substring. It is otherwise meaningless, because this function never returns an error. (Actually, I don't like unwrap or panic in a function returning Result, but the code would be much less clear if I removed unwrap: lots of ok_or(...).)
Panicking if an invariant is violated I think makes perfect sense, errors only really make sense imo if you expect some higher level system to catch and handle that error.
    if let Some(val) = val {
        let char_count = val.chars().count();
        let start = if start >= 0 {
            start.to_usize().unwrap()
I am curious why to_usize() returns Option<usize> and not Result<usize>.
Unfortunately, the rust std library also chooses Option:

    /// Converts the value of `self` to a `usize`. If the value cannot be
    /// represented by a `usize`, then `None` is returned.
    #[inline]
    fn to_usize(&self) -> Option<usize> {
        self.to_u64().as_ref().and_then(ToPrimitive::to_usize)
    }

But Result seems more reasonable to me, because we could tell users why the cast failed:

    #[inline]
    fn to_usize(&self) -> Result<usize> {
        match num::ToPrimitive::to_usize(self) {
            Some(val) => Ok(val),
            None => Err(ArrowError::CastError(format!("Cannot cast {} to usize", self))),
        }
    }

And for unsupported types, we could have a default implementation:

    /// Convert native type to usize.
    #[inline]
    fn to_usize(&self) -> Result<usize> {
        Err(ArrowError::CastError(format!(
            "Casting {} to usize is not supported.",
            type_name::<Self>()
        )))
    }
The rationale is likely that there is only one possible error variant, and so it is up to the caller to add this context in the manner appropriate to how they have chosen to handle errors - be it panic, anyhow, custom error enumerations, etc...
As for the semantics on ArrowNativeType, I happen to not be a fan of returning None for "unsupported" floating point types - my 2 cents is we should just use num-traits and not roll our own 😅
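The point above, that the caller should attach error context to the single None case, can be sketched without any Arrow types. This is a hypothetical, self-contained example (the names `CastError` and `to_usize_checked` are made up for illustration; the std `TryInto` trait stands in for num-traits' ToPrimitive):

```rust
use std::fmt::Display;

// Hypothetical error type standing in for ArrowError::CastError.
#[derive(Debug)]
struct CastError(String);

// The caller turns the lossy conversion's failure into a domain error
// with the context it cares about, rather than the trait choosing one.
fn to_usize_checked<T: Copy + Display + TryInto<usize>>(v: T) -> Result<usize, CastError> {
    let msg = format!("Cannot cast {} to usize", v);
    v.try_into().map_err(|_| CastError(msg))
}

fn main() {
    assert_eq!(to_usize_checked(5i64).unwrap(), 5usize);
    assert!(to_usize_checked(-1i64).is_err());
    println!("ok");
}
```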
    array
        .data_ref()
        .null_buffer()
        .map(|b| b.bit_slice(array.offset(), array.len())),
    offsets.append(OffsetSize::zero());
    let length = length.map(|len| len.to_usize().unwrap());

    array.iter().for_each(|val| {
A further improvement might be to iterate the offsets directly, as we don't actually care about skipping over nulls
I am not 100% sure whether we could get a further speed-up from this, because for each value slot we still have to reinterpret it as a string and scan it (using char_indices) to find the start and end byte offsets.
https://doc.rust-lang.org/std/str/fn.from_utf8_unchecked.html is free and safe in this context FWIW
Our GenericStringIter is implemented based on this, which is why I guess array.iter() is already close to optimal.
BTW, skipping nulls should not be the hotspot in this function, because it is just a bitwise operation.
because it is just a bitwise operation
I would not underestimate how remarkably slow bitwise operations can be, especially when used to predicate branches. But yes, likely for this kernel the bottleneck lies elsewhere
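The per-value scan discussed in this thread can be sketched in isolation. This is an illustrative, self-contained version (the helper name `char_range` is made up; the real kernel works over Arrow buffers): char_indices walks the UTF-8 bytes once to map character positions [start, start+length) to a byte range, which is why this scan, rather than null skipping, is the likely bottleneck.

```rust
// Find the byte range covering characters [start, start+length) of s.
// Falls back to s.len() when the string is shorter than requested.
fn char_range(s: &str, start: usize, length: usize) -> (usize, usize) {
    let begin = s.char_indices().nth(start).map_or(s.len(), |(i, _)| i);
    let end = s.char_indices().nth(start + length).map_or(s.len(), |(i, _)| i);
    (begin, end)
}

fn main() {
    let s = "héllo"; // 'é' occupies 2 bytes, so byte and char indices diverge
    let (b, e) = char_range(s, 1, 2);
    assert_eq!(&s[b..e], "él");
    println!("{}", &s[b..e]);
}
```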
Which issue does this PR close?
Closes #1800.
What changes are included in this PR?
Directly copy the string slice to BufferBuilder.
Are there any user-facing changes?
No.
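The change described above (copy each substring's bytes directly into the values buffer while tracking a running offset) can be sketched as follows. This is a minimal illustration with assumed names: a plain Vec<u8> stands in for Arrow's BufferBuilder, `build` is a hypothetical helper, and nulls are omitted for brevity.

```rust
// Sketch of the PR's approach: one pass per value, appending the
// substring's raw bytes and recording the cumulative byte offset.
fn build(values: &[&str], start: usize, length: usize) -> (Vec<u8>, Vec<i32>) {
    let mut vals: Vec<u8> = Vec::new();
    let mut offsets: Vec<i32> = vec![0];
    for v in values {
        // locate the byte range of characters [start, start+length)
        let begin = v.char_indices().nth(start).map_or(v.len(), |(i, _)| i);
        let end = v.char_indices().nth(start + length).map_or(v.len(), |(i, _)| i);
        // copy the slice straight into the values buffer
        vals.extend_from_slice(&v.as_bytes()[begin..end]);
        offsets.push(vals.len() as i32);
    }
    (vals, offsets)
}

fn main() {
    let (vals, offsets) = build(&["hello", "world"], 1, 2);
    assert_eq!(&vals, b"elor");
    assert_eq!(offsets, vec![0, 2, 4]);
    println!("{}", String::from_utf8(vals).unwrap());
}
```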
Benchmark
About a 2.5x speed-up.