speed up substring_by_char by about 2.5x #1832
Conversation
Signed-off-by: remzi <[email protected]>
Codecov Report
@@ Coverage Diff @@
## master #1832 +/- ##
==========================================
- Coverage 83.53% 83.49% -0.05%
==========================================
Files 200 201 +1
Lines 56798 56903 +105
==========================================
+ Hits 47449 47510 +61
- Misses 9349 9393 +44
Continue to review full report at Codecov.
    let length = length.map_or(char_count - start, |length| {
        length.to_usize().unwrap().min(char_count - start)
    });
    let mut vals = BufferBuilder::<u8>::new(array.value_data().len());
This will always allocate more memory than we need, but it's no big deal.
If you wanted to improve this estimate you could compute the difference between the first and last offset. It would still be an over-estimate, but would over-estimate sliced arrays less
Did you want to do this, or should I just get this in?
I think this is good enough. It has O(n) space complexity (where n is the byte length of the input array's value buffer) and allocates at most n extra bytes beyond what we need (when length = 0).
Sorry I wasn't clear, the problem here is that the values buffer may be substantially larger than necessary as a result of slicing.
Say I have an array of 1000 strings, and I take a slice of length two, the values array is completely unchanged, but the offsets array is now length of 3, instead of 1001.
By taking the delta between the first and last offset in the source array, you ensure you are not allocating capacity that you definitely don't need; technically that capacity isn't even needed by the source array.
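The estimate discussed above can be sketched as follows. This is a minimal, self-contained illustration (plain slices stand in for Arrow's offsets buffer, and `sliced_capacity` is a hypothetical helper name): the bytes a sliced string array can actually reference is the span between its first and last offset, which is never larger than the full values buffer.

```rust
// Sketch: the capacity actually reachable through a (possibly sliced)
// string array's offsets is last_offset - first_offset.
fn sliced_capacity(offsets: &[i32]) -> usize {
    match (offsets.first(), offsets.last()) {
        (Some(&first), Some(&last)) => (last - first) as usize,
        _ => 0,
    }
}

fn main() {
    // 1000 one-byte strings flattened: offsets 0..=1000
    let offsets: Vec<i32> = (0..=1000).collect();
    // a slice of length 2 starting at index 10 keeps offsets [10, 11, 12]
    let slice = &offsets[10..=12];
    // only 2 bytes are reachable, vs. 1000 in the full values buffer
    assert_eq!(sliced_capacity(slice), 2);
    println!("{}", sliced_capacity(slice));
}
```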
Thank you for your explanation, which makes perfect sense.
I forgot about offsets and slicing. 😅 I will fix this.
Updated! No impact on the benchmark results.
        vec![],
    )
    };
    Ok(GenericStringArray::<OffsetSize>::from(data))
The Ok() here is just to keep this function consistent with substring. It is otherwise meaningless, because this function never returns an error. (Actually, I don't like unwrap or panic in a function returning Result, but the code would be much less clear if I removed unwrap: lots of ok_or(...).)
Panicking if an invariant is violated I think makes perfect sense, errors only really make sense imo if you expect some higher level system to catch and handle that error.
    if let Some(val) = val {
        let char_count = val.chars().count();
        let start = if start >= 0 {
            start.to_usize().unwrap()
I am curious why to_usize() returns Option<usize> and not Result<usize>.
Unfortunately, the rust std library also chooses Option:

    /// Converts the value of `self` to a `usize`. If the value cannot be
    /// represented by a `usize`, then `None` is returned.
    #[inline]
    fn to_usize(&self) -> Option<usize> {
        self.to_u64().as_ref().and_then(ToPrimitive::to_usize)
    }

But Result seems more reasonable to me, because we could tell users why the cast failed:

    #[inline]
    fn to_usize(&self) -> Result<usize> {
        match num::ToPrimitive::to_usize(self) {
            Some(val) => Ok(val),
            None => Err(ArrowError::CastError(format!("Cannot cast {} to usize", self))),
        }
    }

And for unsupported types, we could have a default implementation:

    /// Convert native type to usize.
    #[inline]
    fn to_usize(&self) -> Result<usize> {
        Err(ArrowError::CastError(format!(
            "Casting {} to usize is not supported.",
            type_name::<Self>()
        )))
    }
The rationale is likely that there is only one possible error variant, and so it is up to the caller to add this context in the manner appropriate to how they have chosen to handle errors - be it panic, anyhow, custom error enumerations, etc...
As for the semantics on ArrowNativeType, I happen to not be a fan of returning None for "unsupported" floating point types - my 2 cents is we should just use num-traits and not roll our own 😅
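The point above, that the caller should attach error context to the single None case, can be sketched without any Arrow types. This is a hypothetical, self-contained example (the names `CastError` and `to_usize_checked` are made up for illustration; the std `TryInto` trait stands in for num-traits' ToPrimitive):

```rust
use std::fmt::Display;

// Hypothetical error type standing in for ArrowError::CastError.
#[derive(Debug)]
struct CastError(String);

// The caller turns the lossy conversion's failure into a domain error
// with the context it cares about, rather than the trait choosing one.
fn to_usize_checked<T: Copy + Display + TryInto<usize>>(v: T) -> Result<usize, CastError> {
    let msg = format!("Cannot cast {} to usize", v);
    v.try_into().map_err(|_| CastError(msg))
}

fn main() {
    assert_eq!(to_usize_checked(5i64).unwrap(), 5usize);
    assert!(to_usize_checked(-1i64).is_err());
    println!("ok");
}
```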
    array
        .data_ref()
        .null_buffer()
        .map(|b| b.bit_slice(array.offset(), array.len())),
    offsets.append(OffsetSize::zero());
    let length = length.map(|len| len.to_usize().unwrap());

    array.iter().for_each(|val| {
A further improvement might be to iterate the offsets directly, as we don't actually care about skipping over nulls
I am not 100% sure whether we could get a further speed-up from this, because for each value slot we still have to reinterpret it as a string and scan it (using char_indices) to find the start and end byte offsets.
https://doc.rust-lang.org/std/str/fn.from_utf8_unchecked.html is free and safe in this context FWIW
Our GenericStringIter is implemented based on this, which is why I guess array.iter() is already close to optimal.
BTW, skipping nulls should not be the hotspot in this function, because it is just a bitwise operation.
because it is just a bitwise operation
I would not underestimate how remarkably slow bitwise operations can be, especially when used to predicate branches. But yes, likely for this kernel the bottleneck lies elsewhere
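The per-value scan discussed in this thread can be sketched in isolation. This is an illustrative, self-contained version (the helper name `char_range` is made up; the real kernel works over Arrow buffers): char_indices walks the UTF-8 bytes once to map character positions [start, start+length) to a byte range, which is why this scan, rather than null skipping, is the likely bottleneck.

```rust
// Find the byte range covering characters [start, start+length) of s.
// Falls back to s.len() when the string is shorter than requested.
fn char_range(s: &str, start: usize, length: usize) -> (usize, usize) {
    let begin = s.char_indices().nth(start).map_or(s.len(), |(i, _)| i);
    let end = s.char_indices().nth(start + length).map_or(s.len(), |(i, _)| i);
    (begin, end)
}

fn main() {
    let s = "héllo"; // 'é' occupies 2 bytes, so byte and char indices diverge
    let (b, e) = char_range(s, 1, 2);
    assert_eq!(&s[b..e], "él");
    println!("{}", &s[b..e]);
}
```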
Which issue does this PR close?
Closes #1800.
What changes are included in this PR?
Directly copy the string slice to BufferBuilder.
Are there any user-facing changes?
No.
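The change described above (copy each substring's bytes directly into the values buffer while tracking a running offset) can be sketched as follows. This is a minimal illustration with assumed names: a plain Vec<u8> stands in for Arrow's BufferBuilder, `build` is a hypothetical helper, and nulls are omitted for brevity.

```rust
// Sketch of the PR's approach: one pass per value, appending the
// substring's raw bytes and recording the cumulative byte offset.
fn build(values: &[&str], start: usize, length: usize) -> (Vec<u8>, Vec<i32>) {
    let mut vals: Vec<u8> = Vec::new();
    let mut offsets: Vec<i32> = vec![0];
    for v in values {
        // locate the byte range of characters [start, start+length)
        let begin = v.char_indices().nth(start).map_or(v.len(), |(i, _)| i);
        let end = v.char_indices().nth(start + length).map_or(v.len(), |(i, _)| i);
        // copy the slice straight into the values buffer
        vals.extend_from_slice(&v.as_bytes()[begin..end]);
        offsets.push(vals.len() as i32);
    }
    (vals, offsets)
}

fn main() {
    let (vals, offsets) = build(&["hello", "world"], 1, 2);
    assert_eq!(&vals, b"elor");
    assert_eq!(offsets, vec![0, 2, 4]);
    println!("{}", String::from_utf8(vals).unwrap());
}
```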
Benchmark
About a 2.5x speed-up.