Conversation
alamb
left a comment
Looks very nice to me @JasonLi-cn -- thank you
```rust
struct StringArrayBuilder {
```
I think some comments explaining how this is different from https://docs.rs/arrow/latest/arrow/array/type.StringBuilder.html would help. Maybe simply a note that it doesn't check UTF-8 again?
I wonder if we could get the same effect by adding an unsafe function to `StringBuilder`, like

```rust
/// Adds bytes to the in progress string, without checking for valid utf8
///
/// Safety: requires that bytes are valid utf8, otherwise an invalid StringArray will result
unsafe fn append_unchecked(&mut self, bytes: &[u8])
```

and then using `StringBuilder` here.
The `Write` trait impl of `StringBuilder` can meet my current needs, but it is not very convenient to use, so I've defined a `StringArrayBuilder`. I agree with your suggestion to add an unsafe function to `StringBuilder`:
```rust
let mut builder = GenericStringBuilder::<i32>::new();

// Write data
write!(builder, "foo").unwrap();
write!(builder, "bar").unwrap();

// Finish value
builder.append_value("baz");

// Write second value
write!(builder, "v2").unwrap();
builder.append_value("");

let array = builder.finish();
assert_eq!(array.value(0), "foobarbaz");
assert_eq!(array.value(1), "v2");
```
Another issue is that we don't need `NullBufferBuilder` here, but `GenericByteBuilder` defaults to using `NullBufferBuilder`, which I believe introduces unnecessary overhead.
```rust
fn write<const CHECK_VALID: bool>(&mut self, column: &ColumnarValueRef, i: usize) {
    match column {
        ColumnarValueRef::Scalar(s) => {
            self.value_buffer.extend_from_slice(s);
```
Is the primary speed savings gained from not checking UTF8 validity (and just copying byte slices)?
- My initial intention for optimizing this function was to avoid creating a new `String` and then calling `push_str` each time in the `concat` function, like:
```rust
let mut owned_string: String = "".to_owned();
for arg in args {
    match arg {
        ColumnarValue::Scalar(ScalarValue::Utf8(maybe_value)) => {
            if let Some(value) = maybe_value {
                owned_string.push_str(value);
            }
        }
        ColumnarValue::Array(v) => {
            if v.is_valid(index) {
                let v = as_string_array(v).unwrap();
                owned_string.push_str(v.value(index));
            }
        }
        _ => unreachable!(),
    }
}
Some(owned_string)
```

- Additionally, by precalculating the expected length of the result, I avoided the need to reallocate memory.
- I used the `extend_from_slice` function because I referred to the `append_slice` function of `BufferBuilder`:

```rust
#[inline]
pub fn append_slice(&mut self, slice: &[T]) {
    self.buffer.extend_from_slice(slice);
    self.len += slice.len();
}
```
I also filed #9742 to track this improvement
|
Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look
Commits (force-pushed from `ceba13d` to `9d985c1`):

- fix: concat_ws
- chore: add license header
- add arrow feature
- update concat
alamb
left a comment
Thank you @JasonLi-cn -- I think this looks good to me.
I had some suggestions to improve the comments, but I think we could do that as a follow-on PR if you prefer.
```rust
let mut offsets_buffer = MutableBuffer::with_capacity(
    (item_capacity + 1) * std::mem::size_of::<i32>(),
);
unsafe { offsets_buffer.push_unchecked(0_i32) };
```
Is it really necessary to avoid the bounds check for a single offset?
Since it is safe here and there is a theoretical performance improvement, I have opted to use `push_unchecked` in this case.
```rust
    unsafe { self.offsets_buffer.push_unchecked(next_offset) };
}

fn finish(self, null_buffer: Option<NullBuffer>) -> StringArray {
```
I was trying to think whether there is some way to create an invalid result with this API, and I think the answer is no (even if `append_offset` is never called, the offsets would still be valid).
|
Thanks again @JasonLi-cn 🙏
Which issue does this PR close?
Closes #9742
Rationale for this change
Optimize the `concat` and `concat_ws` functions.

Benchmark (only `concat`)
```
Gnuplot not found, using plotters backend
concat function/concat(old)/1024  time: [91.562 µs 91.725 µs 91.905 µs]
  Found 8 outliers among 100 measurements (8.00%): 6 (6.00%) high mild, 2 (2.00%) high severe
concat function/concat(new)/1024  time: [17.831 µs 17.877 µs 17.934 µs]
  Found 9 outliers among 100 measurements (9.00%): 7 (7.00%) high mild, 2 (2.00%) high severe
concat function/concat(old)/4096  time: [357.95 µs 358.98 µs 360.02 µs]
  Found 1 outliers among 100 measurements (1.00%): 1 (1.00%) high severe
concat function/concat(new)/4096  time: [71.003 µs 71.168 µs 71.326 µs]
  Found 1 outliers among 100 measurements (1.00%): 1 (1.00%) high mild
concat function/concat(old)/8192  time: [758.82 µs 761.43 µs 764.82 µs]
  Found 2 outliers among 100 measurements (2.00%): 1 (1.00%) high mild, 1 (1.00%) high severe
concat function/concat(new)/8192  time: [141.11 µs 141.44 µs 141.79 µs]
```

Attention
For the purpose of benchmarking, I haven't officially replaced the `concat` and `concat_ws` functions yet. If the community finds this PR meaningful, I will proceed with the replacement.

What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?