Conversation
alamb
left a comment
Looks very nice to me @JasonLi-cn -- thank you
```rust
struct StringArrayBuilder {
```
I think some comments explaining how this is different from https://docs.rs/arrow/latest/arrow/array/type.StringBuilder.html would help. Maybe simply a note that it doesn't check UTF-8 again?
I wonder if we could get the same effect by adding an unsafe function to `StringBuilder`, like

```rust
/// Adds bytes to the in progress string, without checking for valid utf8
///
/// Safety: requires that bytes are valid utf8, otherwise an invalid StringArray will result
unsafe fn append_unchecked(&mut self, bytes: &[u8])
```

and then using `StringBuilder` here.
The `Write` trait impl of `StringBuilder` can meet my current needs, but it is not very convenient to use, so I've defined a `StringArrayBuilder`. I agree with your suggestion to add an unsafe function to `StringBuilder`:
```rust
let mut builder = GenericStringBuilder::<i32>::new();

// Write data
write!(builder, "foo").unwrap();
write!(builder, "bar").unwrap();

// Finish value
builder.append_value("baz");

// Write second value
write!(builder, "v2").unwrap();
builder.append_value("");

let array = builder.finish();
assert_eq!(array.value(0), "foobarbaz");
assert_eq!(array.value(1), "v2");
```
Another issue is that we don't need `NullBufferBuilder` here, but `GenericByteBuilder` defaults to using `NullBufferBuilder`, which I believe introduces unnecessary overhead.
```rust
fn write<const CHECK_VALID: bool>(&mut self, column: &ColumnarValueRef, i: usize) {
    match column {
        ColumnarValueRef::Scalar(s) => {
            self.value_buffer.extend_from_slice(s);
```
Is the primary speed savings gained from not checking UTF8 validity (and just copying byte slices)?
- My initial intention for optimizing this function was to avoid creating a new `String` and then calling `push_str` each time in the `concat` function, like:
```rust
let mut owned_string: String = "".to_owned();
for arg in args {
    match arg {
        ColumnarValue::Scalar(ScalarValue::Utf8(maybe_value)) => {
            if let Some(value) = maybe_value {
                owned_string.push_str(value);
            }
        }
        ColumnarValue::Array(v) => {
            if v.is_valid(index) {
                let v = as_string_array(v).unwrap();
                owned_string.push_str(v.value(index));
            }
        }
        _ => unreachable!(),
    }
}
Some(owned_string)
```

- Additionally, by precalculating the expected length of the result, I avoided the need to reallocate memory.
- I used the `extend_from_slice` function because I referred to the `append_slice` function of `BufferBuilder`:

```rust
#[inline]
pub fn append_slice(&mut self, slice: &[T]) {
    self.buffer.extend_from_slice(slice);
    self.len += slice.len();
}
```
I also filed #9742 to track this improvement
|
Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look
Commits (force-pushed from `ceba13d` to `9d985c1`):

- fix: concat_ws
- chore: add license header
- add arrow feature
- update concat
alamb
left a comment
Thank you @JasonLi-cn -- I think this looks good to me.
I had some suggestions to improve the comments, but I think we could do that as a follow-on PR if you prefer.
```rust
let mut offsets_buffer = MutableBuffer::with_capacity(
    (item_capacity + 1) * std::mem::size_of::<i32>(),
);
unsafe { offsets_buffer.push_unchecked(0_i32) };
```
Is it really necessary to avoid the bounds check for a single offset?
Since it is safe here and there is a theoretical performance improvement, I have opted to use `push_unchecked` in this case.
```rust
    unsafe { self.offsets_buffer.push_unchecked(next_offset) };
}

fn finish(self, null_buffer: Option<NullBuffer>) -> StringArray {
```
I was trying to think whether there is some way to create an invalid result with this API, and I think the answer is no (even if `append_offset` is never called, the offsets would still be valid).
|
Thanks again @JasonLi-cn 🙏
Which issue does this PR close?
Closes #9742
Rationale for this change
Optimize the `concat` and `concat_ws` functions.

Benchmark (only `concat`)
```
Gnuplot not found, using plotters backend
concat function/concat(old)/1024  time: [91.562 µs 91.725 µs 91.905 µs]
  Found 8 outliers among 100 measurements (8.00%): 6 (6.00%) high mild, 2 (2.00%) high severe
concat function/concat(new)/1024  time: [17.831 µs 17.877 µs 17.934 µs]
  Found 9 outliers among 100 measurements (9.00%): 7 (7.00%) high mild, 2 (2.00%) high severe
concat function/concat(old)/4096  time: [357.95 µs 358.98 µs 360.02 µs]
  Found 1 outliers among 100 measurements (1.00%): 1 (1.00%) high severe
concat function/concat(new)/4096  time: [71.003 µs 71.168 µs 71.326 µs]
  Found 1 outliers among 100 measurements (1.00%): 1 (1.00%) high mild
concat function/concat(old)/8192  time: [758.82 µs 761.43 µs 764.82 µs]
  Found 2 outliers among 100 measurements (2.00%): 1 (1.00%) high mild, 1 (1.00%) high severe
concat function/concat(new)/8192  time: [141.11 µs 141.44 µs 141.79 µs]
```

Attention
For the purpose of benchmarking, I haven't officially replaced the `concat` and `concat_ws` functions yet. If the community finds this PR meaningful, I will proceed with the replacement.

What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?