Conversation
Signed-off-by: remzi <[email protected]>
Signed-off-by: remzi <[email protected]>
Signed-off-by: remzi <[email protected]>
Signed-off-by: remzi <[email protected]>
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #1784 +/- ##
==========================================
+ Coverage 83.39% 83.42% +0.02%
==========================================
Files 198 198
Lines 56142 56303 +161
==========================================
+ Hits 46822 46969 +147
- Misses 9320 9334 +14 ☔ View full report in Codecov by Sentry. |
tustvold
left a comment
There was a problem hiding this comment.
I think this looks good, left some comments on how it could be improved - but happy for this to be a follow up PR
| length.to_usize().unwrap().min(char_count - start) | ||
| }); | ||
|
|
||
| val.chars().skip(start).take(length).collect::<String>() |
There was a problem hiding this comment.
Allocating a string temporary, only to copy out of it, is likely a significant portion of the slow-down. That combined with the null handling.
This could definitely be handled as a separate PR, but you might want to consider doing something like (not tested).
let nulls = // align bitmap to 0 offset, copying if already aligned
let mut vals = BufferBuilder::<u8>::new(array.value_data().len());
let mut indexes = BufferBuilder::<OffsetSize>::new(array.len() + 1);
indexes.append(0);
for val in array.iter() {
let char_count = val.chars().count();
let start = if start >= 0 {
start.to_usize().unwrap().min(char_count)
} else {
char_count - (-start).to_usize().unwrap().min(char_count)
};
let length = length.map_or(char_count - start, |length| {
length.to_usize().unwrap().min(char_count - start)
});
let mut start_byte = 0;
let mut end_byte = val.len();
for ((idx, (byte_idx, _)) in val.char_indices().enumerate() {
if idx == start {
start_byte = byte_idx;
} else if idx == start + length {
end_byte = byte_idx;
break
}
}
// Could even be unchecked
vals.append_slice(&val[start_byte..end_byte]);
indexes.append(vals.len() as _);
}
let data = ArrayDataBuilder::new(array.data_type()).len(array.len()).add_buffer(vals.finish())
.add_buffer(indexes.finish()).add_buffer(vals.finish());
Ok(GenericStringArray::<OffsetSize>::from(unsafe {data.build_unchecked()}))
| (input_vals.clone(), -4, Some(4), vec!["ello", "", "⊢x:T"]), | ||
| ]; | ||
|
|
||
| cases.into_iter().try_for_each::<_, Result<()>>( |
There was a problem hiding this comment.
Perhaps we could extract this as a function instead of duplicating it?
| } | ||
|
|
||
| fn generic_string_by_char_with_non_zero_offset<O: OffsetSizeTrait>() -> Result<()> { | ||
| let values = "S→T = Πx:S.T"; |
There was a problem hiding this comment.
I think it would be easier to follow to just construct the regular array, and then call array.slice(1, 2)
| generic_string_with_non_zero_offset::<i64>() | ||
| } | ||
|
|
||
| fn with_nulls_generic_string_by_char<O: OffsetSizeTrait>() -> Result<()> { |
There was a problem hiding this comment.
Loving the test coverage 👍
Co-authored-by: Raphael Taylor-Davies <[email protected]>
Co-authored-by: Raphael Taylor-Davies <[email protected]>
Co-authored-by: Raphael Taylor-Davies <[email protected]>
Co-authored-by: Raphael Taylor-Davies <[email protected]>
Which issue does this PR close?
Closes #1768.
Rationale for this change
Support substring by char
What changes are included in this PR?
substring_by_charPerformance
1/6 of the speed of
substring(by byte)Are there any user-facing changes?
Yes.