ARROW-11651: [Rust][DataFusion] Implement Postgres String Functions: Length Functions#9509
ARROW-11651: [Rust][DataFusion] Implement Postgres String Functions: Length Functions#9509seddonm1 wants to merge 7 commits intoapache:masterfrom
Conversation
There was a problem hiding this comment.
Thanks a lot @seddonm1 . This looks great to me.
I left two comments that imo should be addressed related to reducing the maintenance cost by DRY.
| fn bit_length_string<OffsetSize>(array: &Array, data_type: DataType) -> ArrayRef | ||
| where | ||
| OffsetSize: OffsetSizeTrait, | ||
| { | ||
| // note: offsets are stored as u8, but they can be interpreted as OffsetSize | ||
| let offsets = &array.data_ref().buffers()[0]; | ||
| // this is a 30% improvement over iterating over u8s and building OffsetSize, which | ||
| // justifies the usage of `unsafe`. | ||
| let slice: &[OffsetSize] = | ||
| &unsafe { offsets.typed_data::<OffsetSize>() }[array.offset()..]; | ||
|
|
||
| let bit_size = OffsetSize::from_usize(8).unwrap(); | ||
| let lengths = slice | ||
| .windows(2) | ||
| .map(|offset| (offset[1] - offset[0]) * bit_size); | ||
|
|
||
| // JUSTIFICATION | ||
| // Benefit | ||
| // ~60% speedup | ||
| // Soundness | ||
| // `values` is an iterator with a known size. | ||
| let buffer = unsafe { Buffer::from_trusted_len_iter(lengths) }; | ||
|
|
There was a problem hiding this comment.
Isn't this a copy of length, with a single difference, that we multiply the lengths by bit_size?
If that is the case, can we create a generic or something that avoids code duplication? Copy-pasting code like this is IMO an unnecessary maintenance burden.
There was a problem hiding this comment.
yes i can have a look at doing that.
There was a problem hiding this comment.
I have created a macro allowing just the core function to be replaced.
| /// $ARRAY_TYPE is the column type after function applied | ||
| macro_rules! test_function { | ||
| ($FUNC:ident, $ARGS:expr, $EXPECTED:expr, $EXPECTED_TYPE:ty, $DATA_TYPE: ident, $ARRAY_TYPE:ident) => { | ||
| println!("{:?}", BuiltinScalarFunction::$FUNC); |
There was a problem hiding this comment.
is this println! intended?
There was a problem hiding this comment.
it is just there to help debugging but happy to remove
|
|
||
| /// Returns number of characters in the string. | ||
| /// character_length_i64('josé') = 4 | ||
| pub fn character_length_i64(args: &[ArrayRef]) -> Result<ArrayRef> { |
There was a problem hiding this comment.
I think that this code can be reduced to a single generic. Something like this (untested):
pub fn character_length<T: ArrowPrimitiveType>(args: &[ArrayRef]) -> Result<ArrayRef>
where
T::Native: StringOffsetSizeTrait {
let string_array: &GenericStringArray<T::Native> = args[0]
.as_any()
.downcast_ref::<GenericStringArray<T::Native>>()
.unwrap();
string_array
.iter()
.map(|x| x.map(|x: &str| T::Native::from_usize(x.graphemes(true).count())).unwrap())
.collect()
}There was a problem hiding this comment.
Thanks I will have a go. There are a couple more instances this generic from_usize could help.
There was a problem hiding this comment.
Wow. Impressive you can come up with this in your head. I have incorporated this code.
|
Thanks @jorgecarleitao . I will update soon. |
|
@jorgecarleitao Thanks for the review. I have addressed all of your points. The |
|
Thanks a lot, @seddonm1 . I went through this PR again and I pull the code locally to understand the reason for the macro. I now get it that it is not easy to write this. :) I PRed something to your branch with a suggestion to reduce the risk of using that function (which is unsafe due to the |
Added some safety.
Codecov Report
@@ Coverage Diff @@
## master #9509 +/- ##
==========================================
+ Coverage 82.15% 82.28% +0.12%
==========================================
Files 243 244 +1
Lines 54908 55517 +609
==========================================
+ Hits 45108 45680 +572
- Misses 9800 9837 +37
Continue to review full report at Codecov.
|
|
@jorgecarleitao Sorry I updated this but forgot to update you. Once this is merged I can do more of the big Postgres string functions PRs. |
|
The integration test failure appears unrelated https://github.com/apache/arrow/pull/9509/checks?check_run_id=1921983282 I will restart it to try and get a clean run |
|
@seddonm1 we are so close -- now there is a clippy error: https://github.com/apache/arrow/pull/9509/checks?check_run_id=1922005826 |
|
@alamb done and ready. Thanks to @jorgecarleitao for his elite knowledge of trait implementations. |
jorgecarleitao
left a comment
There was a problem hiding this comment.
Thanks a lot, @seddonm1 ! Ready to go from my side.
|
Thanks again @seddonm1 -- this is epic work |
Splitting up #9243
This implements the following functions: