Support Utf8View for bit_length kernel#6671
Conversation
Signed-off-by: Austin Liu <[email protected]>
tustvold
left a comment
There was a problem hiding this comment.
I think this needs a test, which I think will show this not working
Signed-off-by: Austin Liu <[email protected]> Add test & handle view bytes length counting Signed-off-by: Austin Liu <[email protected]>
alamb
left a comment
There was a problem hiding this comment.
Thanks @austin362667 -- this is looking good. I started the CI and left some comments
Signed-off-by: Austin Liu <[email protected]>
arrow-string/src/length.rs
Outdated
| let bit_lengths = list | ||
| .views() | ||
| .iter() | ||
| .map(|view| (*view as i32) * 8) |
There was a problem hiding this comment.
I think the length is currently treated as a u32 elsewhere in the code
However the actual Arrow spec I think says use i32 so this is probably fine
There was a problem hiding this comment.
I also still do think you need to handle the null cases (check for which slots are null) as there might be a view of a non zero length there
Signed-off-by: Austin Liu <[email protected]>
arrow-string/src/length.rs
Outdated
| let values = list | ||
| .views() | ||
| .iter() | ||
| .enumerate() | ||
| .map(|(i, _)| { | ||
| if list.is_valid(i) { | ||
| list.views()[i] as u32 * 8 | ||
| } else { | ||
| 0 |
There was a problem hiding this comment.
In #6662 (comment) it was suggested to use from_unary.
It this applicable here?
There was a problem hiding this comment.
@findepi Thanks for bringing this up. I tried it and it works trivially well.
There was a problem hiding this comment.
Using from_unary reverts the optimisation of only inspecting the views, and not the string data.
Perhaps you might use the example I wrote in #6671 (comment)
arrow-string/src/length.rs
Outdated
| let result = bit_length(&array).unwrap(); | ||
| assert_eq!(len, result.len()); | ||
| let result = result.as_any().downcast_ref::<Int32Array>().unwrap(); | ||
| let result = result.as_any().downcast_ref::<UInt32Array>().unwrap(); |
There was a problem hiding this comment.
I agree that bit_length should not beed to return signed integer, but it seems better not to change this within this PR.
Signed-off-by: Austin Liu <[email protected]>
arrow-string/src/length.rs
Outdated
| let values = (0..list.len()) | ||
| .map(|i| { | ||
| if list.is_valid(i) { | ||
| list.views()[i] as u32 * 8 | ||
| } else { | ||
| 0 | ||
| } | ||
| }) | ||
| .collect::<Vec<u32>>(); | ||
| Ok(Arc::new(UInt32Array::new( | ||
| values.into(), |
There was a problem hiding this comment.
| let values = (0..list.len()) | |
| .map(|i| { | |
| if list.is_valid(i) { | |
| list.views()[i] as u32 * 8 | |
| } else { | |
| 0 | |
| } | |
| }) | |
| .collect::<Vec<u32>>(); | |
| Ok(Arc::new(UInt32Array::new( | |
| values.into(), | |
| let values = list.views().iter().map(|view| (view as i32).wrapping_mul(8)).collect(); | |
| Ok(Arc::new(Int32Array::new( | |
| values, |
As we're doing infallible wrapping arithmetic, we can just compute this for everything, this will vectorize better, and the null mask is preserved below
There was a problem hiding this comment.
Wow! Thank you so much @tustvold , you are indeed making me a better Rust programmer. Really appreciate for review. 👍
tustvold
left a comment
There was a problem hiding this comment.
I think this can be further simplified, I also think we should return Int32Array for consistency with the other implementations.
Signed-off-by: Austin Liu <[email protected]>
Signed-off-by: Austin Liu <[email protected]>
|
Switching to from_unary has reverted the optimisation to not inspect the string data - #6671 (comment) is how to do this correctly |
Signed-off-by: Austin Liu <[email protected]>
Signed-off-by: Austin Liu <[email protected]>
There was a problem hiding this comment.
Looks great to me -- thank you @austin362667 and @tustvold and @findepi
Utf8View for string function bit_length()Utf8View for bit_length kernel
Which issue does this PR close?
Closes apache/datafusion#13195
Rationale for this change
Thanks to @jayzhan211 , he noticed following issue, array compute kernel
bit_length()doesn't supportUtf8Viewtype:caused the error:
What changes are included in this PR?
Update
bit_length()array function to supportUtf8ViewAre there any user-facing changes?