ARROW-4605: [Rust] Move filter and limit code from DataFusion into compute module#3741
ARROW-4605: [Rust] Move filter and limit code from DataFusion into compute module#3741ntrinquier wants to merge 14 commits intoapache:masterfrom ntrinquier:ARROW-4605
Conversation
rust/arrow/src/compute/array_ops.rs
Outdated
| } | ||
|
|
||
| pub fn limit(array: &Array, num_rows_to_read: usize) -> Result<ArrayRef> { | ||
| if array.len() < num_rows_to_read { |
There was a problem hiding this comment.
I'm curious as to why we return an error here. Why not set the limit to the lower of the 2 instead?
vec![1,2,3,4,5].iter().take(7) doesn't throw an error, and SELECT * FROM table_with_100_records LIMIT 200 will also only return the 100 records. What do you think about the suggestion?
There was a problem hiding this comment.
My initial thought was that if someone calls this function with a num_rows_to_read larger than array.len(), they will probably want to know rather than silently choosing the minimum.
However given take and LIMIT behavior this is indeed probably not what is expected from this function. I will change it, good call.
There was a problem hiding this comment.
I think if you limit the num_rows_to_read to the array length.
Have a look at https://issues.apache.org/jira/browse/ARROW-3954, which will make filter more efficient in future.
…String > &str > &[u8]
create BinaryArray directly from byte slice
| /// | ||
| /// Returns the whole array if the number of elements specified is larger than the length of the array | ||
| pub fn limit(array: &Array, num_elements: usize) -> Result<ArrayRef> { | ||
| let num_elements_safe: usize = cmp::min(array.len(), num_elements); |
There was a problem hiding this comment.
One last nit, we could return the array as ArrayRef immediately if the limit >= len. I sold have thought of it earlier, my apologies.
I'm happy with everything else
There was a problem hiding this comment.
Maybe you can help me here: how can I wrap the reference to array in an Arc<Array> / ArrayRef? Since the reference to array can be freed at any time it would leave the Arc invalid, so I have to specify the lifetime somehow. Can I do that with Arc?
There was a problem hiding this comment.
No, you're right, I missed that part. We can improve limit when we have zero-copy array slicing 👍🏾
andygrove
left a comment
There was a problem hiding this comment.
LGTM but I made one suggestion
rust/arrow/src/array.rs
Outdated
|
|
||
| impl<'a> From<Vec<&[u8]>> for BinaryArray { | ||
| fn from(v: Vec<&[u8]>) -> Self { | ||
| let mut offsets = vec![]; |
There was a problem hiding this comment.
It would be more efficient to initialize these vectors with Vec::with_capacity(v.len())
rust/arrow/src/array.rs
Outdated
There was a problem hiding this comment.
Sorry, I guess this should be Vec::with_capacity(v.len() + 1). Also you have an extra ;
andygrove
left a comment
There was a problem hiding this comment.
Sorry, one more comment ...
rust/arrow/src/array.rs
Outdated
There was a problem hiding this comment.
After reviewing this again, I realize that this change is good for offsets but maybe not for values. We can only allocate values to the correct size if we first go and sum the lengths of all of the input strings, which would be expensive. Maybe we should just go back to vec![] or Vec::new() for this one?
There was a problem hiding this comment.
Ah, yes. Also since v can contain empty strings, we do not even have a lower bound for an initial capacity, so I'll revert to Vec::new() for values.
andygrove
left a comment
There was a problem hiding this comment.
LGTM - thanks @ntrinquier !
No description provided.