Remove ArrayData from Array (#3880)#4061
Conversation
| } | ||
| } | ||
|
|
||
| /// Returns byte width for binary value_offset buffer spec. |
There was a problem hiding this comment.
This logic had to be reworked to not rely on a non-zero ArrayData::offset to trigger the re-encoding logic
|
|
||
| for i in 0..sliced_array.len() { | ||
| if bit_util::get_bit(&null_bits, sliced_array.offset() + i) { | ||
| if bit_util::get_bit(&null_bits, 1 + i) { |
There was a problem hiding this comment.
This change is necessary because the array no longer "keeps" the offset
There was a problem hiding this comment.
As above I think maybe it would make sense to at least deprecate Array::offset as it is always 0 now (and as part of the deprecation message explain why)
| /// | ||
| /// ``` | ||
| /// use arrow_array::{Array, Int32Array}; | ||
| /// use arrow_array::{Array, BooleanArray}; |
There was a problem hiding this comment.
Switch to BooleanArray which still "preserves" the offset
There was a problem hiding this comment.
I think we should deprecate this API for the reasons explained abve
| #[should_panic( | ||
| expected = "FixedSizeListArray child array length should be a multiple of 3" | ||
| )] | ||
| #[should_panic(expected = "assertion failed: (offset + length) <= self.len()")] |
There was a problem hiding this comment.
Given this is a sanity check, and implies an invalid ArrayData, I'm not sure the complexity of having a nicer error message is worth it
| if *len > 0 { | ||
| // check that child data is multiple of length | ||
| assert_eq!( | ||
| values.len() % *len as usize, |
There was a problem hiding this comment.
This test is redundant with the slice below
| assert_eq!( | ||
| data.buffers().len(), | ||
| 0, | ||
| "FixedSizeListArray data should not contain a buffer for value offsets" | ||
| ); | ||
| assert_eq!( | ||
| data.child_data().len(), | ||
| 1, | ||
| "FixedSizeListArray should contain a single child array (values array)" | ||
| ); |
There was a problem hiding this comment.
These tests are redundant as they would imply an invalid ArrayData
| ); | ||
| assert_eq!(2, fixed_size_binary_array.len()); | ||
| assert_eq!(5, fixed_size_binary_array.value_offset(0)); | ||
| assert_eq!(0, fixed_size_binary_array.value_offset(0)); |
There was a problem hiding this comment.
The slice pushdown results in this changing
alamb
left a comment
There was a problem hiding this comment.
Very impressive @tustvold -- some said it couldn't be done but seeing your string of incremental PRs over the last few months to remove ArrayData with the strongly typed abstractions has been inspirational
I know it seemingly took longer than it might have without the constraints of existing users, I think this change will be that much more valuable over the long term.
So thank you again
| self.values.is_empty() | ||
| } | ||
|
|
||
| /// Returns a zero-copy slice of this array with the indicated offset and length. |
| Self { | ||
| data_type: self.data_type.clone(), | ||
| keys: self.keys.slice(offset, length), | ||
| values: self.values.clone(), |
There was a problem hiding this comment.
the values aren't sliced because the keys may still refer to any logical value in the dictionary
| } | ||
|
|
||
| fn get_buffer_memory_size(&self) -> usize { | ||
| self.dictionary.get_buffer_memory_size() |
There was a problem hiding this comment.
should this function and the one below it also account for the self.values?
There was a problem hiding this comment.
self.values will get counted by self.dictionary, self.values is just a downcast borrow of the inner array
| /// All elements have the same length as the array is a fixed size. | ||
| #[inline] | ||
| pub fn value_length(&self) -> i32 { | ||
| self.length |
There was a problem hiding this comment.
I renamed length to value_length to avoid confusion with len
| self.len == 0 | ||
| } | ||
|
|
||
| fn offset(&self) -> usize { |
There was a problem hiding this comment.
what is the meaning of offset in this new world where the offset is stored on the underlying typed array? Maybe we should remove it from the Array trait 🤔
There was a problem hiding this comment.
Yeah, it's certainly doesn't have clear semantics anymore. As this PR is already quite large I'll remove it in a follow on PR
|
|
||
| for i in 0..sliced_array.len() { | ||
| if bit_util::get_bit(&null_bits, sliced_array.offset() + i) { | ||
| if bit_util::get_bit(&null_bits, 1 + i) { |
There was a problem hiding this comment.
As above I think maybe it would make sense to at least deprecate Array::offset as it is always 0 now (and as part of the deprecation message explain why)
| /// Returns the total number of bytes of memory occupied by the buffers owned by this [MapArray]. | ||
| fn get_buffer_memory_size(&self) -> usize { | ||
| self.data.get_buffer_memory_size() | ||
| let mut size = self.entries.get_buffer_memory_size(); |
There was a problem hiding this comment.
What about the other fields like keys, values, etc?
There was a problem hiding this comment.
keys and values are stored within entries, they're the children of the StructArray. Perhaps it would be worth making entries StructArray 🤔
| /// | ||
| /// ``` | ||
| /// use arrow_array::{Array, Int32Array}; | ||
| /// use arrow_array::{Array, BooleanArray}; |
There was a problem hiding this comment.
I think we should deprecate this API for the reasons explained abve
| // TODO: Slice buffers directly (#3880) | ||
| self.data.slice(offset, length).into() | ||
| let (offsets, fields) = match self.offsets.as_ref() { | ||
| Some(offsets) => (Some(offsets.slice(offset, length)), self.fields.clone()), |
There was a problem hiding this comment.
I was confused by this at first -- some comments here (and/or in the self.offsets I think would help
Specifically I think if self.offsets is SOme(..) that means DenseUnion and the offsets point to the correct children so slicing the offsets but not the actual children makes sense
if self.offsets is None, then it is a sparse union, so all the children line up and we need to slice them individually.
I had to re-read https://arrow.apache.org/docs/format/Columnar.html#union-layout to make sure
| _ => unreachable!(), | ||
| } | ||
| } | ||
| let offsets = match start_offset.as_usize() { |
There was a problem hiding this comment.
could you please add some comments about what this is doing (namely that it is "rebasing" the offsets to all be zero based)? That would be easy to overlook and was somewhat encoded in the method names previously
|
As part of the completion of this project, @tustvold could you write up some description of the relationship between Buffer / PrimitveBuffer / ArrayData, etc? Perhaps we can take inspiration (or copy/modify) the guide from: If you think it makes sense I will file a ticket |
I think this is definitely valuable, and will mesh nicely with #3879 |
Filed #4071 |
Which issue does this PR close?
Closes #3880
Rationale for this change
See ticket
What changes are included in this PR?
Are there any user-facing changes?
Yes, this inherently changes the way that slicing behaves
It also adds a statically typed BooleanArray::slice that somehow was missed in #3930