Skip to content

Comments

Remove ArrayData from Array (#3880)#4061

Merged
tustvold merged 4 commits intoapache:masterfrom
tustvold:remove-array-data-from-arrays
Apr 12, 2023
Merged

Remove ArrayData from Array (#3880)#4061
tustvold merged 4 commits intoapache:masterfrom
tustvold:remove-array-data-from-arrays

Conversation

@tustvold
Copy link
Contributor

Which issue does this PR close?

Closes #3880

Rationale for this change

See ticket

What changes are included in this PR?

Are there any user-facing changes?

Yes, this inherently changes the way that slicing behaves

It also adds a statically typed BooleanArray::slice that somehow was missed in #3930

@github-actions github-actions bot added the arrow Changes to the arrow crate label Apr 12, 2023
}
}

/// Returns byte width for binary value_offset buffer spec.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic had to be reworked to not rely on a non-zero ArrayData::offset to trigger the re-encoding logic


for i in 0..sliced_array.len() {
if bit_util::get_bit(&null_bits, sliced_array.offset() + i) {
if bit_util::get_bit(&null_bits, 1 + i) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is necessary because the array no longer "keeps" the offset

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above I think maybe it would make sense to at least deprecate Array::offset as it is always 0 now (and as part of the deprecation message explain why)

///
/// ```
/// use arrow_array::{Array, Int32Array};
/// use arrow_array::{Array, BooleanArray};
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switch to BooleanArray which still "preserves" the offset

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should deprecate this API for the reasons explained abve

#[should_panic(
expected = "FixedSizeListArray child array length should be a multiple of 3"
)]
#[should_panic(expected = "assertion failed: (offset + length) <= self.len()")]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given this is a sanity check, and implies an invalid ArrayData, I'm not sure the complexity of having a nicer error message is worth it

if *len > 0 {
// check that child data is multiple of length
assert_eq!(
values.len() % *len as usize,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is redundant with the slice below

Comment on lines -166 to -175
assert_eq!(
data.buffers().len(),
0,
"FixedSizeListArray data should not contain a buffer for value offsets"
);
assert_eq!(
data.child_data().len(),
1,
"FixedSizeListArray should contain a single child array (values array)"
);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests are redundant as they would imply an invalid ArrayData

);
assert_eq!(2, fixed_size_binary_array.len());
assert_eq!(5, fixed_size_binary_array.value_offset(0));
assert_eq!(0, fixed_size_binary_array.value_offset(0));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The slice pushdown results in this changing

@tustvold tustvold added the api-change Changes to the arrow API label Apr 12, 2023
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very impressive @tustvold -- some said it couldn't be done but seeing your string of incremental PRs over the last few months to remove ArrayData with the strongly typed abstractions has been inspirational

I know it seemingly took longer than it might have without the constraints of existing users, I think this change will be that much more valuable over the long term.

So thank you again

self.values.is_empty()
}

/// Returns a zero-copy slice of this array with the indicated offset and length.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

Self {
data_type: self.data_type.clone(),
keys: self.keys.slice(offset, length),
values: self.values.clone(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the values aren't sliced because the keys may still refer to any logical value in the dictionary

}

fn get_buffer_memory_size(&self) -> usize {
self.dictionary.get_buffer_memory_size()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this function and the one below it also account for the self.values?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.values will get counted by self.dictionary, self.values is just a downcast borrow of the inner array

/// All elements have the same length as the array is a fixed size.
#[inline]
pub fn value_length(&self) -> i32 {
self.length
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a bug, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I renamed length to value_length to avoid confusion with len

self.len == 0
}

fn offset(&self) -> usize {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the meaning of offset in this new world where the offset is stored on the underlying typed array? Maybe we should remove it from the Array trait 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's certainly doesn't have clear semantics anymore. As this PR is already quite large I'll remove it in a follow on PR


for i in 0..sliced_array.len() {
if bit_util::get_bit(&null_bits, sliced_array.offset() + i) {
if bit_util::get_bit(&null_bits, 1 + i) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above I think maybe it would make sense to at least deprecate Array::offset as it is always 0 now (and as part of the deprecation message explain why)

/// Returns the total number of bytes of memory occupied by the buffers owned by this [MapArray].
fn get_buffer_memory_size(&self) -> usize {
self.data.get_buffer_memory_size()
let mut size = self.entries.get_buffer_memory_size();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about the other fields like keys, values, etc?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keys and values are stored within entries, they're the children of the StructArray. Perhaps it would be worth making entries StructArray 🤔

///
/// ```
/// use arrow_array::{Array, Int32Array};
/// use arrow_array::{Array, BooleanArray};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should deprecate this API for the reasons explained abve

// TODO: Slice buffers directly (#3880)
self.data.slice(offset, length).into()
let (offsets, fields) = match self.offsets.as_ref() {
Some(offsets) => (Some(offsets.slice(offset, length)), self.fields.clone()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was confused by this at first -- some comments here (and/or in the self.offsets I think would help

Specifically I think if self.offsets is SOme(..) that means DenseUnion and the offsets point to the correct children so slicing the offsets but not the actual children makes sense

if self.offsets is None, then it is a sparse union, so all the children line up and we need to slice them individually.

I had to re-read https://arrow.apache.org/docs/format/Columnar.html#union-layout to make sure

_ => unreachable!(),
}
}
let offsets = match start_offset.as_usize() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you please add some comments about what this is doing (namely that it is "rebasing" the offsets to all be zero based)? That would be easy to overlook and was somewhat encoded in the method names previously

@alamb
Copy link
Contributor

alamb commented Apr 12, 2023

As part of the completion of this project, @tustvold could you write up some description of the relationship between Buffer / PrimitveBuffer / ArrayData, etc?

Perhaps we can take inspiration (or copy/modify) the guide from:
https://jorgecarleitao.github.io/arrow2/main/guide/

If you think it makes sense I will file a ticket

@tustvold
Copy link
Contributor Author

If you think it makes sense I will file a ticket

I think this is definitely valuable, and will mesh nicely with #3879

@alamb
Copy link
Contributor

alamb commented Apr 12, 2023

If you think it makes sense I will file a ticket

Filed #4071

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api-change Changes to the arrow API arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

First-Class Array Abstractions

2 participants