Fix Null Mask Handling in ArrayData, UnionArray, and MapArray#1589
Fix Null Mask Handling in ArrayData, UnionArray, and MapArray#1589alamb merged 5 commits intoapache:masterfrom
ArrayData, UnionArray, and MapArray#1589Conversation
|
|
||
| // Calculate a list child's logical bitmap/buffer | ||
| #[inline] | ||
| fn logical_list_bitmap<OffsetSize: OffsetSizeTrait>( |
There was a problem hiding this comment.
We no longer need this code, as we simply skip over null slices
There was a problem hiding this comment.
What you mean we skip over null slices?
There was a problem hiding this comment.
We only call equal_ranges on the non null ranges
| rhs_start: usize, | ||
| len: usize, | ||
| ) -> bool { | ||
| utils::base_equal(lhs, rhs) |
There was a problem hiding this comment.
utils::base_equal does two things:
- Compares that the data types are equal
- Compares that the lengths are equal
The first is not necessarily correct, as MapArrays are agnostic to field names. The second is also not necessarily correct in the presence of non-empty null slices. I therefore decided to just remove this and simplify the code.
Ultimately it isn't clear to me what this check could catch that wouldn't represent some violation of a variant somewhere else. If a nested type is inconsistent with the length, type, etc... of its parent then you've got UB anyway
There was a problem hiding this comment.
Actually equal already does utils::base_equal. I guess it is okay to remove it from here, although I'm not pretty sure.
arrow/src/array/equal/utils.rs
Outdated
| ) -> bool { | ||
| let lhs_null_count = count_nulls(lhs_nulls, lhs_start, len); | ||
| let rhs_null_count = count_nulls(rhs_nulls, rhs_start, len); | ||
| let lhs_null_count = count_nulls(lhs_nulls, lhs_start + lhs.offset(), len); |
There was a problem hiding this comment.
equal_bits a few lines is correctly taking into account the offset, but this line wasn't => excitement
| Bitmap::from(Buffer::from(vec![0b11111111; ceil])) | ||
| }); | ||
| match parent_data.data_type() { | ||
| DataType::List(_) | DataType::Map(_, _) => Some(logical_list_bitmap::<i32>( |
There was a problem hiding this comment.
No longer needed, as null slices are skipped over in parent
|
More stuff is broken 😭 |
Various UnionArray fixes (apache#1598) (apache#1596) (apache#1591) (apache#1590) Fix handling of null masks in ArrayData equality (apache#1599)
|
I will continue to add tests, and file bugs for the various things this fixes, but I think the core is ready for review now |
Codecov Report
@@ Coverage Diff @@
## master #1589 +/- ##
==========================================
+ Coverage 82.88% 83.04% +0.16%
==========================================
Files 193 193
Lines 55307 55276 -31
==========================================
+ Hits 45839 45902 +63
+ Misses 9468 9374 -94
Continue to review full report at Codecov.
|
|
Thank you @tustvold for fixing these. I will begin to review this today. |
| 0 => { | ||
| let slot = slot.as_any().downcast_ref::<Int32Array>().unwrap(); | ||
| assert!(!union.is_null(i)); | ||
| assert!(!slot.is_null(0)); |
There was a problem hiding this comment.
I think it is good to check null in the current slot, but does union.is_null(index) not work anymore? Can we check both?
There was a problem hiding this comment.
The nullability is only a property of the slot, and not the Union itself, at least that is my interpretation of the specification
There was a problem hiding this comment.
Says
Unlike other data types, unions do not have their own validity bitmap. Instead, the nullness of each slot is determined exclusively by the child arrays which are composed to create the union.
Which seems consistent 👍
However, it seems like UnionArray doesn't override is_null to take this into account. Filed #1625 to track
| let mut builder = UnionBuilder::new_sparse(2); | ||
| builder.append::<Float32Type>("a", 1.0).unwrap(); | ||
| let err = builder.append::<Int32Type>("a", 1).unwrap_err().to_string(); | ||
| assert!(err.contains("Attempt to write col \"a\" with type Int32 doesn't match existing type Float32"), "{}", err); |
| /// A builder for the bitmap if required (for Sparse Unions) | ||
| bitmap_builder: Option<BooleanBufferBuilder>, | ||
| bitmap_builder: BooleanBufferBuilder, |
There was a problem hiding this comment.
The comment seems out-of-date now?
arrow/src/array/builder.rs
Outdated
| .null_bit_buffer(bitmap_builder.finish()); | ||
| // .build(); |
There was a problem hiding this comment.
// .build(); seems redundant and can be removed?
| match bitmap_builder { | ||
| Some(mut bb) => arr_data_builder | ||
| .null_bit_buffer(bb.finish()) | ||
| .build_unchecked(), | ||
| None => arr_data_builder.build_unchecked(), | ||
| } |
| /// Can contain a null bitmask | ||
| pub can_contain_null_mask: bool, |
There was a problem hiding this comment.
Although the C++ structure definition doesn't have this, I think it is fine to have it here.
| let lhs_null_count = count_nulls(lhs.null_buffer(), lhs_start + lhs.offset(), len); | ||
| let rhs_null_count = count_nulls(rhs.null_buffer(), rhs_start + rhs.offset(), len); |
There was a problem hiding this comment.
Hmm, do we assume offset is zero previously?
There was a problem hiding this comment.
Yes, in lots of places 😅
| rhs_start: usize, | ||
| len: usize, | ||
| ) -> bool { | ||
| utils::base_equal(lhs, rhs) |
There was a problem hiding this comment.
Actually equal already does utils::base_equal. I guess it is okay to remove it from here, although I'm not pretty sure.
viirya
left a comment
There was a problem hiding this comment.
This looks good to me and makes some places more clear and simpler. It is better to have more eyes on the changes too.
|
I will try and review this more carefully tomorrow. cc @jhorstmann who might also be interested |
|
Looks good. This is some really gnarly code and last time I tried to improve it I gave up since it was never clear whether the start variable already include offsets or where the offsets have to be applied. I have two tests in #1499 where |
|
After the suggestion above, my remaining test failure was because of the |
Yes I have found the handling of this field to be very inconsistent, imo it shouldn't be possible to associate a null bitmask with a field that says it isn't nullable. Would you be able to create a ticket?
I presume you mean that the previous behaviour was wrong, and is now fixed, and not that this has broken behaviour that was previously correct?
I think the logic here will consider them not equal, are you suggesting they should be considered equal? |
I created #1611 for validating the
Sorry, my description probably was not very clear. In one test I had to compare individual elements of a ListArray because the previous equality implementation did not work with different offsets. With the changes from this PR that test can now use a simple The other test with Whether we should change the comparision semantic to exclude the nullable flag is probably out of scope for this PR, I don't have a strong opinion either way. A big "Thank You" for diving into this code and fixing it! |
alamb
left a comment
There was a problem hiding this comment.
I gave this code a thorough review -- I am not the biggest expert in the low level details of what is going on internally but the changes made sense to me.
I also carefully checked all test changes and they looked good, so assuming the tests have good coverage I feel confident in this change.
Of course, our work to work to clean up nullness is likely not done with this PR but it I think things are better.
My only concerns are:
-
the change to
append_nullthat requires a field name (as that is a public facing API). However, we can always change it in the future I suppose. -
#1625 seems non bad -- I think we could remove several changes in this PR (so it could call
union.is_null()rather thanslot.is_null()if we fixed that
Thank you @tustvold @viirya and @jhorstmann for driving this
cc @bjchambers who I think also has had much fun with offsets and @nevi-me who spent a bunch of time with this code I think
| 0 => { | ||
| let slot = slot.as_any().downcast_ref::<Int32Array>().unwrap(); | ||
| assert!(!union.is_null(i)); | ||
| assert!(!slot.is_null(0)); |
There was a problem hiding this comment.
Says
Unlike other data types, unions do not have their own validity bitmap. Instead, the nullness of each slot is determined exclusively by the child arrays which are composed to create the union.
Which seems consistent 👍
However, it seems like UnionArray doesn't override is_null to take this into account. Filed #1625 to track
| slots: usize, | ||
| /// A builder for the bitmap if required (for Sparse Unions) | ||
| bitmap_builder: Option<BooleanBufferBuilder>, | ||
| /// A builder for the null bitmap |
| }; | ||
| self.len += 1; | ||
| Ok(()) | ||
| pub fn append_null<T: ArrowPrimitiveType>(&mut self, type_name: &str) -> Result<()> { |
There was a problem hiding this comment.
I realize from an implementation perspective why you chose to require appending a null by field name, but it seems kind of a messy API -- what do you think about potentially appending a null to the first child or something by default?
Perhaps something to think about for a follow on PR
There was a problem hiding this comment.
The short answer is that UnionArray nulls are typed and so two arrays are not equal if they have nulls in different child arrays.
There is also the case of appending a null to an empty UnionArray...
There was a problem hiding this comment.
Makes sense -- tried to encode that insight in comments as part of #1628
| builder.append::<Int64Type>("c", 3).unwrap(); | ||
| builder.append::<Int32Type>("a", 10).unwrap(); | ||
| builder.append_null().unwrap(); | ||
| builder.append_null::<Int32Type>("a").unwrap(); |
There was a problem hiding this comment.
it is strange to require a field name to append nulls -- I left a comment on how to improve things perhaps in the future
| // Check that the data layout conforms to the spec | ||
| let layout = layout(&self.data_type); | ||
|
|
||
| if !layout.can_contain_null_mask && self.null_bitmap.is_some() { |
|
|
||
| let lhs_null_count = count_nulls(lhs_nulls, lhs_start, len); | ||
| let rhs_null_count = count_nulls(rhs_nulls, rhs_start, len); | ||
| let lhs_null_count = count_nulls(lhs.null_buffer(), lhs_start + lhs.offset(), len); |
There was a problem hiding this comment.
it makes a lot more sense to look at the buffer's nulls directly
| let filtered = c.as_any().downcast_ref::<UnionArray>().unwrap(); | ||
|
|
||
| let mut builder = UnionBuilder::new_dense(1); | ||
| let mut builder = UnionBuilder::new_sparse(2); |
There was a problem hiding this comment.
I assume this was changed because the name of the test is test_filter_union_array_sparse_with_nulls but we were using a dense array previously
|
After some thought, I would like to merge this PR to handle all the undefined behavior bugs While there are more potential follow on things to do to improve union and null handling, this seems like a significant improvement. Unless there are any objections, I will plan to merge it later today |
|
🚀 |
ArrayData, UnionArray, and MapArray
* Update version to 13.0.0 * Draft changelog * updates * moar * cleanup * remove duplicated entry, update for #1589 * Final updates
Which issue does this PR close?
Closes #1599
Closes #1598
Closes #1596
Closes #1590
Closes #1591
Closes #626
Rationale for this change
See tickets, unfortunately these changes all ended up interacting with each other and so I've lumped them into a single PR.
What changes are included in this PR?
ArrayData Equality
The currently ArrayData equality code attempts to construct a logical bitmask, that propogates the nullability of the parent down to its children. Certain arrays, however, may have a different number of children than their parent, e.g. dense union array or list arrays. This requires recomputing the mask to factor this in. In some cases, this was computing a mask that took account of the offset of the child, but in others it was not. I also don't think any of the codepaths took account for if the child also contained an offset.
Taking a step back, none of the codepaths are actually comparing runs of nulls, instead they fall back to iterating per value if the array contains nulls. For example,
As a result we don't actually need this complexity, and by removing it we avoid ambiguity about what offset the null mask has, what the null mask is, etc...
UnionArray Nullability
Whilst working on this I realised that UnionArray's don't actually have a parent validity mask, as this change would have implications for the equality logic I lumped it into this PR.
Are there any user-facing changes?
Array equality should be more correct, and UnionArrays slightly less broken