Skip to content

Improve Array Logical Nullability#4691

Merged
tustvold merged 1 commit intoapache:masterfrom
tustvold:logical-nullability
Aug 14, 2023
Merged

Improve Array Logical Nullability#4691
tustvold merged 1 commit intoapache:masterfrom
tustvold:logical-nullability

Conversation

@tustvold
Copy link
Contributor

@tustvold tustvold commented Aug 12, 2023

Which issue does this PR close?

Closes #4690
Closes #4689
Closes #4688
Closes #2687
Closes #4616
Relates to #4565
Relates to #4835

Rationale for this change

Various types have nullability that is not accurately reflected in their reported NullBuffer, this causes incorrect behaviour for various kernels that have special handling of the null buffers.

What changes are included in this PR?

Adds two methods to Array:

  • Array::logical_nulls - which computes the logical null mask
  • Array::is_nullable - which only returns false if an array cannot contain nulls

This logical null mask can then be used in the places where the logic relies on logical nullability

Are there any user-facing changes?

@tustvold tustvold added the api-change Changes to the arrow API label Aug 12, 2023
@github-actions github-actions bot added parquet Changes to the parquet crate arrow Changes to the arrow crate labels Aug 12, 2023
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We no longer need to special case NullArray, whilst also now correctly handling DictionaryArray with value nulls

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test for #4616

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I debated calling this method is_required instead, but figured it was better to stick with consistent terminology, and this matches the method on Field

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a breaking change, although I expect the impact to minimal. I debated making is_null and friends always return logical nullability, but this would have been inconsistent with ArrayData and generally seemed like it might just cause more confusion

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a test for #4688

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test for #2687

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test for #4690

@tustvold tustvold force-pushed the logical-nullability branch from ab28cfe to 721d65e Compare August 12, 2023 14:03
@tustvold tustvold force-pushed the logical-nullability branch from 721d65e to 55a420b Compare August 12, 2023 14:04
self.as_ref().nulls()
}

fn logical_nulls(&self) -> Option<NullBuffer> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered having this eagerly constructed and stored on the arrays, but I felt this would be more surprising as it would make the arrays diverge from the arrow specification. Whilst this does raise the prospect of computing this NullBuffer multiple times, the cases where this is required are fairly rare. We can always optimise in future should this become a problem

@alamb alamb changed the title Logical Nullability Improve Array Logical Nullability Aug 13, 2023

/// Returns the logical null buffer of this array if any
///
/// In most cases this will be the same as [`Array::nulls`], except for:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Array::nulls should be deprecated and a non-deprecated version physical_nulls should be introduced. Users will misuse nulls because:

  • most people will read over the small note within the doc string
  • nulls is easier to find than logical_nulls
  • there are two options of "which null", and we offer an implicit default (because it is not prefixed with a semantic) that is clearly the wrong one for many use cases

Copy link
Contributor Author

@tustvold tustvold Aug 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whilst I do sort of agree, I'm not sure this wouldn't just cause more confusion, as this isn't a distinction found in the arrow spec. Would we also deprecate is_null, is_valid, etc... I think the current API hasn't been a major issue thus far, as the cases where it matters are rare.

FWIW once we have StringView I think there is basically no reason to use DictionaryArray or RunArray

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api-change Changes to the arrow API arrow Changes to the arrow crate parquet Changes to the parquet crate

Projects

None yet

2 participants