Skip to content

ARROW-12659: [C++][Compute] Support is_valid as a guarantee#10253

Closed
bkietz wants to merge 3 commits intoapache:masterfrom
bkietz:12659-Support-SimplifyWithGuara
Closed

ARROW-12659: [C++][Compute] Support is_valid as a guarantee#10253
bkietz wants to merge 3 commits intoapache:masterfrom
bkietz:12659-Support-SimplifyWithGuara

Conversation

@bkietz
Copy link
Member

@bkietz bkietz commented May 5, 2021

This allows us to take advantage of a partition expression like f32 >= 1 and f32 <= 2 and is_valid(f32), which could be generated for a row group with no null values.

Note this implies that f32 >= 1 and f32 <= 2 is a guarantee that "f32 is between 1 and 2, or is null". This is probably fine vs the more pedantic (f32 >= 1 and f32 <= 2) or is_null(f32), but requires a potentially problematic addendum to the contract of guarantees: guarantees do not apply to null arguments unless an explicit is_null or is_valid is included

@bkietz bkietz requested a review from lidavidm May 5, 2021 20:12
@github-actions
Copy link

github-actions bot commented May 5, 2021

Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Regarding guarantees, I think assuming nullability by default is fine, though we should perhaps be more thorough with testing with known null/non-null fields downstream in Datasets.

@bkietz
Copy link
Member Author

bkietz commented May 5, 2021

I think assuming nullability by default is fine

where this gets semantically sticky is: we use i32 == 64 as a guarantee that it's always 64 and never null

@lidavidm
Copy link
Member

lidavidm commented May 6, 2021

Ah. I guess we should document the semantics we expect - right now it seems we're mostly translating implicitly what people expect from Parquet/Hive which is fine, but imprecise.

@bkietz bkietz force-pushed the 12659-Support-SimplifyWithGuara branch from 2225342 to 0bc5b4d Compare May 6, 2021 15:35
min = maybe_min.MoveValueUnsafe();
max = maybe_max.MoveValueUnsafe();
if (min->Equals(max)) {
return compute::equal(std::move(field_expr), compute::literal(std::move(min)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to add an is_valid here too if applicable?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Equality expressions will be picked up by the ExtractKnownFieldValues pass, which assumes the field will min and never null (questionably correct)

Longer term: I think we need to be more rigorous with the guarantee contract. Specifically: any guarantee should be usable as a filter, and in that case will evaluate to an array containing only true (and no nulls).

Data, Guarantee Filter Evaluation of guarantee on Data
[2, 2], i32 == 2 i32 == 2 [true, true] can simplify to true
[2, null], i32 == 2 or is_null(i32) i32 == 2 [true, null] the second row should be excluded from the filter

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me. I think a row group with [2, null] right now might actually get filtered out accidentally if your filter is i32 != 2, since the guarantee will simply be i32 == 2

compute::greater_equal(field_expr, compute::literal(std::move(min)));
auto upper_bound = compute::less_equal(field_expr, compute::literal(std::move(max)));

if (statistics->null_count() == 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this meant to be != 0?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whoops

compute::CastOptions::Safe(std::move(to_type)));
}

Expression invert(Expression argument) { return call("invert", {std::move(argument)}); }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this compute::not_?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, forgot about that factory

@bkietz
Copy link
Member Author

bkietz commented Jul 16, 2021

Closing for now

@bkietz bkietz closed this Jul 16, 2021
jonkeane pushed a commit that referenced this pull request Apr 21, 2022
This rebases #10253 and fixes it up to also address ARROW-15312, including a regression test.

This refactors how inequalities, is_valid, and is_null are treated in expression simplification, and updates the guarantees that the Parquet/Datasets emits for row groups to properly reflect nullability.

Closes #12891 from lidavidm/arrow-12659

Lead-authored-by: David Li <[email protected]>
Co-authored-by: Benjamin Kietzman <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Jonathan Keane <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants