ARROW-12659: [C++][Compute] Support is_valid as a guarantee#10253
ARROW-12659: [C++][Compute] Support is_valid as a guarantee#10253bkietz wants to merge 3 commits intoapache:masterfrom
Conversation
lidavidm
left a comment
There was a problem hiding this comment.
LGTM. Regarding guarantees, I think assuming nullability by default is fine, though we should perhaps be more thorough with testing with known null/non-null fields downstream in Datasets.
where this gets semantically sticky is: we use |
|
Ah. I guess we should document the semantics we expect - right now it seems we're mostly translating implicitly what people expect from Parquet/Hive which is fine, but imprecise. |
2225342 to
0bc5b4d
Compare
| min = maybe_min.MoveValueUnsafe(); | ||
| max = maybe_max.MoveValueUnsafe(); | ||
| if (min->Equals(max)) { | ||
| return compute::equal(std::move(field_expr), compute::literal(std::move(min))); |
There was a problem hiding this comment.
Do we want to add an is_valid here too if applicable?
There was a problem hiding this comment.
Equality expressions will be picked up by the ExtractKnownFieldValues pass, which assumes the field will min and never null (questionably correct)
Longer term: I think we need to be more rigorous with the guarantee contract. Specifically: any guarantee should be usable as a filter, and in that case will evaluate to an array containing only true (and no nulls).
| Data, Guarantee | Filter | Evaluation of guarantee on Data | |
|---|---|---|---|
| [2, 2], i32 == 2 | i32 == 2 | [true, true] | can simplify to true |
| [2, null], i32 == 2 or is_null(i32) | i32 == 2 | [true, null] | the second row should be excluded from the filter |
There was a problem hiding this comment.
Sounds good to me. I think a row group with [2, null] right now might actually get filtered out accidentally if your filter is i32 != 2, since the guarantee will simply be i32 == 2…
| compute::greater_equal(field_expr, compute::literal(std::move(min))); | ||
| auto upper_bound = compute::less_equal(field_expr, compute::literal(std::move(max))); | ||
|
|
||
| if (statistics->null_count() == 0) { |
| compute::CastOptions::Safe(std::move(to_type))); | ||
| } | ||
|
|
||
| Expression invert(Expression argument) { return call("invert", {std::move(argument)}); } |
There was a problem hiding this comment.
yes, forgot about that factory
|
Closing for now |
This rebases #10253 and fixes it up to also address ARROW-15312, including a regression test. This refactors how inequalities, is_valid, and is_null are treated in expression simplification, and updates the guarantees that the Parquet/Datasets emits for row groups to properly reflect nullability. Closes #12891 from lidavidm/arrow-12659 Lead-authored-by: David Li <[email protected]> Co-authored-by: Benjamin Kietzman <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Jonathan Keane <[email protected]>
This allows us to take advantage of a partition expression like
f32 >= 1 and f32 <= 2 and is_valid(f32), which could be generated for a row group with no null values.Note this implies that
f32 >= 1 and f32 <= 2is a guarantee that "f32 is between 1 and 2, or is null". This is probably fine vs the more pedantic(f32 >= 1 and f32 <= 2) or is_null(f32), but requires a potentially problematic addendum to the contract of guarantees: guarantees do not apply to null arguments unless an explicit is_null or is_valid is included