ARROW-12659: [C++] Support is_valid as a guarantee#12891
ARROW-12659: [C++] Support is_valid as a guarantee#12891lidavidm wants to merge 14 commits intoapache:masterfrom
Conversation
|
|
|
Hmm, it's crashing because |
|
This needs to return a datum of the right type: arrow/cpp/src/arrow/compute/exec/expression.cc Lines 632 to 640 in 63d2a9c |
wjones127
left a comment
There was a problem hiding this comment.
Nice to have this picked up again. One question and two minor comments.
There was a problem hiding this comment.
Do we want to provide more context here?
There was a problem hiding this comment.
I turned this into a docstring.
There was a problem hiding this comment.
nit: I don't love that this is different from below function names by only a single character. Maybe something like ExtractSingleFieldvalue or something similar?
There was a problem hiding this comment.
Do we want an arm for statitics->null_count() == row_count? In which case, we would just return is_null(field_expr)?
There was a problem hiding this comment.
I believe this is handled at
arrow/cpp/src/arrow/dataset/file_parquet.cc
Lines 118 to 121 in fae66cb
Confusingly enough num_values does not include nulls. See this test which covers this already: https://github.com/apache/arrow/pull/12891/files#diff-d88654840d0432223c1617e8fd9289db0f4e6fff6b34e9f062861ef8eec724fcR256
This writes each record batch to its own row group, so the test would fail if we didn't generate the proper guarantee for the all-null row group.
There was a problem hiding this comment.
Ah thanks for the pointer on num_values.
commit f482049 Author: Benjamin Kietzman <[email protected]> Date: Fri May 7 17:22:01 2021 -0400 sketch of 'correct' nullable caveats in inequality guarantees commit 0bc5b4d Author: Benjamin Kietzman <[email protected]> Date: Thu May 6 10:28:45 2021 -0400 add is_valid() guarantee to column statistics expr commit 212847e Author: Benjamin Kietzman <[email protected]> Date: Wed May 5 16:02:30 2021 -0400 ARROW-12659: [C++][Compute] Support is_valid as a guarantee
|
CC @pitrou or @westonpace, any comments here? |
cpp/src/arrow/util/vector.h
Outdated
| auto new_end = | ||
| std::remove_if(values.begin(), values.end(), std::forward<Predicate>(predicate)); | ||
| std::vector<T> FilterVector(std::vector<T> values, Predicate&& predicate, | ||
| std::vector<T>* filtered_out = NULLPTR) { |
There was a problem hiding this comment.
I may be missing something, but it does not seem this third argument is used anywhere?
There was a problem hiding this comment.
It's not used indeed. But the change is still needed since it actually inverts what FilterVector does! (The current FilterVector is backwards of what you would expect…)
| auto single_value = compute::equal(field_expr, compute::literal(std::move(min))); | ||
|
|
||
| if (statistics->null_count() == 0) { | ||
| return compute::and_(single_value, compute::is_valid(field_expr)); |
There was a problem hiding this comment.
Is it useful to add is_valid here? If a value is equal to min it implies it is valid.
There was a problem hiding this comment.
Removing this does break a test, but it's because right now i64 > 1 doesn't cause is_null(i64) to simplify - we need to be a little smarter here. Will fix that.
| min = maybe_min.MoveValueUnsafe(); | ||
| max = maybe_max.MoveValueUnsafe(); | ||
|
|
||
| compute::Expression range; |
There was a problem hiding this comment.
This variable doesn't seem used?
| } | ||
|
|
||
| if (guarantee.cmp & cmp_rhs_bound) { | ||
| // x > 1, x >= 1, x != 1 cannot use guarantee x >= 3 |
There was a problem hiding this comment.
This is contradicted by the next comment below, did you make a mistake?
There was a problem hiding this comment.
Perhaps (with rhs being 1 and bound being 0):
// x > 1, x >= 1, x != 1 cannot use guarantee x >= 0
// (where `guarantee.cmp` is GREATER_EQUAL, `cmp_rhs_bound` is GREATER)| return expr; | ||
| } | ||
|
|
||
| if (guarantee.cmp & cmp_rhs_bound) { |
There was a problem hiding this comment.
This is cryptic, what is this condition supposed to imply?
There was a problem hiding this comment.
It's unclear to me after writing out some examples (I don't think it handles all cases right either as you note with the incorrect comment)
There was a problem hiding this comment.
I'll try to replace this
There was a problem hiding this comment.
Or actually, I straight up just don't understand this…will try to clear this up…
There was a problem hiding this comment.
Alright, added comments here and for the other feedback to try to clarify things. This conditional is surprisingly subtle…
| Comparison::type cmp; | ||
| const FieldRef& target; | ||
| const Datum& bound; |
There was a problem hiding this comment.
Would you like to add a comment explaining what the terms are? Is it target <cmp> bound or bound <cmp> target?
| Comparison::type cmp; | ||
| const FieldRef& target; | ||
| const Datum& bound; | ||
| bool nullable; |
There was a problem hiding this comment.
Is this "the target can be null"?
| } | ||
|
|
||
| if (*cmp & Comparison::GetFlipped(cmp_rhs_bound)) { | ||
| // x > 1, x >= 1, x != 1 guaranteed by x >= 3 |
There was a problem hiding this comment.
Perhaps
| // x > 1, x >= 1, x != 1 guaranteed by x >= 3 | |
| // x > 1, x >= 1, x != 1 guaranteed by x >= 3 | |
| // (where `guarantee.cmp` is GREATER_EQUAL, `cmp_rhs_bound` is LESS) |
Co-authored-by: Antoine Pitrou <[email protected]>
westonpace
left a comment
There was a problem hiding this comment.
This looks great, thanks for figuring this out. It seems there would be some advantage whenever I filter parquet files with an equality to add is_valid if that column might contain nulls. For example:
(ds.field(x) < 10) & is_valid(ds.field(x)) will eliminate a row group with min 12 and null_count > 0 where ds.field(x) < 10 will not (although the filtering will be very fast we will still have to decode the row group).
I don't know if this is worth documenting somewhere or if it is too obscure to include.
|
|
||
| if ((*cmp & guarantee.cmp) == 0) { | ||
| // guarantee disjoint with filter, so all data will be excluded | ||
| // x > 1, x >= 1, x != 1 unsatisfiable if x == 1 |
There was a problem hiding this comment.
x >= 1 is satisfiable if x == 1 (those two are not disjoint).
| } | ||
|
|
||
| if (*cmp & Comparison::GetFlipped(cmp_rhs_bound)) { | ||
| // x > 1, x >= 1, x != 1 guaranteed by x >= 3 |
There was a problem hiding this comment.
This is hard to reason but I agree with the conclusions :)
x > 3, x >= 3 always true if guaranteed x > 5
* cmp_rhs_bound will be <
* cmp will be > or >=
* guarantee.cmp will be >
* Your logic simplifies to true (correct)
x < 5, x <= 5 always true if guaranteed x < 2
* cmp_rhs_bound will be >
* cmp will be < or <=
* guarantee.cmp will be <
* Your logic simplifies to true (correct)
x != 5 always true if guaranteed x < 3 or x > 7
* cmp_rhs_bound will be > or <
* cmp will be <>
* guarantee.cmp will be < or > (always disjoint with cmp_rhs_bound)
* Your logic simplifies to true (correct)
x > 5, x >= 5 always false if guaranteed x < 3
* cmp_rhs_bound is >
* cmp is > or >=
* guarantee.cmp is <
* Your logic simplifies to false (correct)
x < 5, x <= 5 always false if guaranteed x > 7
* cmp_rhs_bound is <
* cmp is < or <=
* guarantee.cmp is >
* Your logic simplifies to false (correct)
x == 5 always false if guaranteed x > 7 or x < 3
* cmp_rhs_bound is < or >
* cmp is ==
* guarantee.cmp is > or < (always disjoint with cmp_rhs_bound)
* Your logic simplifies to false (correct)
Hmm. I guess we are treating guarantees and filters differently. |
|
(Also note @bkietz should get author credit here, I'm just fixing up his old PR and adding some comments/tests) |
Sorry, I wasn't clear. I don't think we should automatically add |
|
Ah - yes, it would be good to make explicit (at least, it would be good to document how we handle nullability in general here) |
I think it would definitely be worthwhile to fix it. It's delicate enough to think about simplifications without this oddity. Also, isn't it surprising as a user for the |
|
Actually, right: I think we already do the right thing.
|
|
Ah, great! |
|
|
|
I think I've addressed everything now, any other comments here? |
pitrou
left a comment
There was a problem hiding this comment.
Thanks a lot! I have a question but this can be merged for 9.0.0.
| std::vector<T> FilterVector(std::vector<T> values, Predicate&& predicate) { | ||
| auto new_end = | ||
| std::remove_if(values.begin(), values.end(), std::forward<Predicate>(predicate)); | ||
| auto new_end = std::stable_partition(values.begin(), values.end(), |
There was a problem hiding this comment.
I'm curious, is there any reason not to use remove_if?
There was a problem hiding this comment.
We want a (hypothetical) keep_if not remove_if though I suppose we could wrap predicate and invert it.
There was a problem hiding this comment.
Yeah, inverting it would probably be slightly more efficient than calling a stable partition (which can allocate temporary memory AFAIU).
|
I'm going to merge + create a test in R to (double) confirm that the intended behavior there is fixed — we could use that PR if we need any cleanups |
… some rows The real fix was in #12891 ([ARROW-12659](https://issues.apache.org/jira/browse/ARROW-12659)) but this adds integration tests from the ticket to confirm this works in R + we don't run into this in the future Closes #12950 from jonkeane/ARROW-15312 Lead-authored-by: Jonathan Keane <[email protected]> Co-authored-by: Neal Richardson <[email protected]> Signed-off-by: Neal Richardson <[email protected]>
Quick follow up to #12891 Closes #12949 from lidavidm/arrow-12659 Authored-by: David Li <[email protected]> Signed-off-by: Yibo Cai <[email protected]>
This rebases #10253 and fixes it up to also address ARROW-15312, including a regression test.
This refactors how inequalities, is_valid, and is_null are treated in expression simplification, and updates the guarantees that the Parquet/Datasets emits for row groups to properly reflect nullability.