ARROW-12659: [C++] Support is_valid as a guarantee by lidavidm · Pull Request #12891 · apache/arrow

lidavidm · 2022-04-14T17:03:27Z

This rebases #10253 and fixes it up to also address ARROW-15312, including a regression test.

This refactors how inequalities, is_valid, and is_null are treated in expression simplification, and updates the guarantees that Parquet/Datasets emit for row groups to properly reflect nullability.

github-actions · 2022-04-14T17:03:49Z

https://issues.apache.org/jira/browse/ARROW-12659

github-actions · 2022-04-14T17:03:50Z

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

lidavidm · 2022-04-14T20:45:48Z

Hmm, it's crashing because A == null[string] is getting converted to null[string] by FoldConstants. But FilterNode assumes the mask has to be boolean.

lidavidm · 2022-04-14T20:47:55Z

This needs to return a datum of the right type:

arrow/cpp/src/arrow/compute/exec/expression.cc

Lines 632 to 640 in 63d2a9c

    
    if (GetNullHandling(*call) == compute::NullHandling::INTERSECTION) {
      // kernels which always produce intersected validity can be resolved
      // to null *now* if any of their inputs is a null literal
      for (const auto& argument : call->arguments) {
        if (argument.IsNullLiteral()) {
          return argument;
        }
      }
    }
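
To illustrate the problem described above, here is a minimal sketch using toy stand-in types rather than Arrow's real Expression/Datum classes. The names `Type`, `Literal`, `Call`, and `FoldNullIntersection` are all hypothetical. The idea is that the folded null should carry the call's output type (boolean for a comparison), not the type of the null argument (string):

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Toy stand-ins for literal and call expressions (hypothetical names,
// not Arrow's actual classes).
enum class Type { Boolean, Int64, String };

struct Literal {
  Type type;
  bool is_null;
};

struct Call {
  std::string function;            // e.g. "equal"
  Type output_type;                // type the kernel would produce
  std::vector<Literal> arguments;  // only literals, to keep the sketch small
};

// If any argument is a null literal, fold the whole call to a null literal.
// The result is typed as the call's output, not as the null argument.
std::optional<Literal> FoldNullIntersection(const Call& call) {
  for (const Literal& argument : call.arguments) {
    if (argument.is_null) {
      return Literal{call.output_type, /*is_null=*/true};
    }
  }
  return std::nullopt;  // nothing to fold
}
```

With this shape, folding `equal(A, null[string])` yields `null[bool]`, which a consumer expecting a boolean mask can accept.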

wjones127

Nice to have this picked up again. One question and two minor comments.

wjones127 · 2022-04-15T15:40:04Z

cpp/src/arrow/compute/exec/expression.cc

Do we want to provide more context here?

I turned this into a docstring.

wjones127 · 2022-04-15T15:44:39Z

cpp/src/arrow/compute/exec/expression.cc

nit: I don't love that this differs from the function names below by only a single character. Maybe something like ExtractSingleFieldvalue or something similar?

wjones127 · 2022-04-15T16:43:29Z

cpp/src/arrow/dataset/file_parquet.cc

Do we want an arm for statistics->null_count() == row_count? In which case, we would just return is_null(field_expr)?

I believe this is handled at

arrow/cpp/src/arrow/dataset/file_parquet.cc

Lines 118 to 121 in fae66cb

// Optimize for corner case where all values are nulls
if (statistics->num_values() == 0 && statistics->null_count() > 0) {
  return is_null(std::move(field_expr));
}

Confusingly enough num_values does not include nulls. See this test which covers this already: https://github.com/apache/arrow/pull/12891/files#diff-d88654840d0432223c1617e8fd9289db0f4e6fff6b34e9f062861ef8eec724fcR256

This writes each record batch to its own row group, so the test would fail if we didn't generate the proper guarantee for the all-null row group.

Ah thanks for the pointer on num_values.
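
As a rough illustration of the point about num_values, the sketch below derives a textual guarantee from simplified statistics. `ColumnStats` and `GuaranteeFor` are hypothetical names, the statistics are reduced to integers, and the output is a plain string rather than a compute::Expression, so this is not the code in file_parquet.cc:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical, simplified view of per-row-group column statistics.
struct ColumnStats {
  int64_t num_values;  // count of non-null values (nulls are NOT included)
  int64_t null_count;  // count of null values
  int64_t min;         // only meaningful when num_values > 0
  int64_t max;         // only meaningful when num_values > 0
};

// Build a human-readable guarantee for `field` from its statistics.
std::string GuaranteeFor(const std::string& field, const ColumnStats& stats) {
  // All-null row group: no non-null values, but at least one null.
  if (stats.num_values == 0 && stats.null_count > 0) {
    return "is_null(" + field + ")";
  }
  std::string range;
  if (stats.min == stats.max) {
    range = field + " == " + std::to_string(stats.min);
  } else {
    range = "(" + field + " >= " + std::to_string(stats.min) + " and " + field +
            " <= " + std::to_string(stats.max) + ")";
  }
  // If nulls may be present, the guarantee has to admit them explicitly.
  if (stats.null_count > 0) {
    return "(" + range + " or is_null(" + field + "))";
  }
  return range;
}
```

For example, statistics with min 12, max 20 and two nulls give `((x >= 12 and x <= 20) or is_null(x))`, while a row group holding only nulls gives `is_null(x)`.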

commit f482049
Author: Benjamin Kietzman <[email protected]>
Date: Fri May 7 17:22:01 2021 -0400

    sketch of 'correct' nullable caveats in inequality guarantees

commit 0bc5b4d
Author: Benjamin Kietzman <[email protected]>
Date: Thu May 6 10:28:45 2021 -0400

    add is_valid() guarantee to column statistics expr

commit 212847e
Author: Benjamin Kietzman <[email protected]>
Date: Wed May 5 16:02:30 2021 -0400

    ARROW-12659: [C++][Compute] Support is_valid as a guarantee

lidavidm · 2022-04-19T13:22:20Z

CC @pitrou or @westonpace, any comments here?

pitrou

Thanks a lot @lidavidm , some comments below.

cpp/src/arrow/compute/kernels/scalar_validity_test.cc

pitrou · 2022-04-19T14:01:39Z

cpp/src/arrow/util/vector.h

-  auto new_end =
-      std::remove_if(values.begin(), values.end(), std::forward<Predicate>(predicate));
+std::vector<T> FilterVector(std::vector<T> values, Predicate&& predicate,
+                            std::vector<T>* filtered_out = NULLPTR) {


I may be missing something, but it does not seem this third argument is used anywhere?

It's not used indeed. But the change is still needed since it actually inverts what FilterVector does! (The current FilterVector is backwards of what you would expect…)

pitrou · 2022-04-19T14:04:32Z

cpp/src/arrow/dataset/file_parquet.cc

+      auto single_value = compute::equal(field_expr, compute::literal(std::move(min)));
+
+      if (statistics->null_count() == 0) {
+        return compute::and_(single_value, compute::is_valid(field_expr));


Is it useful to add is_valid here? If a value is equal to min it implies it is valid.

Removing this does break a test, but it's because right now i64 > 1 doesn't cause is_null(i64) to simplify - we need to be a little smarter here. Will fix that.

pitrou · 2022-04-19T14:05:05Z

cpp/src/arrow/dataset/file_parquet.cc

+    min = maybe_min.MoveValueUnsafe();
+    max = maybe_max.MoveValueUnsafe();
+
+    compute::Expression range;


This variable doesn't seem used?

cpp/src/arrow/dataset/file_parquet.cc

pitrou · 2022-04-19T14:44:49Z

cpp/src/arrow/compute/exec/expression.cc

+    }
+
+    if (guarantee.cmp & cmp_rhs_bound) {
+      // x > 1, x >= 1, x != 1 cannot use guarantee x >= 3


This is contradicted by the next comment below, did you make a mistake?

Perhaps (with rhs being 1 and bound being 0):

// x > 1, x >= 1, x != 1 cannot use guarantee x >= 0
// (where `guarantee.cmp` is GREATER_EQUAL, `cmp_rhs_bound` is GREATER)

pitrou · 2022-04-19T14:50:15Z

cpp/src/arrow/compute/exec/expression.cc

+      return expr;
+    }
+
+    if (guarantee.cmp & cmp_rhs_bound) {


This is cryptic, what is this condition supposed to imply?

It's unclear to me after writing out some examples (I don't think it handles all cases right either as you note with the incorrect comment)

I'll try to replace this

Or actually, I straight up just don't understand this…will try to clear this up…

Alright, added comments here and for the other feedback to try to clarify things. This conditional is surprisingly subtle…

pitrou · 2022-04-19T14:51:51Z

cpp/src/arrow/compute/exec/expression.cc

+  Comparison::type cmp;
+  const FieldRef& target;
+  const Datum& bound;


Would you like to add a comment explaining what the terms are? Is it target <cmp> bound or bound <cmp> target?

pitrou · 2022-04-19T14:52:11Z

cpp/src/arrow/compute/exec/expression.cc

+  Comparison::type cmp;
+  const FieldRef& target;
+  const Datum& bound;
+  bool nullable;


Is this "the target can be null"?

pitrou · 2022-04-19T15:06:36Z

cpp/src/arrow/compute/exec/expression.cc

+    }
+
+    if (*cmp & Comparison::GetFlipped(cmp_rhs_bound)) {
+      // x > 1, x >= 1, x != 1 guaranteed by x >= 3


Perhaps

Suggested change

-      // x > 1, x >= 1, x != 1 guaranteed by x >= 3
+      // x > 1, x >= 1, x != 1 guaranteed by x >= 3
+      // (where `guarantee.cmp` is GREATER_EQUAL, `cmp_rhs_bound` is LESS)

Co-authored-by: Antoine Pitrou <[email protected]>

westonpace

This looks great, thanks for figuring this out. It seems there would be some advantage whenever I filter parquet files with an equality to add is_valid if that column might contain nulls. For example:

(ds.field(x) < 10) & is_valid(ds.field(x)) will eliminate a row group with min 12 and null_count > 0 where ds.field(x) < 10 will not (although the filtering will be very fast we will still have to decode the row group).

I don't know if this is worth documenting somewhere or if it is too obscure to include.

westonpace · 2022-04-19T21:22:22Z

cpp/src/arrow/compute/exec/expression.cc

+
+      if ((*cmp & guarantee.cmp) == 0) {
+        // guarantee disjoint with filter, so all data will be excluded
+        // x > 1, x >= 1, x != 1 unsatisfiable if x == 1


x >= 1 is satisfiable if x == 1 (those two are not disjoint).

Indeed, fixed.

westonpace · 2022-04-19T22:02:53Z

cpp/src/arrow/compute/exec/expression.cc

+    }
+
+    if (*cmp & Comparison::GetFlipped(cmp_rhs_bound)) {
+      // x > 1, x >= 1, x != 1 guaranteed by x >= 3


This is hard to reason but I agree with the conclusions :)

x > 3, x >= 3 always true if guaranteed x > 5
* cmp_rhs_bound will be <
* cmp will be > or >=
* guarantee.cmp will be >
* Your logic simplifies to true (correct)

x < 5, x <= 5 always true if guaranteed x < 2
* cmp_rhs_bound will be >
* cmp will be < or <=
* guarantee.cmp will be <
* Your logic simplifies to true (correct)

x != 5 always true if guaranteed x < 3 or x > 7
* cmp_rhs_bound will be > or <
* cmp will be <>
* guarantee.cmp will be < or > (always disjoint with cmp_rhs_bound)
* Your logic simplifies to true (correct)

x > 5, x >= 5 always false if guaranteed x < 3
* cmp_rhs_bound is >
* cmp is > or >=
* guarantee.cmp is <
* Your logic simplifies to false (correct)

x < 5, x <= 5 always false if guaranteed x > 7
* cmp_rhs_bound is <
* cmp is < or <=
* guarantee.cmp is >
* Your logic simplifies to false (correct)

x == 5 always false if guaranteed x > 7 or x < 3
* cmp_rhs_bound is < or >
* cmp is ==
* guarantee.cmp is > or < (always disjoint with cmp_rhs_bound)
* Your logic simplifies to false (correct)
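
The case analysis above can be checked mechanically with a small sketch. It assumes comparisons are encoded as bit flags (EQUAL, LESS, GREATER, with the compound comparisons as unions of those bits), uses plain ints for the filter's right-hand side and the guarantee's bound, and ignores nullability entirely. `Cmp`, `Simplified`, `Flip`, `Compare`, and `SimplifyFilter` are hypothetical names that only mirror the three conditionals discussed in this thread:

```cpp
#include <cassert>

// Bit flags: compound comparisons are unions of the three basic outcomes.
enum Cmp : int {
  EQUAL = 1,
  LESS = 2,
  GREATER = 4,
  NOT_EQUAL = LESS | GREATER,
  LESS_EQUAL = LESS | EQUAL,
  GREATER_EQUAL = GREATER | EQUAL,
};

enum class Simplified { AlwaysTrue, AlwaysFalse, Unchanged };

// Swap the LESS and GREATER bits, keeping the EQUAL bit.
int Flip(int cmp) {
  int flipped = cmp & EQUAL;
  if (cmp & LESS) flipped |= GREATER;
  if (cmp & GREATER) flipped |= LESS;
  return flipped;
}

// Outcome of comparing two integers, as a single flag.
Cmp Compare(int lhs, int rhs) {
  return lhs == rhs ? EQUAL : (lhs < rhs ? LESS : GREATER);
}

// Simplify the filter `x <cmp> rhs` given the guarantee `x <guarantee_cmp> bound`.
Simplified SimplifyFilter(int cmp, int rhs, int guarantee_cmp, int bound) {
  const int cmp_rhs_bound = Compare(rhs, bound);
  if (cmp_rhs_bound == EQUAL) {
    // Same constant on both sides: compare the comparison sets directly.
    if ((cmp & guarantee_cmp) == guarantee_cmp) return Simplified::AlwaysTrue;
    if ((cmp & guarantee_cmp) == 0) return Simplified::AlwaysFalse;
    return Simplified::Unchanged;
  }
  // The guarantee's range extends towards rhs, so it may straddle rhs.
  if (guarantee_cmp & cmp_rhs_bound) return Simplified::Unchanged;
  // The guaranteed values all lie strictly on one side of rhs.
  return (cmp & Flip(cmp_rhs_bound)) ? Simplified::AlwaysTrue
                                     : Simplified::AlwaysFalse;
}
```

Under these assumptions, `SimplifyFilter(GREATER, 3, GREATER, 5)` corresponds to the first case (x > 3 guaranteed by x > 5), and `SimplifyFilter(GREATER, 1, GREATER_EQUAL, 0)` corresponds to the "cannot use guarantee x >= 0" comment.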

lidavidm · 2022-04-19T22:43:27Z

This looks great, thanks for figuring this out. It seems there would be some advantage whenever I filter parquet files with an equality to add is_valid if that column might contain nulls. For example:

(ds.field(x) < 10) & is_valid(ds.field(x)) will eliminate a row group with min 12 and null_count > 0 where ds.field(x) < 10 will not (although the filtering will be very fast we will still have to decode the row group).

I don't know if this is worth documenting somewhere or if it is too obscure to include.

Hmm. I guess we are treating guarantees and filters differently. x < 10 as a guarantee implies is_valid(x), but not as a filter. We may want to fix that, but that would also be a drastic change.

lidavidm · 2022-04-19T22:46:54Z

(Also note @bkietz should get author credit here, I'm just fixing up his old PR and adding some comments/tests)

westonpace · 2022-04-19T22:51:30Z

Hmm. I guess we are treating guarantees and filters differently. x < 10 as a guarantee implies is_valid(x), but not as a filter. We may want to fix that, but that would also be a drastic change.

Sorry, I wasn't clear. I don't think we should automatically add is_valid(x). I'm just wondering if this is something we ought to document since it may not be intuitive to users. I can probably add something when we talk about pushdown filtering in the datasets docs.

lidavidm · 2022-04-19T23:00:37Z

Ah - yes, it would be good to make explicit (at least, it would be good to document how we handle nullability in general here)

pitrou · 2022-04-20T07:16:54Z

Hmm. I guess we are treating guarantees and filters differently. x < 10 as a guarantee implies is_valid(x), but not as a filter. We may want to fix that, but that would also be a drastic change.

I think it would definitely be worthwhile to fix it. It's delicate enough to think about simplifications without this oddity.

Also, isn't it surprising as a user for the x < 10 filter to accept nulls? It wouldn't in SQL for example.

lidavidm · 2022-04-20T13:09:08Z

Actually, right: I think we already do the right thing.

(i32 < 10) with the guarantee ((i32 >= 12) or is_null(i32, {nan_is_null=false})) simplifies to invert(true_unless_null(i32)) which is not satisfiable, so the row group will be pruned. (We could add a pass to simplify it all the way down to literal(false) in the first place if we want.)
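
A toy model of why invert(true_unless_null(i32)) is never satisfiable, assuming a filter keeps a row only when the mask is exactly true. Nulls are modelled with std::optional, and `TrueUnlessNull`, `Invert`, `Less`, and `Keeps` are hypothetical helpers written for this sketch, not Arrow kernels:

```cpp
#include <cassert>
#include <optional>

using NullableBool = std::optional<bool>;  // std::nullopt models a null mask entry
using NullableInt = std::optional<int>;    // std::nullopt models a null value

// true for valid inputs, null for null inputs.
NullableBool TrueUnlessNull(const NullableInt& x) {
  if (!x.has_value()) return std::nullopt;
  return true;
}

// Boolean negation that propagates null.
NullableBool Invert(const NullableBool& b) {
  if (!b.has_value()) return std::nullopt;
  return !*b;
}

// `x < rhs` with null propagation.
NullableBool Less(const NullableInt& x, int rhs) {
  if (!x.has_value()) return std::nullopt;
  return *x < rhs;
}

// A filter keeps a row only when the mask entry is exactly true.
bool Keeps(const NullableBool& mask) { return mask.value_or(false); }
```

For any x that satisfies the guarantee (x >= 12 or x is null), `Less(x, 10)` and `Invert(TrueUnlessNull(x))` agree: both are false for valid values and null for nulls, so `Keeps` never returns true and the row group can be skipped.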

pitrou · 2022-04-20T13:45:30Z

Ah, great!

westonpace · 2022-04-20T20:03:22Z

invert(true_unless_null(i32)) being not satisfiable is probably more correct than simplifying to literal(false) so I wouldn't recommend simplifying to literal(false). I agree, it sounds like we are doing the right thing, sorry for the wild goose chase.

lidavidm · 2022-04-21T16:15:52Z

I think I've addressed everything now, any other comments here?

pitrou

Thanks a lot! I have a question but this can be merged for 9.0.0.

pitrou · 2022-04-21T16:24:26Z

cpp/src/arrow/util/vector.h

 std::vector<T> FilterVector(std::vector<T> values, Predicate&& predicate) {
-  auto new_end =
-      std::remove_if(values.begin(), values.end(), std::forward<Predicate>(predicate));
+  auto new_end = std::stable_partition(values.begin(), values.end(),


I'm curious, is there any reason not to use remove_if?

We want a (hypothetical) keep_if not remove_if though I suppose we could wrap predicate and invert it.

Yeah, inverting it would probably be slightly more efficient than calling a stable partition (which can allocate temporary memory AFAIU).

Follow up at #12949
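
A minimal sketch of the "keep_if via remove_if" idea from this exchange: wrap the predicate in a lambda that negates it, so that elements matching the predicate are kept. This is an illustration of the approach, not necessarily the exact code in the follow-up:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Keep the elements for which `predicate` returns true, preserving their order.
template <typename T, typename Predicate>
std::vector<T> FilterVector(std::vector<T> values, Predicate&& predicate) {
  // std::remove_if drops matches, so negate the predicate to get keep-if semantics.
  auto new_end = std::remove_if(values.begin(), values.end(),
                                [&](const T& value) { return !predicate(value); });
  values.erase(new_end, values.end());
  return values;
}
```

Calling `FilterVector(std::vector<int>{1, 2, 3, 4}, [](int v) { return v % 2 == 0; })` keeps the even elements, and std::remove_if works in place without the temporary buffer that std::stable_partition may allocate.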

jonkeane · 2022-04-21T18:54:25Z

I'm going to merge + create a test in R to (double) confirm that the intended behavior there is fixed — we could use that PR if we need any cleanups

… some rows

The real fix was in #12891 ([ARROW-12659](https://issues.apache.org/jira/browse/ARROW-12659)) but this adds integration tests from the ticket to confirm this works in R + we don't run into this in the future

Closes #12950 from jonkeane/ARROW-15312

Lead-authored-by: Jonathan Keane <[email protected]>
Co-authored-by: Neal Richardson <[email protected]>
Signed-off-by: Neal Richardson <[email protected]>

Quick follow up to #12891

Closes #12949 from lidavidm/arrow-12659

Authored-by: David Li <[email protected]>
Signed-off-by: Yibo Cai <[email protected]>

ursabot · 2022-04-24T14:31:03Z

github-actions bot added the Component: C++ label Apr 14, 2022

wjones127 reviewed Apr 15, 2022

bkietz and others added 8 commits April 18, 2022 10:57

ARROW-12659: [C++] Support is_valid in guarantees

f95116e

ARROW-15312: [C++] Make Parquet guarantees more precise

6e70e35

ARROW-12659: [C++] Simplify true_unless_null

bd155ad

ARROW-15312: [C++] Add test to other formats

a6fd734

ARROW-12659: [C++] Fix test

d510112

ARROW-12659: [C++] Fix simplify equals(A, null[string]) == null[bool]

c86f474

ARROW-12659: [C++] Clarify function naming

37f8605

lidavidm force-pushed the arrow-12659 branch from 3f009ab to 37f8605 Compare April 18, 2022 14:58

wjones127 approved these changes Apr 18, 2022

pitrou reviewed Apr 19, 2022

lidavidm and others added 5 commits April 19, 2022 15:37

Update cpp/src/arrow/dataset/file_parquet.cc

df9c8b4

Co-authored-by: Antoine Pitrou <[email protected]>

ARROW-12659: [C++] Address some review comments

e38e3f4

ARROW-12659: [C++] Simplify is_valid/is_null against inequalities

ba7901b

ARROW-12659: [C++] Handle non-kleene and in IsSatisfiable

943e6c9

ARROW-12659: [C++] Add doc comments

b25b9f6

westonpace reviewed Apr 19, 2022

ARROW-12659: [C++] Fix comment

da43f7d

pitrou approved these changes Apr 21, 2022

jonkeane closed this in 0e03af4 Apr 21, 2022

lidavidm deleted the arrow-12659 branch April 21, 2022 19:16

lidavidm mentioned this pull request Apr 21, 2022

MINOR: [C++] Use remove_if #12949

Closed

jonkeane mentioned this pull request Apr 21, 2022

ARROW-15312: [R][C++] filtering a Parquet dataset with is.na() misses some rows #12950

Closed

cyb70289 pushed a commit that referenced this pull request Apr 24, 2022

MINOR: [C++] Use remove_if

0b06870

This was referenced Apr 24, 2022

[C++][Compute] Support SimplifyWithGuarantee(is_null(foo), is_valid(foo)) #28408

Closed

[R][C++] filtering a Parquet dataset with is.na() misses some rows #20068

Closed
