Skip to content

ARROW-12659: [C++] Support is_valid as a guarantee#12891

Closed
lidavidm wants to merge 14 commits intoapache:masterfrom
lidavidm:arrow-12659
Closed

ARROW-12659: [C++] Support is_valid as a guarantee#12891
lidavidm wants to merge 14 commits intoapache:masterfrom
lidavidm:arrow-12659

Conversation

@lidavidm
Copy link
Member

This rebases #10253 and fixes it up to also address ARROW-15312, including a regression test.

This refactors how inequalities, is_valid, and is_null are treated in expression simplification, and updates the guarantees that the Parquet/Datasets emits for row groups to properly reflect nullability.

@github-actions
Copy link

@github-actions
Copy link

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@lidavidm
Copy link
Member Author

Hmm, it's crashing because A == null[string] is getting converted to null[string] by FoldConstants. But FilterNode assumes the mask has to be boolean.

@lidavidm
Copy link
Member Author

This needs to return a datum of the right type:

if (GetNullHandling(*call) == compute::NullHandling::INTERSECTION) {
// kernels which always produce intersected validity can be resolved
// to null *now* if any of their inputs is a null literal
for (const auto& argument : call->arguments) {
if (argument.IsNullLiteral()) {
return argument;
}
}
}

Copy link
Member

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice to have this picked up again. One question and two minor comments.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to provide more context here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I turned this into a docstring.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I don't love that this is different from below function names by only a single character. Maybe something like ExtractSingleFieldvalue or something similar?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want an arm for statitics->null_count() == row_count? In which case, we would just return is_null(field_expr)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is handled at

// Optimize for corner case where all values are nulls
if (statistics->num_values() == 0 && statistics->null_count() > 0) {
return is_null(std::move(field_expr));
}

Confusingly enough num_values does not include nulls. See this test which covers this already: https://github.com/apache/arrow/pull/12891/files#diff-d88654840d0432223c1617e8fd9289db0f4e6fff6b34e9f062861ef8eec724fcR256

This writes each record batch to its own row group, so the test would fail if we didn't generate the proper guarantee for the all-null row group.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah thanks for the pointer on num_values.

bkietz and others added 8 commits April 18, 2022 10:57
commit f482049
Author: Benjamin Kietzman <[email protected]>
Date:   Fri May 7 17:22:01 2021 -0400

    sketch of 'correct' nullable caveats in inequality guarantees

commit 0bc5b4d
Author: Benjamin Kietzman <[email protected]>
Date:   Thu May 6 10:28:45 2021 -0400

    add is_valid() guarantee to column statistics expr

commit 212847e
Author: Benjamin Kietzman <[email protected]>
Date:   Wed May 5 16:02:30 2021 -0400

    ARROW-12659: [C++][Compute] Support is_valid as a guarantee
@lidavidm
Copy link
Member Author

CC @pitrou or @westonpace, any comments here?

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot @lidavidm , some comments below.

auto new_end =
std::remove_if(values.begin(), values.end(), std::forward<Predicate>(predicate));
std::vector<T> FilterVector(std::vector<T> values, Predicate&& predicate,
std::vector<T>* filtered_out = NULLPTR) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may be missing something, but it does not seem this third argument is used anywhere?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not used indeed. But the change is still needed since it actually inverts what FilterVector does! (The current FilterVector is backwards of what you would expect…)

auto single_value = compute::equal(field_expr, compute::literal(std::move(min)));

if (statistics->null_count() == 0) {
return compute::and_(single_value, compute::is_valid(field_expr));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it useful to add is_valid here? If a value is equal to min it implies it is valid.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing this does break a test, but it's because right now i64 > 1 doesn't cause is_null(i64) to simplify - we need to be a little smarter here. Will fix that.

min = maybe_min.MoveValueUnsafe();
max = maybe_max.MoveValueUnsafe();

compute::Expression range;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This variable doesn't seem used?

}

if (guarantee.cmp & cmp_rhs_bound) {
// x > 1, x >= 1, x != 1 cannot use guarantee x >= 3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is contradicted by the next comment below, did you make a mistake?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps (with rhs being 1 and bound being 0):

// x > 1, x >= 1, x != 1 cannot use guarantee x >= 0
// (where `guarantee.cmp` is GREATER_EQUAL, `cmp_rhs_bound` is GREATER)

return expr;
}

if (guarantee.cmp & cmp_rhs_bound) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is cryptic, what is this condition supposed to imply?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's unclear to me after writing out some examples (I don't think it handles all cases right either as you note with the incorrect comment)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try to replace this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or actually, I straight up just don't understand this…will try to clear this up…

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, added comments here and for the other feedback to try to clarify things. This conditional is surprisingly subtle…

Comment on lines 924 to 926
Comparison::type cmp;
const FieldRef& target;
const Datum& bound;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you like to add a comment explaining what the terms are? Is it target <cmp> bound or bound <cmp> target?

Comparison::type cmp;
const FieldRef& target;
const Datum& bound;
bool nullable;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this "the target can be null"?

}

if (*cmp & Comparison::GetFlipped(cmp_rhs_bound)) {
// x > 1, x >= 1, x != 1 guaranteed by x >= 3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps

Suggested change
// x > 1, x >= 1, x != 1 guaranteed by x >= 3
// x > 1, x >= 1, x != 1 guaranteed by x >= 3
// (where `guarantee.cmp` is GREATER_EQUAL, `cmp_rhs_bound` is LESS)

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, thanks for figuring this out. It seems there would be some advantage whenever I filter parquet files with an equality to add is_valid if that column might contain nulls. For example:

(ds.field(x) < 10) & is_valid(ds.field(x)) will eliminate a row group with min 12 and null_count > 0 where ds.field(x) < 10 will not (although the filtering will be very fast we will still have to decode the row group).

I don't know if this is worth documenting somewhere or if it is too obscure to include.


if ((*cmp & guarantee.cmp) == 0) {
// guarantee disjoint with filter, so all data will be excluded
// x > 1, x >= 1, x != 1 unsatisfiable if x == 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

x >= 1 is satisfiable if x == 1 (those two are not disjoint).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, fixed.

}

if (*cmp & Comparison::GetFlipped(cmp_rhs_bound)) {
// x > 1, x >= 1, x != 1 guaranteed by x >= 3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is hard to reason but I agree with the conclusions :)

x > 3, x >= 3 always true if guaranteed x > 5
  * cmp_rhs_bound will be <
  * cmp will be > or >=
  * guarantee.cmp will be >
  * Your logic simplifies to true (correct)
x < 5, x <= 5 always true if guaranteed x < 2
  * cmp_rhs_bound will be >
  * cmp will be < or <=
  * guarantee.cmp will be <
  * Your logic simplifies to true (correct)
x != 5 always true if guaranteed x < 3 or x > 7
  * cmp_rhs_bound will be > or <
  * cmp will be <>
  * guarantee.cmp will be < or > (always disjoint with cmp_rhs_bound)
  * Your logic simplifies to true (correct)
x > 5, x >= 5 always false if guaranteed x < 3
  * cmp_rhs_bound is >
  * cmp is > or >=
  * guarantee.cmp is <
  * Your logic simplifies to false (correct)
x < 5, x <= 5 always false if guaranteed x > 7
  * cmp_rhs_bound is <
  * cmp is < or <=
  * guarantee.cmp is >
  * Your logic simplifies to false (correct)
x == 5 always false if guaranteed x > 7 or x < 3
  * cmp_rhs_bound is < or >
  * cmp is ==
  * guarantee.cmp is > or < (always disjoint with cmp_rhs_bound)
  * Your logic simplifies to false (correct)

@lidavidm
Copy link
Member Author

This looks great, thanks for figuring this out. It seems there would be some advantage whenever I filter parquet files with an equality to add is_valid if that column might contain nulls. For example:

(ds.field(x) < 10) & is_valid(ds.field(x)) will eliminate a row group with min 12 and null_count > 0 where ds.field(x) < 10 will not (although the filtering will be very fast we will still have to decode the row group).

I don't know if this is worth documenting somewhere or if it is too obscure to include.

Hmm. I guess we are treating guarantees and filters differently. x < 10 as a guarantee implies is_valid(x), but not as a filter. We may want to fix that, but that would also be a drastic change.

@lidavidm
Copy link
Member Author

(Also note @bkietz should get author credit here, I'm just fixing up his old PR and adding some comments/tests)

@westonpace
Copy link
Member

Hmm. I guess we are treating guarantees and filters differently. x < 10 as a guarantee implies is_valid(x), but not as a filter. We may want to fix that, but that would also be a drastic change.

Sorry, I wasn't clear. I don't think we should automatically add is_valid(x). I'm just wondering if this is something we ought to document since it may not be intuitive to users. I can probably add something when we talk about pushdown filtering in the datasets docs.

@lidavidm
Copy link
Member Author

Ah - yes, it would be good to make explicit (at least, it would be good to document how we handle nullability in general here)

@pitrou
Copy link
Member

pitrou commented Apr 20, 2022

Hmm. I guess we are treating guarantees and filters differently. x < 10 as a guarantee implies is_valid(x), but not as a filter. We may want to fix that, but that would also be a drastic change.

I think it would definitely be worthwhile to fix it. It's delicate enough to think about simplifications without this oddity.

Also, isn't it surprising as a user for the x < 10 filter to accept nulls? It wouldn't in SQL for example.

@lidavidm
Copy link
Member Author

lidavidm commented Apr 20, 2022

Actually, right: I think we already do the right thing.

(i32 < 10) with the guarantee ((i32 >= 12) or is_null(i32, {nan_is_null=false})) simplifies to invert(true_unless_null(i32)) which is not satisfiable, so the row group will be pruned. (We could add a pass to simplify it all the way down to literal(false) in the first place if we want.)

@pitrou
Copy link
Member

pitrou commented Apr 20, 2022

Ah, great!

@westonpace
Copy link
Member

westonpace commented Apr 20, 2022

invert(true_unless_null(i32)) being not satisfiable is probably more correct than simplifying to literal(false) so I wouldn't recommend simplifying to literal(false). I agree, it sounds like we are doing the right thing, sorry for the wild goose chase.

@lidavidm
Copy link
Member Author

I think I've addressed everything now, any other comments here?

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot! I have a question but this can be merged for 9.0.0.

std::vector<T> FilterVector(std::vector<T> values, Predicate&& predicate) {
auto new_end =
std::remove_if(values.begin(), values.end(), std::forward<Predicate>(predicate));
auto new_end = std::stable_partition(values.begin(), values.end(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious, is there any reason not to use remove_if?

Copy link
Member Author

@lidavidm lidavidm Apr 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want a (hypothetical) keep_if not remove_if though I suppose we could wrap predicate and invert it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, inverting it would probably be slightly more efficient than calling a stable partition (which can allocate temporary memory AFAIU).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow up at #12949

@jonkeane
Copy link
Member

I'm going to merge + create a test in R to (double) confirm that the intended behavior there is fixed — we could use that PR if we need any cleanups

@jonkeane jonkeane closed this in 0e03af4 Apr 21, 2022
@lidavidm lidavidm deleted the arrow-12659 branch April 21, 2022 19:16
nealrichardson added a commit that referenced this pull request Apr 22, 2022
… some rows

The real fix was in #12891 ([ARROW-12659](https://issues.apache.org/jira/browse/ARROW-12659)) but this adds integration tests from the ticket to confirm this works in R + we don't run into this in the future

Closes #12950 from jonkeane/ARROW-15312

Lead-authored-by: Jonathan Keane <[email protected]>
Co-authored-by: Neal Richardson <[email protected]>
Signed-off-by: Neal Richardson <[email protected]>
cyb70289 pushed a commit that referenced this pull request Apr 24, 2022
Quick follow up to #12891

Closes #12949 from lidavidm/arrow-12659

Authored-by: David Li <[email protected]>
Signed-off-by: Yibo Cai <[email protected]>
@ursabot
Copy link

ursabot commented Apr 24, 2022

Benchmark runs are scheduled for baseline = 20ec0fd and contender = 0e03af4. 0e03af4 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed] test-mac-arm
[Failed ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.42% ⬆️0.34%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/576| 0e03af44 ec2-t3-xlarge-us-east-2>
[Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/564| 0e03af44 test-mac-arm>
[Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/562| 0e03af44 ursa-i9-9960x>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/574| 0e03af44 ursa-thinkcentre-m75q>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/575| 20ec0fda ec2-t3-xlarge-us-east-2>
[Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/563| 20ec0fda test-mac-arm>
[Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/561| 20ec0fda ursa-i9-9960x>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/573| 20ec0fda ursa-thinkcentre-m75q>
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants