Fix null comparison for Parquet pruning predicate by viirya · Pull Request #1595 · apache/datafusion

viirya · 2022-01-17T10:01:02Z

Which issue does this PR close?

Closes #1591.

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

alamb · 2022-01-17T15:17:07Z

Thank you @viirya -- I will try to review this carefully, but likely won't be able to do so until tomorrow

viirya · 2022-01-17T18:24:49Z

Thank you @alamb

houqp

Thanks @viirya for the quick fix!

alamb

Thanks @viirya and @houqp

BLUF: I am fairly sure this change is ok, but I am not sure why it is needed; I have outlined my confusions below.

Note I ran this change against the IOx test suite in https://github.com/influxdata/influxdb_iox/pull/3479 and it was good.

Confusion 1: Doesn't follow definition of a pruning predicate:

As a reminder, the pruning predicate definition is

    /// A pruning predicate is one that has been rewritten in terms of
    /// the min and max values of column references and that evaluates
    /// to FALSE if the filter predicate would evaluate FALSE *for
    /// every row* whose values fell within the min / max ranges (aka
    /// could be pruned).
    ///
    /// The pruning predicate evaluates to TRUE or NULL
    /// if the filter predicate *might* evaluate to TRUE for at least
    /// one row whose values fell within the min/max ranges (in other
    /// words they might pass the predicate)

Thus, a "TRUE" or "NULL" result for a pruning predicate means the row group must be kept. This is the safe behavior -- the row group should be removed only if it is 100% certain that the predicate will evaluate to FALSE.
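To make the keep/prune rule concrete, here is a minimal sketch in Rust. It assumes the three-valued result of a pruning predicate is modeled as `Option<bool>`, with `None` standing in for NULL. The name `keep_container` is hypothetical and is not a DataFusion API.

```rust
/// Decide whether a container (e.g. a Parquet row group) must be kept,
/// given the three-valued result of its pruning predicate.
/// `None` models SQL NULL. Hypothetical helper, not a DataFusion API.
fn keep_container(pruning_result: Option<bool>) -> bool {
    // Only a definite FALSE allows pruning; TRUE and NULL both keep.
    pruning_result != Some(false)
}
```

Under this model, `keep_container(None)` is `true`: an unknown result errs on the side of reading the row group.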

In this case, x = null doesn't seem to satisfy the stated conditions in PruningPredicate. x = null evaluates to null for all values (both null and non-null), as can be seen in Postgres:

    alamb=# select x, x = null from foo;
     x | ?column?
    ---+----------
     1 |
       |
     2 |
    (3 rows)

    alamb=#
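The same behavior can be sketched in Rust, assuming nullable integers are modeled as `Option<i64>` and NULL as `None`. The helper names `sql_eq` and `eq_null_column` are made up for illustration.

```rust
/// SQL-style equality over nullable integers; `None` models NULL.
fn sql_eq(a: Option<i64>, b: Option<i64>) -> Option<bool> {
    match (a, b) {
        (Some(x), Some(y)) => Some(x == y),
        // If either side is NULL, the comparison result is NULL.
        _ => None,
    }
}

/// Evaluate `x = NULL` for every row of a column.
fn eq_null_column(xs: &[Option<i64>]) -> Vec<Option<bool>> {
    xs.iter().map(|&x| sql_eq(x, None)).collect()
}
```

Calling `eq_null_column(&[Some(1), None, Some(2)])` yields three `None` values, matching the three empty cells in the query output.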

Thus there is something wrong. Either:

the pruning predicate definition should be updated to say that a pruning predicate returns FALSE if every row would evaluate to FALSE or NULL (which seems reasonable, as only rows that evaluate to TRUE pass a predicate, not rows that evaluate to NULL), or
this is not a correct transformation.

Confusion 2: Why are we treating `=` specially?

If we go with this PR, I don't see any reason to handle = specially, as the same argument applies to other operators such as !=, >, etc. (though it does not apply to IS DISTINCT FROM / IS NOT DISTINCT FROM).
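A short sketch of why the other comparison operators behave like `=` while the distinctness operators do not, again assuming `Option<i64>` models a nullable integer. All three function names are hypothetical.

```rust
/// SQL-style `>`: NULL if either operand is NULL.
fn sql_gt(a: Option<i64>, b: Option<i64>) -> Option<bool> {
    match (a, b) {
        (Some(x), Some(y)) => Some(x > y),
        _ => None,
    }
}

/// SQL-style `!=`: NULL if either operand is NULL.
fn sql_not_eq(a: Option<i64>, b: Option<i64>) -> Option<bool> {
    match (a, b) {
        (Some(x), Some(y)) => Some(x != y),
        _ => None,
    }
}

/// SQL-style `IS DISTINCT FROM`: never NULL; NULL is treated as a
/// comparable value, so two NULLs are not distinct.
fn is_distinct_from(a: Option<i64>, b: Option<i64>) -> bool {
    a != b
}
```

Comparing anything against `None` with `sql_gt` or `sql_not_eq` gives `None`, whereas `is_distinct_from` always returns a definite `bool`.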

alamb · 2022-01-17T14:20:12Z

datafusion/src/physical_optimizer/pruning.rs

        &self.scalar_expr
    }

+    fn scalar_expr_value(&self) -> Result<&ScalarValue> {


Suggested change

-    fn scalar_expr_value(&self) -> Result<&ScalarValue> {
+    fn scalar_expr_value(&self) -> Option<&ScalarValue> {

Would save a string creation on error (not that it really matters)
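The trade-off can be illustrated with a self-contained sketch. The `Scalar` and `Expr` types below are simplified stand-ins invented for this example, not DataFusion's `ScalarValue` and `Expr`, and the error message text is made up.

```rust
/// Simplified stand-in for a scalar value (not DataFusion's ScalarValue).
#[derive(Debug, PartialEq)]
enum Scalar {
    Int64(Option<i64>),
}

/// Simplified stand-in for an expression (not DataFusion's Expr).
enum Expr {
    Literal(Scalar),
    Column(String),
}

/// Result-returning variant: allocates an error String on the failure path.
fn scalar_value_result(expr: &Expr) -> Result<&Scalar, String> {
    match expr {
        Expr::Literal(v) => Ok(v),
        Expr::Column(name) => Err(format!("column {} is not a scalar literal", name)),
    }
}

/// Option-returning variant: the failure path allocates nothing.
fn scalar_value_option(expr: &Expr) -> Option<&Scalar> {
    match expr {
        Expr::Literal(v) => Some(v),
        Expr::Column(_) => None,
    }
}
```

If callers only check whether a literal is present and never report the message, the `Option` form expresses that intent without building a string.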

alamb · 2022-01-18T18:32:28Z

datafusion/src/physical_optimizer/pruning.rs

+
+    fn null_count_column_expr(&mut self) -> Result<Expr> {
+        let null_count_field = &Field::new(self.field.name(), DataType::Int64, false);
+        self.required_columns.null_count_column_expr(


alamb · 2022-01-18T18:33:00Z

datafusion/src/physical_optimizer/pruning.rs

+                {
+                    // column = null => null_count > 0
+                    let null_count_column_expr = expr_builder.null_count_column_expr()?;
+                    null_count_column_expr.gt(lit::<i64>(0))


I am curious why we use an i64 here rather than a u64?

Oh, you're right. This should be u64.

Changed to u64.

alamb · 2022-01-18T18:35:44Z

datafusion/src/physical_plan/file_format/parquet.rs

-        // because the null values propagate to the end result, making the predicate result undefined
-        assert_eq!(row_group_filter, vec![true, true]);
+        // First row group was filtered out because it contains no null value on "c2".
+        assert_eq!(row_group_filter, vec![false, true]);


I actually think this could be vec![false, false], as the predicate can never be true (int > 1 AND bool = NULL always evaluates to NULL or FALSE, never TRUE)
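A sketch of SQL's three-valued AND shows why the conjunction can never be TRUE when one operand is NULL. It assumes `Option<bool>` models a nullable boolean; `sql_and` is a hypothetical name.

```rust
/// SQL-style three-valued AND; `None` models NULL.
fn sql_and(a: Option<bool>, b: Option<bool>) -> Option<bool> {
    match (a, b) {
        // FALSE dominates: FALSE AND anything is FALSE.
        (Some(false), _) | (_, Some(false)) => Some(false),
        (Some(true), Some(true)) => Some(true),
        // Every remaining combination involves a NULL and no FALSE.
        _ => None,
    }
}
```

Since `bool = NULL` is always NULL, the right operand is always `None`, and `sql_and(x, None)` is either `Some(false)` or `None` for every `x`.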

I am not sure about the expression semantics in datafusion. In Spark, the predicate should be IsNull that checks the null value. Here I follow the original expression bool = NULL.

I see there is also an IsNull predicate expression, but I don't see IsNull handled in predicate pushdown. I don't know if this is intentional (i.e. using = to do null predicate pushdown) or a bug.

I can fix it if you agree that IsNull is the correct way to handle the null predicate here.

I think this is related to the "Confusion 1 and 2". I guess this is also why you feel confused about treating = specially.

In SQL, IS NULL is the correct way to test a column for null as well 👍

It would make a lot of sense to me to rewrite x IS NULL --> x_null_count > 0

Yeah, I was surprised when I looked at bool = NULL, and I was confused too. I guessed this was how DataFusion works, but it seems not :). Let me fix it together.

Would you like me to fix it here or in a following PR?

I've updated to use IsNull for predicate pruning.
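The IS NULL rewrite discussed in this thread can be sketched as follows. This is an illustration under stated assumptions, not the code in the PR: null counts are modeled as a slice of `Option<u64>` (one entry per row group, `None` when the statistic is missing), and `prune_is_null` is a hypothetical name.

```rust
/// For the predicate `x IS NULL`, decide per container whether it must be
/// kept, based on null-count statistics. `None` means the statistic is
/// missing for that container.
fn prune_is_null(null_counts: &[Option<u64>]) -> Vec<bool> {
    null_counts
        .iter()
        .map(|count| match count {
            // Known count: keep only if at least one null exists.
            Some(n) => *n > 0,
            // Unknown count: keep, because pruning would be unsafe.
            None => true,
        })
        .collect()
}
```

For example, statistics of `[Some(0), Some(3), None]` give `[false, true, true]`: only the row group known to contain no nulls is pruned.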

alamb · 2022-01-18T18:37:35Z

datafusion/src/physical_optimizer/pruning.rs

+
+    /// return the number of null values for the named column.
+    /// Note: the returned array must contain `num_containers()` rows.
+    fn null_counts(&self, column: &Column) -> Option<ArrayRef>;


Suggested change

-    /// return the number of null values for the named column.
-    /// Note: the returned array must contain `num_containers()` rows.
-    fn null_counts(&self, column: &Column) -> Option<ArrayRef>;
+    /// return the number of null values for the named column as an
+    /// `Option<Int64Array>`.
+    ///
+    /// Note: the returned array must contain `num_containers()` rows.
+    fn null_counts(&self, column: &Column) -> Option<ArrayRef>;

I had to look this up to figure out what type was required
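As a rough illustration of the contract described in the doc comment, here is a simplified stand-in trait. It is not the DataFusion trait: it uses `Vec<Option<u64>>` instead of an Arrow `ArrayRef` and a `&str` instead of `Column`, and the names `NullCountStatistics` and `RowGroupStats` are invented for this sketch.

```rust
/// Simplified stand-in for a statistics provider (not the DataFusion trait).
trait NullCountStatistics {
    /// Number of containers (e.g. row groups) described by the statistics.
    fn num_containers(&self) -> usize;

    /// Null count per container for `column`, or `None` if no null-count
    /// statistics exist for that column. When `Some`, the vector must
    /// have exactly `num_containers()` entries.
    fn null_counts(&self, column: &str) -> Option<Vec<Option<u64>>>;
}

/// Hypothetical statistics holder that only knows about column "c2".
struct RowGroupStats {
    c2_null_counts: Vec<Option<u64>>,
}

impl NullCountStatistics for RowGroupStats {
    fn num_containers(&self) -> usize {
        self.c2_null_counts.len()
    }

    fn null_counts(&self, column: &str) -> Option<Vec<Option<u64>>> {
        if column == "c2" {
            Some(self.c2_null_counts.clone())
        } else {
            None
        }
    }
}
```

Stating the element type in the signature, as this sketch does, removes the lookup mentioned above; with a type-erased array the doc comment has to carry that information instead.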

alamb

Thanks @viirya -- looks good to me. @houqp is this what you had in mind for #1591 ?

houqp · 2022-01-21T07:19:29Z

Thank you @viirya for the fix and @alamb for the detailed review 👍

viirya · 2022-01-21T07:26:43Z

Thank you @houqp @alamb !

alamb · 2022-01-21T11:59:20Z

Thanks @houqp !

Fix null comparison for Parquet pruning predicate

49e38f5

github-actions bot added the datafusion label Jan 17, 2022

Fix clippy

416806f

houqp approved these changes Jan 17, 2022

houqp added the enhancement New feature or request label Jan 17, 2022

alamb approved these changes Jan 18, 2022

viirya added 2 commits January 18, 2022 23:19

Use u64

c9718cf

Address comments

eaedebb

viirya force-pushed the issue_1591 branch from 6044589 to 7501c18 Compare January 19, 2022 17:42

Use IsNull for null count predicate pruning

bc6b9b5

viirya force-pushed the issue_1591 branch from 7501c18 to bc6b9b5 Compare January 19, 2022 18:07

alamb approved these changes Jan 19, 2022

houqp added the performance Make DataFusion faster label Jan 21, 2022

houqp merged commit 03075d5 into apache:master Jan 21, 2022

alamb mentioned this pull request Aug 5, 2022

Error pruning IsNull expressions: Column 'instance_null_count' is declared as non-nullable but contains null values #3042

Closed
