Fix null comparison for Parquet pruning predicate#1595
Conversation
|
Thank you @viirya -- I will try to review this carefully, but likely won't be able to do so until tomorrow |
|
Thank you @alamb |
alamb
left a comment
There was a problem hiding this comment.
BLUF: I am fairly sure this change is ok, but I am not sure why it is needed; I have outlined my confusions below.
Note I ran this change against the IOx test suite in https://github.com/influxdata/influxdb_iox/pull/3479 and it was good.
Confusion 1: Doesn't follow definition of a pruning predicate:
As a reminder, the pruning predicate definition is
/// A pruning predicate is one that has been rewritten in terms of
/// the min and max values of column references and that evaluates
/// to FALSE if the filter predicate would evaluate FALSE *for
/// every row* whose values fell within the min / max ranges (aka
/// could be pruned).
///
/// The pruning predicate evaluates to TRUE or NULL
/// if the filter predicate *might* evaluate to TRUE for at least
/// one row whose vaules fell within the min/max ranges (in other
/// words they might pass the predicate)
Thus, a "TRUE" or "NULL" result for a predicate means the row group must be kept. This is the safe behavior -- only if it is 100% certain that the predicate will evaluate to FALSE should the row group be removed
In this case, x = null doesn't seem to satisfy the stated conditions in PruningPredicate. x = null evaluates to null for all values (both null and non null) as can be seen in postgres:
alamb=# select x, x = null from foo;
x | ?column?
---+----------
1 |
|
2 |
(3 rows)
alamb=#
Thus there is something wrong. Either:
- the pruning predicate definition should be updated to say that a pruning predicate will return false if all rows will evaluate to FALSE OR NULL (which seems reasonable as only rows that evaluate to
TRUEpass a predicate, not row that returnnull) - this is not a correct transformation
Confusion 2: Why are we treating = specially?
If we go with this PR, I don't see any reason to handle = specially, as same argument applies to other operators such as !=, >, etc (though it does not apply to IS DISTINCT / IS NOT DISTINCT).
| &self.scalar_expr | ||
| } | ||
|
|
||
| fn scalar_expr_value(&self) -> Result<&ScalarValue> { |
There was a problem hiding this comment.
| fn scalar_expr_value(&self) -> Result<&ScalarValue> { | |
| fn scalar_expr_value(&self) -> Option<&ScalarValue> { |
Would save a string creation on error (not that it really matters)
|
|
||
| fn null_count_column_expr(&mut self) -> Result<Expr> { | ||
| let null_count_field = &Field::new(self.field.name(), DataType::Int64, false); | ||
| self.required_columns.null_count_column_expr( |
| { | ||
| // column = null => null_count > 0 | ||
| let null_count_column_expr = expr_builder.null_count_column_expr()?; | ||
| null_count_column_expr.gt(lit::<i64>(0)) |
There was a problem hiding this comment.
I am curious why we use a i64 here rather than u64?
There was a problem hiding this comment.
Oh, you're right. This should be u64.
| // because the null values propagate to the end result, making the predicate result undefined | ||
| assert_eq!(row_group_filter, vec![true, true]); | ||
| // First row group was filtered out because it contains no null value on "c2". | ||
| assert_eq!(row_group_filter, vec![false, true]); |
There was a problem hiding this comment.
I actually think this could be vec![false, false] as the predicate can never be true (int > 1 AND bool = NULL is always NULL)
There was a problem hiding this comment.
I am not sure about the expression semantics in datafusion. In Spark, the predicate should be IsNull that checks the null value. Here I follow the original expression bool = NULL.
I see there is also IsNull predicate expression, but I don't see IsNull is handled in predicate pushdown. I don't know if this is intentional (i.e. using = to do null predicate pushdown) or a bug.
I can fix it if you agree that IsNull is correct way to handle null predicate here.
There was a problem hiding this comment.
I think this is related to the "Confusion 1 and 2". I guess this is also why you feel confused about treating = specially.
There was a problem hiding this comment.
In sql IsNull is the correct way to test a column for null as well 👍
It would make a lot of sense to me to rewrite x IS NULL --> 0 > x_null_count
There was a problem hiding this comment.
yea, I'm surprised when I looked at the bool = NULL and confused too. I guess this is how datafusion works but seems not :). Let me fix it together.
There was a problem hiding this comment.
Would you like me to fix it here or in a following PR?
There was a problem hiding this comment.
I've updated to use IsNull for predicate pruning.
|
|
||
| /// return the number of null values for the named column. | ||
| /// Note: the returned array must contain `num_containers()` rows. | ||
| fn null_counts(&self, column: &Column) -> Option<ArrayRef>; |
There was a problem hiding this comment.
| /// return the number of null values for the named column. | |
| /// Note: the returned array must contain `num_containers()` rows. | |
| fn null_counts(&self, column: &Column) -> Option<ArrayRef>; | |
| /// return the number of null values for the named column as an | |
| /// `Option<Int64Array>`. | |
| /// | |
| /// Note: the returned array must contain `num_containers()` rows. | |
| fn null_counts(&self, column: &Column) -> Option<ArrayRef>; |
I had to look this up to figure out what type this was required
|
Thanks @houqp ! |
Which issue does this PR close?
Closes #1591.
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?