Infer count() aggregation is not null#11256
Conversation
f346b75 to
2cba7df
Compare
This comment was marked as resolved.
This comment was marked as resolved.
2cba7df to
37019a5
Compare
f9dadfe to
fe8fdef
Compare
|
Added unit and integration tests. Should be ready to review. |
fe8fdef to
d839797
Compare
d839797 to
0352eda
Compare
|
@alamb PTAL |
|
Thank you very much @findepi -- for both this PR and your other recent contributions. I will indeed review this PR, though I may not get to it until tomorrow or Friday |
|
I think there might be a simpler way to achieve this. We can declare the count aggregation expr as non-nullable like this, and then the datafusion/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs Lines 1450 to 1454 in 6e63748 |
|
@jonahgao thanks! |
0352eda to
ba27f93
Compare
| match func_def { | ||
| // TODO: min/max over non-nullable input should be recognized as non-nullable | ||
| AggregateFunctionDefinition::BuiltIn(fun) => fun.nullable(), | ||
| // TODO: UDF should be able to customize nullability |
There was a problem hiding this comment.
To address this TODO we could extend the pattern here
datafusion/datafusion/expr/src/aggregate_function.rs
Lines 84 to 88 in 4bc3228
ie now we have return_type(input_types, nullability) -> return_type
we could have return_type(input_types, nullability) -> (return_type, nullability)
this is an API for builtin functions, we would need similar API for UDAFs
datafusion/datafusion/expr/src/udaf.rs
Line 330 in 5f02c8a
Obviously, this would be a breaking change: both method signature and return type would change.
We can avoid the breaking change by making nullability a separate method, but it would be good to first determine the ideal end-state, and only then think how to get there.
ba27f93 to
24f4ff1
Compare
24f4ff1 to
53b34ae
Compare
`count([DISTINCT [expr]])` aggregate function never returns null. Infer
non-nullness of such aggregate expression. This allows elimination of
the HAVING filter for a query such as
SELECT ... count(*) AS c
FROM ...
GROUP BY ...
HAVING c IS NOT NULL
53b34ae to
1a7d541
Compare
|
thanks @jonahgao for your review! updated. |
* Fix test function name typo
* Improve formatting
* Infer count() aggregation is not null
`count([DISTINCT [expr]])` aggregate function never returns null. Infer
non-nullness of such aggregate expression. This allows elimination of
the HAVING filter for a query such as
SELECT ... count(*) AS c
FROM ...
GROUP BY ...
HAVING c IS NOT NULL
In apache#11256 it started inferring, for logical plans only, that count() aggregation does not return nulls. However, it did not introduce the counterpart inference for physical plans. Somehow, this discrepancy did not get detected by DF tests, but we run into it with, e.g. `sdf run -q "SELECT count(*) FROM (VALUES (3, 'c')) as T(c, x) GROUP BY c "` The fix here is ad hoc. We should keep an eye on apache#11274 where the goal seems to make UDAFs to communicate nullability of their result. If done right, that should have affect both logical and physical plans.
In apache#11256 it started inferring, for logical plans only, that count() aggregation does not return nulls. However, it did not introduce the counterpart inference for physical plans. Somehow, this discrepancy did not get detected by DF tests, but we run into it with, e.g. `sdf run -q "SELECT count(*) FROM (VALUES (3, 'c')) as T(c, x) GROUP BY c "` The fix here is ad hoc. We should keep an eye on apache#11274 where the goal seems to make UDAFs to communicate nullability of their result. If done right, that should have affect both logical and physical plans.
count([DISTINCT [expr]])aggregate function never returns null. Infer non-nullness of such aggregate expression. This allows elimination of the HAVING filter for a query such as