Fixes 3 bugs during serialization and deserialization of physical plans#16858
Fixes 3 bugs during serialization and deserialization of physical plans#16858alamb merged 1 commit intoapache:mainfrom
Conversation
| bool distinct = 3; | ||
| bool ignore_nulls = 6; | ||
| optional bytes fun_definition = 7; | ||
| string human_display = 8; |
There was a problem hiding this comment.
Fix for bug 1 that does not serialize human_display for aggregate function
| optional bytes fun_definition = 3; | ||
| datafusion_common.ArrowType return_type = 4; | ||
| bool nullable = 5; | ||
| string return_field_name = 6; |
There was a problem hiding this comment.
Fix for bug 3 that does not serialize ScarlarFunction name
| // Only 4 queries pass: q3, q5, q10, q12 | ||
| // Ignore the test until the bug is fixed | ||
| #[ignore] | ||
| // Bugs: https://github.com/apache/datafusion/issues/16772 |
There was a problem hiding this comment.
Turn on all 22 tpc-h queries
| } | ||
| } else if group_expr.is_empty() { | ||
| // No GROUP BY clause - create empty PhysicalGroupBy | ||
| Ok(PhysicalGroupBy::new(vec![], vec![], vec![])) |
There was a problem hiding this comment.
One of the places that fixes bug 2 that should return nothing for non-groupby
There was a problem hiding this comment.
I wonder if we should look into changing group_expr: Option<&[Expr]> or something so the compiler makes it more likely that we handle the difference between an emptyVecvsNone`
There was a problem hiding this comment.
@alamb : I had a look. This would be very challenging and a lot of changes needed in physical plan, logical plan, parsing logic, serialization/deserialization. I do not think we want to make a big change just for this
There was a problem hiding this comment.
Maybe something we can ask an AI agent to try in a follow on PR
There was a problem hiding this comment.
I did, and those were the results I got. It even flagged a semantic error due to not distinguishing between no group and an empty one. Coding without AI these days feels almost impossible. 😄
| Field::new( | ||
| &e.return_field_name, | ||
| convert_required!(e.return_type)?, | ||
| true, |
There was a problem hiding this comment.
I wonder if this field could ever be non nullable, for example for udfs defined by DF users, then the deserialization would result in a mismatch (not in the scope of this fixes though)
There was a problem hiding this comment.
The change here is to use e.return_field_name in deserialization. So whatever name of the UDF we serialize at line 356 of the file to_proto.rs in this PR, it will be deserialized here; and it will match. Does this answer your concern?
There was a problem hiding this comment.
I was refering to the nullability of the return Field, that is always being set to true (even before this PR) i was wondering if this could be an issue later if the output of the scalar function could contain nulls, but is unrelated to the changes & fixes of this PR, sorry for the confusion 😄
LiaCastaneda
left a comment
There was a problem hiding this comment.
These are nice finds, thanks! 🙇♀️
|
@alamb : When you have a moment, could you take a quick look? Thanks! |
alamb
left a comment
There was a problem hiding this comment.
This looks like a good change to me -- thank you @NGA-TRAN and @LiaCastaneda
| } | ||
| } else if group_expr.is_empty() { | ||
| // No GROUP BY clause - create empty PhysicalGroupBy | ||
| Ok(PhysicalGroupBy::new(vec![], vec![], vec![])) |
There was a problem hiding this comment.
I wonder if we should look into changing group_expr: Option<&[Expr]> or something so the compiler makes it more likely that we handle the difference between an emptyVecvsNone`
|
Thanks again everyone! |
Which issue does this PR close?
Rationale for this change
Serialization and deserialization of physical plans are currently incomplete. This PR introduces the necessary fixes to ensure all TPC-H queries pass.
What changes are included in this PR?
Summary of Changes:
Are these changes tested?
Yes
Are there any user-facing changes?
The explain for the roundtrip query is now more accurate