Make unnest consistent with DuckDB/ClickHouse, add option for preserve_nulls, update docs#7088
Make unnest consistent with DuckDB/ClickHouse, add option for preserve_nulls, update docs#7088alamb wants to merge 5 commits intoapache:mainfrom
unnest consistent with DuckDB/ClickHouse, add option for preserve_nulls, update docs#7088Conversation
| } | ||
|
|
||
| #[tokio::test] | ||
| async fn unnest_column_nulls() -> Result<()> { |
There was a problem hiding this comment.
Here is a test that I think shows the difference between the two modes (or will once we implement it)
| /// Unnest a column that contains a nested list type, replicating | ||
| /// values in the other, non nested rows. | ||
| /// | ||
| /// Conceptually this operation is like joining each row with all the |
There was a problem hiding this comment.
Here is the proposed behavior
| /// ``` | ||
| pub fn unnest_column(self, column: &str) -> Result<DataFrame> { | ||
| /// [Unnest]: crate::logical_expr::Unnest | ||
| pub fn unnest_column(self, column: &str, preserve_nulls: bool) -> Result<DataFrame> { |
There was a problem hiding this comment.
I think this is different than what @vincev proposed in #7044 (comment)
Specifically, there are three ways to make unnest plans:
unnestfunctionDataframe::unnestLogicalPlanBuilder
It seemed inconsistent to add a preserve_nulls argument in only one place
An alternative would be to add new unest_with_options, DataFrame::unest_with_options and LogicalPlanBuilder::with_options and leave the existing signatures alone.
What do reviewers think?
There was a problem hiding this comment.
🤔 seeing the scope of the required changes to add a new parameter, I am thinking _with_options would be preferable
There was a problem hiding this comment.
we could just keep the overload but add the options struct instead of only the bool flag.
There was a problem hiding this comment.
_with_options looks nice to me too, DataFrame::unnest would call _with_options preserving nulls so that users will keep getting the same behavior as before that is consistent with other data frame libraries and SQL unnest would use the DuckDB/ClickHouse behavior.
There was a problem hiding this comment.
Ok, I'll give it a try
|
Looks great imo, would love to move the improvements form #7002 into a new branch on top of this + implement handling the flag |
|
Sorry I did not make progress on this today; I have it on my list for tomorrow (or anyone else feel free to make a PR using any of this PR they want) |
I can work on this tomorrow if you don't mind. The plan is to leave (and mark deprecated) the existing UnnestExec and create a new variant that handles options such as preserve_nulls. Then move all api calls to this new physical plan where sql uses perserve_nulls = false, and the dataframe api uses preserve_nulls = true. Right? |
|
BTW I am sorry for my delay -- unfortunately sorting out list / nested types is not as high a priority for us at InfluxData at this time, even though I think it is very important for DataFusion overall. Thus I have to work on this project in my spare time. I am definitely still planning to work on this PR, but if anyone else has time and would like to try and take it on that would also be great 🙏 I was thinking about making /// <docs about unnest behavior here>
struct UnestOptions {
pub preserve_nulls: ...
}I would suggest adding a new To keep the size of PRs down, I also would personally suggest adding in the API in one PR but implementing the new |
|
starting to make another proposal PR... |
Sorry for not picking this up. I'm not sure it makes sense to add the Options as a struct vs just a single preserve_nulls field in the logical/physical plan. Happy to drive the implementation and include the improvements mentioned here |
No worries -- sorry for the delay
Yeah I can see the argument for just a field. The nice thing about the struct was it gave me a place to put the documentation on that could be referenced as well as adding a place that any potential future options could be added. I don't have a strong preference
That would be great |
|
This was superceded by #7168 |
Which issue does this PR close?
Closes #7087
Closes #7044
Rationale for this change
This implements the prospoal from @jayzhan211 @izveigor @vincev and @smiklos from #7044 (comment)
What changes are included in this PR?
unnestconsistent with DuckDB/ClickHouse (do not preserve nulls),Are these changes tested?
yes
Are there any user-facing changes?
yes, API has changed