Prune columns are all null in ParquetExec by row_counts , handle IS NOT NULL#9989
Prune columns are all null in ParquetExec by row_counts , handle IS NOT NULL#9989alamb merged 7 commits intoapache:mainfrom
Conversation
| column_statistics: Vec<ParquetStatistics>, | ||
| ) -> RowGroupMetaData { | ||
| let mut columns = vec![]; | ||
| let number_row = 1000; |
There was a problem hiding this comment.
Before all unit test set each col with default 0 row, will trigger num_rows == num_nulls
There was a problem hiding this comment.
I was wondering if this could cause problems in real files (for example, if the row counts were not included in the statistics that were written into the file).
However, I double checked the code and it seems like ColumnChunkMetaData::num_values is non nullable so I think we are good.
waynexia
left a comment
There was a problem hiding this comment.
Looks good to me 👍 Thanks @Ted-Jiang
| None | ||
| fn row_counts(&self, column: &Column) -> Option<ArrayRef> { | ||
| let (c, _) = self.column(&column.name)?; | ||
| let scalar = ScalarValue::UInt64(Some(c.num_values() as u64)); |
There was a problem hiding this comment.
Another implementor returns an Int64 array. I'm not sure if the type matters?
There was a problem hiding this comment.
Try to keep with https://github.com/apache/arrow-datafusion/blob/18c9f4d7e68b3eb470b2b2ef3f2297e018dd4298/datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs#L340-L342 avoid cast during = , i think Int64Array only use in test
There was a problem hiding this comment.
UInt64 also matches the documentation https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/trait.PruningStatistics.html#tymethod.row_counts 👍
I made a PR to fix the tests: #10004
Co-authored-by: Ruihang Xia <[email protected]>
|
Thanks @Ted-Jiang -- this looks awesome -- I look forward to reviewing it later today |
alamb
left a comment
There was a problem hiding this comment.
Thanks @Ted-Jiang -- and @waynexia . I think the code looks good, though I think a few more tests would improve the coverage we could add them as a follow on PR.
Do you plan to add support in page_filter.rs as well (maybe that is why the PR is marked "Part #9961 ")?
| column_statistics: Vec<ParquetStatistics>, | ||
| ) -> RowGroupMetaData { | ||
| let mut columns = vec![]; | ||
| let number_row = 1000; |
There was a problem hiding this comment.
I was wondering if this could cause problems in real files (for example, if the row counts were not included in the statistics that were written into the file).
However, I double checked the code and it seems like ColumnChunkMetaData::num_values is non nullable so I think we are good.
| // After pruning, only row group 1,3 should be selected | ||
| RowGroupPruningTest::new() | ||
| .with_scenario(Scenario::AllNullValues) | ||
| .with_query("SELECT * FROM t WHERE \"i8\" is Null") |
There was a problem hiding this comment.
Could you also add a tests:
i16 IS NOT NULL(to cover the opposite)i32 > 7(prune via nulls and some via counts)
| None | ||
| fn row_counts(&self, column: &Column) -> Option<ArrayRef> { | ||
| let (c, _) = self.column(&column.name)?; | ||
| let scalar = ScalarValue::UInt64(Some(c.num_values() as u64)); |
There was a problem hiding this comment.
UInt64 also matches the documentation https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/trait.PruningStatistics.html#tymethod.row_counts 👍
I made a PR to fix the tests: #10004
alamb
left a comment
There was a problem hiding this comment.
Thanks @Ted-Jiang -- the support for IS NOT NULL is also quite neat.
@alamb sorry there is a bug here i will fix this it should |
…atistics
Which issue does this PR close?
Part #9961 .
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?