Use upstream StatisticsConverter from arrow-rs in DataFusion#11479
Use upstream StatisticsConverter from arrow-rs in DataFusion#11479alamb merged 1 commit intoapache:mainfrom
StatisticsConverter from arrow-rs in DataFusion#11479Conversation
| column: &'b Column, | ||
| ) -> Result<StatisticsConverter<'a>> { | ||
| StatisticsConverter::try_new(&column.name, self.arrow_schema, self.parquet_schema) | ||
| Ok(StatisticsConverter::try_new( |
There was a problem hiding this comment.
this is required to get the errors to convert
| }}; | ||
| } | ||
|
|
||
| // Copy from arrow-rs |
There was a problem hiding this comment.
This is needed because the PagesPruningStatistics is not in terms of the StatisticsConverter code yet. I will try and do that in a separate PR and avoid the need for this
ab2b7ff to
2756145
Compare
1eb29a4 to
00eef87
Compare
00eef87 to
409023e
Compare
| use datafusion::datasource::physical_plan::{FileScanConfig, ParquetExec}; | ||
| use datafusion::datasource::TableProvider; | ||
| use datafusion::execution::object_store::ObjectStoreUrl; | ||
| use datafusion::parquet::arrow::arrow_reader::statistics::StatisticsConverter; |
There was a problem hiding this comment.
code has been ported to arrow-rs (🙏 @efredine )
| fn min_values(&self, column: &Column) -> Option<ArrayRef> { | ||
| self.statistics_converter(column) | ||
| .and_then(|c| c.row_group_mins(self.metadata_iter())) | ||
| .and_then(|c| Ok(c.row_group_mins(self.metadata_iter())?)) |
There was a problem hiding this comment.
this is necessary to convert from ArrowError to DataFusionError
There was a problem hiding this comment.
can we use map_err? or into?
There was a problem hiding this comment.
Thank you for the suggestion.
I played around with a few alternatives and I concluded they were not easier to understand, so I plan to leave it as is. If you feel strongly I will make a follow on PR to change.
I couldn't figure out a way to use into()
Option 1: using map_err
fn min_values(&self, column: &Column) -> Option<ArrayRef> {
self.statistics_converter(column)
.and_then(|c| {
c.row_group_mins(self.metadata_iter())
.map_err(DataFusionError::from)
})
.ok()
}Option 2: discard error earlier with ok()
Now there are two nested ok()s
fn min_values(&self, column: &Column) -> Option<ArrayRef> {
self.statistics_converter(column)
.ok()
.and_then(|c| c.row_group_mins(self.metadata_iter()).ok())
}| @@ -1,287 +0,0 @@ | |||
| // Licensed to the Apache Software Foundation (ASF) under one | |||
|
@Ted-Jiang I wonder if you have a moment to review this PR? |
|
Thank you for the reviews @Ted-Jiang and @comphead |
Which issue does this PR close?
Closes #10922
Closes #11000
Rationale for this change
Now that @efredine has ported this code to arrow in apache/arrow-rs#6046 we can remove the copy in DataFusion
What changes are included in this PR?
Remove code from DataFusion and use the upstream arrow-rs version
Are these changes tested?
Are there any user-facing changes?