Update ListingTable to use StatisticsConverter, remove redundant statistics extraction code#10924
Closed
alamb wants to merge 5 commits intoapache:mainfrom
Closed
Update ListingTable to use StatisticsConverter, remove redundant statistics extraction code#10924alamb wants to merge 5 commits intoapache:mainfrom
alamb wants to merge 5 commits intoapache:mainfrom
Conversation
alamb
commented
Jun 14, 2024
| ParquetStatistics::Int64(s) if DataType::Int64 == *fields[i].data_type() => { | ||
| if let Some(max_value) = &mut max_values[i] { | ||
| max_value | ||
| .update_batch(&[Arc::new(Int64Array::from_value(*s.max(), 1))]) |
Contributor
Author
There was a problem hiding this comment.
Previously, even though update_batch is called, it first creates a single row array
| } | ||
| } | ||
| let file_field = file_schema.field(file_idx); | ||
| let Some(converter) = StatisticsConverter::try_new( |
Contributor
Author
There was a problem hiding this comment.
this code now uses the well tested StatisticsConverter to extract statistics from the parquet file with the correct type of array in a single call
| // Note in ASCII lower case is greater than upper case | ||
| assert_eq!( | ||
| c1_stats.max_value, | ||
| Precision::Exact(ScalarValue::from("bar")) |
Contributor
Author
There was a problem hiding this comment.
strings were previously not handled, but are now properly handled by StatisticsConverter
90a5ea4 to
3e6a537
Compare
alamb
commented
Jun 17, 2024
| } | ||
| } | ||
|
|
||
| /// Set the null count for the column |
Contributor
Author
There was a problem hiding this comment.
These functions make creating ColumnStatitistic more ergonomic
xinlifoobar
approved these changes
Jun 24, 2024
Contributor
Author
|
Thank you for the review @xinlifoobar
Update: let's work with #11068 (I didn't realize there was another PR) Thanks @xinlifoobar |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
Closes #10923
Rationale for this change
The code that extracted statistics for ListingTable with parquet files:
I also personally also found the previous code somewhat hard to follow
See #10923 for more details
What changes are included in this PR?
Update Parquet listing table to use the new
StatisticsConverterAPI to create DataFusion statisticsAre these changes tested?
Yes, they are covered by existing tests
Are there any user-facing changes?
ListingFiles will now have more accurate statistics