Refactor: add FileGroup structure for Vec<PartitionedFile>#15379
Refactor: add FileGroup structure for Vec<PartitionedFile>#15379jayzhan211 merged 5 commits intoapache:mainfrom
FileGroup structure for Vec<PartitionedFile>#15379Conversation
| /// The files in this group | ||
| pub files: Vec<PartitionedFile>, | ||
| /// Optional statistics for all files in the group | ||
| pub statistics: Option<Statistics>, |
There was a problem hiding this comment.
We can compute the statistics in the method of ListingTable
datafusion/datafusion/core/src/datasource/listing/table.rs
Lines 1086 to 1091 in 5210a2b
There was a problem hiding this comment.
I recommend:
- Remove the
pubfrom the fields (so we can potentially change the representation later) - Add a
into_inner()method that returns the inner Vec
cda1c47 to
1dc5153
Compare
1dc5153 to
7df36ed
Compare
| ) -> Result<(Vec<PartitionedFile>, Statistics)> { | ||
| let mut result_files = vec![]; | ||
| ) -> Result<(FileGroup, Statistics)> { | ||
| let mut result_files = FileGroup::default(); |
There was a problem hiding this comment.
What are the difference between the statistics in FileGroup and the returned value here? Do we need statistics in FileGroup, if so, could we add the returned Statistics inside FileGroup?
There was a problem hiding this comment.
Can we return result_files.with_statistics(statistics)? I don't see any other place where with_statistics is called. The inner optional statistics in FileGroup is a bit confusing to me.
There was a problem hiding this comment.
@jayzhan211 , yeah, it's a todo task, I'll compute the FileGroup's statistic in the method and use the with_statistics method to set statistics for FileGroup later.
alamb
left a comment
There was a problem hiding this comment.
Thank you @xudong963
I think encapsulating Vec<PartitionedFile> into FileGroup is a very nice idea and long in coming. However I do think it is an API change so I have maked this PR with that tag.
I also added this to the list of API changes that should have entries in the upgrade guide on
| const CONCURRENCY_LIMIT: usize = 100; | ||
|
|
||
| /// Partition the list of files into `n` groups | ||
| pub fn split_files( |
There was a problem hiding this comment.
this is an API change -- can you perhaps leave the function here and mark it #deprecated per https://datafusion.apache.org/contributor-guide/api-health.html#deprecation-guidelines
| /// The files in this group | ||
| pub files: Vec<PartitionedFile>, | ||
| /// Optional statistics for all files in the group | ||
| pub statistics: Option<Statistics>, |
There was a problem hiding this comment.
I recommend:
- Remove the
pubfrom the fields (so we can potentially change the representation later) - Add a
into_inner()method that returns the inner Vec
| } | ||
|
|
||
| /// Partition the list of files into `n` groups | ||
| pub fn split_files(&mut self, n: usize) -> Vec<FileGroup> { |
There was a problem hiding this comment.
Rather than taking &mut self here, I think this would be easier to use if it took mut self
As written it will leave self empty which is not super obvious. If it took mut self then it would be clear that the FileGroup is consumed
There was a problem hiding this comment.
Yes, makes a lot sense
| let files = iter.into_iter().collect(); | ||
| FileGroup::new(files) | ||
| } | ||
| } |
There was a problem hiding this comment.
I also recommend a From impl for the vec
| } | |
| } | |
| impl From<Vec<PartitionedFile>> for FileGroup { | |
| ... | |
| } |
There was a problem hiding this comment.
Yeah, added in 7415f7c
Now I have three ways to create FileGroup by Vec<PartitionedFile>.
let files = vec![...];
let file_group = FileGroup::new(files);
let file_group = files.into();
let file_group = FileGroup::from(files);I don't strongly prefer which method to use, but the second is simpler!
| } | ||
|
|
||
| /// Partition the list of files into `n` groups | ||
| pub fn split_files(&mut self, n: usize) -> Vec<FileGroup> { |
There was a problem hiding this comment.
Rather than taking &mut self here, I think this would be easier to use if it took mut self
As written it will leave self empty which is not super obvious. If it took mut self then it would be clear that the FileGroup is consumed
FileGroup structureFileGroup structure for Vec<PartitionedFile>
Yes, I agree. Thank you for adding it. |
|
Thanks for your review. I'll merge it after @alamb has another look! |
|
|
||
| /// Partition the list of files into `n` groups | ||
| pub fn split_files(&mut self, n: usize) -> Vec<FileGroup> { | ||
| pub fn split_files(mut self, n: usize) -> Vec<FileGroup> { |
|
Thanks @xudong963 @alamb |
…e#15379) * Refactor: add FileGroup structure * update * address comments * address comments from alamb * fix clippy
…e#15379) * Refactor: add FileGroup structure * update * address comments * address comments from alamb * fix clippy
Which issue does this PR close?
Rationale for this change
FileGroupencapsulates both files and their statistics, enabling more effective query planning and optimization based on file group characteristics. Moving from rawVec<PartitionedFile>to a dedicated structure creates a proper abstraction that bundles related functionality together.FileGroup's specific API in the future, currently, one that comes into my mind is part of thesplit_groups_by_statisticsmethod, it make any two files in a file group are ordered.file_groupsclearer, frompub file_groups: Vec<Vec<PartitionedFile>>topub file_groups: Vec<FileGroup>What changes are included in this PR?
Introduce
FileGroupand implement the related APIs, replace allVec<PartitionedFile>withFileGroupAre these changes tested?
Yes
Are there any user-facing changes?