Conversation
1d78ac4 to
a2818bf
Compare
wiedld
left a comment
There was a problem hiding this comment.
If you are planning to eventually deprecate the non-builder ParquetMetaData::new*, then this would also need to be switched to the builder.
There was a problem hiding this comment.
I think the cleanup of this code (which is modifying the ParquetMetaData) is the best example of why having this API makes sense -- it makes one fewer copies and also I think is quite a bit clearer
parquet/src/file/metadata/mod.rs
Outdated
There was a problem hiding this comment.
these methods follow the exsiting convention of add and set as the other Builders in this crate sucha s
- https://docs.rs/parquet/latest/parquet/file/metadata/struct.RowGroupMetaDataBuilder.html
- https://docs.rs/parquet/latest/parquet/file/metadata/struct.ColumnChunkMetaDataBuilder.html
That is a good point -- I wasn't planing to deprecate the functions, though we could argume that deprecating |
|
Thank you for the review @wiedld |
| let mut offset_indexes = vec![]; | ||
|
|
||
| for rg in &mut filtered_row_groups { | ||
| for rg in metadata_builder.row_groups().iter() { |
There was a problem hiding this comment.
I think we can build the metadata here (with the filtered row groups), pass it into ParquetMetaDataReader and then load the page indexes into the metadata. Let me give that a try.
There was a problem hiding this comment.
if options.enable_page_index {
let mut reader = ParquetMetaDataReader::new_with_metadata(metadata_builder.build())
.with_page_indexes(options.enable_page_index);
reader.read_page_indexes(&chunk_reader)?;
metadata_builder = ParquetMetaDataBuilder::new_from_metadata(reader.finish()?);
}I forgot to do this in #6450.
There was a problem hiding this comment.
I don't quite follow why this is needed. What scenario does it help (I can write a test to cover it)
There was a problem hiding this comment.
I mean replace
if options.enable_page_index {
let mut columns_indexes = vec![];
let mut offset_indexes = vec![];
for rg in metadata_builder.row_groups().iter() {
let column_index = index_reader::read_columns_indexes(&chunk_reader, rg.columns())?;
let offset_index = index_reader::read_offset_indexes(&chunk_reader, rg.columns())?;
columns_indexes.push(column_index);
offset_indexes.push(offset_index);
}
metadata_builder = metadata_builder
.set_column_index(Some(columns_indexes))
.set_offset_index(Some(offset_indexes));
}with the above code snippet from my earlier comment. This should be a bit more efficient since read_page_indexes will fetch the necessary bytes from the file in a single read, rather than 2 reads per row group.
There was a problem hiding this comment.
I implemented something slightly different:
Since there is already a ParquetMetaDataReader created at the beginning of the function, I made the change to simply read it when needed.
One thing that might be different is that it looks like the current code may only read the column index/page index for row groups that passed the "predicates" but the ParquetMetadataReader reads the index for all the row groups.
That being said, no tests fail, so i am not sure if it is a real problem or not
There was a problem hiding this comment.
Hmm, this worries me a bit, since the column and offset indexes will have more row groups represented than are in the ParquetMetaData. The split path from before would only read the page indexes for the remaining row groups.
We could prune the page indexes at the same time we're pruning the row groups.
There was a problem hiding this comment.
We could prune the page indexes at the same time we're pruning the row groups.
Yeah, that is probably mirrors the intent the most closely. How about I'll back out c0432e6 and we can address improving this code as a follow on PR (along with tests)
I deprecated |
This reverts commit c0432e6.
Which issue does this PR close?
Closes #6465
Rationale for this change
At the moment it is
See #6465 for more rationale
What changes are included in this PR?
ParquetMetaDataBuilderParquetMetaDatato useParquetMetadataBuilder(which requires fewer clones) so might be marginally fasterAre there any user-facing changes?