Add options to control various aspects of Parquet metadata decoding#8763
Add options to control various aspects of Parquet metadata decoding#8763etseidl merged 23 commits intoapache:mainfrom
Conversation
|
Here's an excerpt from a run of the new benchmark that shows the schema is actually skipped. This should get even faster with the metadata index (#8714) |
| .with_column_index_policy(self.column_index) | ||
| .with_metadata_options(self.metadata_options.clone()); |
There was a problem hiding this comment.
At some point I could see moving the page index policy into the MetadataOptions and then deprecating a bunch of setters.
parquet/src/file/metadata/parser.rs
Outdated
| // the credentials and keys needed to decrypt metadata | ||
| file_decryption_properties: Option<Arc<FileDecryptionProperties>>, | ||
| // metadata parsing options | ||
| metadata_options: Option<MetadataOptions>, |
There was a problem hiding this comment.
Wondering if this should be Option<Arc<MetadataOptions>> everywhere.
|
This may help with #5999 |
| /// [`ParquetMetaDataPushDecoder`]: crate::file::metadata::ParquetMetaDataPushDecoder | ||
| #[derive(Default, Debug, Clone)] | ||
| pub struct MetadataOptions { | ||
| schema_descr: Option<SchemaDescPtr>, |
There was a problem hiding this comment.
Does this means (1) User provided schema or (2) only (min, max, etc) columns in schema_descr be decoded?
There was a problem hiding this comment.
It's (1). Say you have a large number of files that share the same schema, there's no need to decode them all. Just grab the schema from the first file and use it for all the others.
There was a problem hiding this comment.
Here is a ticket that explains the use case a bit more;
alamb
left a comment
There was a problem hiding this comment.
this API looks good to me (and actually closes an existing ticket)
| /// [`ParquetMetaDataPushDecoder`]: crate::file::metadata::ParquetMetaDataPushDecoder | ||
| #[derive(Default, Debug, Clone)] | ||
| pub struct MetadataOptions { | ||
| schema_descr: Option<SchemaDescPtr>, |
There was a problem hiding this comment.
Here is a ticket that explains the use case a bit more;
|
(I didn't approve it b/c it is still marked as a draft) |
Thanks @alamb. I'm still messing around with the API. I think I like hiding the new options object in the file reader APIs, and just exposing it for the metadata readers. The last wrinkle is figuring out a good way to share across the |
|
Ok, I think this is ready now. Right now I'm mildly against pulling in the page index policies. They are used at a higher level and I don't think it's worth the thrash to move them. Instead I want to focus on options that impact the |
| supplied_schema: Option<SchemaRef>, | ||
| /// Policy for reading offset and column indexes. | ||
| pub(crate) page_index_policy: PageIndexPolicy, | ||
| /// Options to control reading of Parquet metadata |
There was a problem hiding this comment.
I reviewed the ArrowReaderOptions and ArrowReaderMetadata structures and their use, and I agree this is the appropriate structure to add metadata parsing to.
Do you think it eventually makes sense to move the other fields from ArrowReaderOptions to ParquetMetaDataOptions? (e.g. supplied_schema)
There was a problem hiding this comment.
I was thinking perhaps the page_index_policy, but the other things in ArrowReaderOptions are more Arrow specific rather than Parquet. That might get confusing.
| /// [Parquet Spec]: https://github.com/apache/parquet-format#metadata | ||
| pub fn decode_metadata(buf: &[u8]) -> Result<ParquetMetaData> { | ||
| decode_metadata(buf) | ||
| decode_metadata(buf, None) |
There was a problem hiding this comment.
I wonder if we should start directing people to the push metadata decoder (the metadata reader is getting pretty complicated...)
There was a problem hiding this comment.
Yes, that would be nice. Maintaining two public APIs that do pretty much the same thing is a bit too much.
Co-authored-by: Andrew Lamb <[email protected]>
…erMetadata` (#8798) # Which issue does this PR close? - Related to #8763 - Related to #7973 # Rationale for this change While reviewing and hunting around for structs I got confused about the difference between `ArrowReaderOptions` and `ArrowReaderMetadata` # What changes are included in this PR? Add some doc strings to try and make it easier for people to navigate the documentation # Are these changes tested? by CI # Are there any user-facing changes? New documentation. No functional changes
|
If there are no objections I'll merge this tomorrow. A follow-on is already in the works to deal with the page encoding statistics (#8797), with other statistics objects to be handled after that. I've also got some speed ups for the thrift skipping code on tap (replace the current version that uses recursion with one that maintains state on the stack). |
Which issue does this PR close?
SchemaDescriptorPtracrossParquetMetadataobjects #5999Rationale for this change
This is a first attempt at an object to help control the parsing of the Parquet metadata.
What changes are included in this PR?
Adds a new
MetadataOptionsstruct, and plumbs it down into the Thrift decoder code. The only option for now is to pass in a schema, which then causes the decoder to skip decoding the schema contained in the footer.Also adds to the metadata bench to demonstrate the time savings from reusing the schema.
Are these changes tested?
Yes, adds a new test.
Are there any user-facing changes?
If there are user-facing changes then we may require documentation to be updated before approving the PR.
If there are any breaking changes to public APIs, please call them out.