-
Notifications
You must be signed in to change notification settings - Fork 1.1k
[thrift-remodel] Reduce use of parquet::format in the public API
#8048
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
I plan to review this later today |
parquet::format in the public APIparquet::format in the public API
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like an improvement to me -- thank you @etseidl
My main concern with this type of chainge is it will basically be picking up conflicts non stop until we open main for breaking API changes
I think we can proceed like this, but I do worry about eventually bringing everything back together.
|
|
||
| use crate::errors::{ParquetError, Result}; | ||
|
|
||
| // Re-export crate::format types used in this module |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the main API change in this PR, right? No more directly exporting the thrift definitions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, as well as replacing uses of format structures with the new ones defined here.
parquet/src/basic.rs
Outdated
| // ---------------------------------------------------------------------- | ||
| // parquet::BloomFilterHash <=> BloomFilterHash conversion | ||
|
|
||
| impl From<parquet::BloomFilterHash> for BloomFilterHash { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the idea that these are transition mappings from the thrift structures --> the native Rust structures?
I am hoping that the end stage will be a single BloomFilterHash enum in the rust parquet crate that is serialized/deserialized directly to thrift by your new fancy encoder/decoder. Is this your vision too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is a stepping stone. Eventually there will be a single BloomFilterHash (or any other thrift based struct/enum) that is populated directly from a reader much like TCompactSliceInputProtocol. Once this merges the next PR which is almost ready to go converts a number of thrift enums and unions to the new parser. This step will need a lot of feedback/help. I have something that works, but there are warts.
parquet/src/basic.rs
Outdated
| } | ||
| } | ||
|
|
||
| impl From<BloomFilterHash> for parquet::BloomFilterHash { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW this is not anything you did, but I found the use if use crate::format as parquet very confusing -- given how many things are called parquet
I would personally suggest we use crate::format explicitly everywhere for clarity
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can do that...I was debating it and got lazy reasoning that format will eventually go away and with it the confusion. But it would make it easier to see what's going on in the interim. I've already tried to get rid of use crate::format so that anything using format will be immediately apparent.
|
I also updated the subject and description of this PR to make it clear it targets a feature branch |
Thanks for taking a look @alamb. Yes, with a big change like this conflicts are inevitable. I'll try to update the branch frequently to keep code rot to a minimum. We'll see if any conflicts arise that can't easily be dealt with. In the end, the purpose of this exercise is to hone the design while avoid breaking changes in main. If need be when the time comes to release this I don't mind re-implementing from scratch at a more measured pace. Hopefully the month window for breaking changes would be enough time to get things merged. |
parquet::format in the public APIparquet::format in the public API
|
On to the next step. |
Note: this targets a feature branch, not main
Which issue does this PR close?
Rationale for this change
This is the first step towards a rework of Parquet metadata handling. This PR attempts to remove as many references as possible to the
parquet::formatmodule in the public API. This is done by creating new enums and structs that mirror theirformatcounterparts and using them in publicly exposed structures likeFileMetaData.What changes are included in this PR?
Are these changes tested?
Current tests should suffice for now. More thorough tests will be added as needed.
Are there any user-facing changes?
Yes, public facing interfaces should no longer expose
format