Skip to content

Conversation

@etseidl
Copy link
Contributor

@etseidl etseidl commented Aug 5, 2025

Note: this targets a feature branch, not main

Which issue does this PR close?

Rationale for this change

This is the first step towards a rework of Parquet metadata handling. This PR attempts to remove as many references as possible to the parquet::format module in the public API. This is done by creating new enums and structs that mirror their format counterparts and using them in publicly exposed structures like FileMetaData.

What changes are included in this PR?

Are these changes tested?

Current tests should suffice for now. More thorough tests will be added as needed.

Are there any user-facing changes?

Yes, public facing interfaces should no longer expose format

@github-actions github-actions bot added the parquet Changes to the parquet crate label Aug 5, 2025
@etseidl etseidl added the api-change Changes to the arrow API label Aug 5, 2025
@alamb
Copy link
Contributor

alamb commented Aug 6, 2025

I plan to review this later today

@alamb alamb changed the title Reduce use of parquet::format in the public API [thirft-remodel] Reduce use of parquet::format in the public API Aug 6, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like an improvement to me -- thank you @etseidl

My main concern with this type of chainge is it will basically be picking up conflicts non stop until we open main for breaking API changes

I think we can proceed like this, but I do worry about eventually bringing everything back together.


use crate::errors::{ParquetError, Result};

// Re-export crate::format types used in this module
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the main API change in this PR, right? No more directly exporting the thrift definitions

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, as well as replacing uses of format structures with the new ones defined here.

// ----------------------------------------------------------------------
// parquet::BloomFilterHash <=> BloomFilterHash conversion

impl From<parquet::BloomFilterHash> for BloomFilterHash {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the idea that these are transition mappings from the thrift structures --> the native Rust structures?

I am hoping that the end stage will be a single BloomFilterHash enum in the rust parquet crate that is serialized/deserialized directly to thrift by your new fancy encoder/decoder. Is this your vision too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is a stepping stone. Eventually there will be a single BloomFilterHash (or any other thrift based struct/enum) that is populated directly from a reader much like TCompactSliceInputProtocol. Once this merges the next PR which is almost ready to go converts a number of thrift enums and unions to the new parser. This step will need a lot of feedback/help. I have something that works, but there are warts.

}
}

impl From<BloomFilterHash> for parquet::BloomFilterHash {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW this is not anything you did, but I found the use if use crate::format as parquet very confusing -- given how many things are called parquet

I would personally suggest we use crate::format explicitly everywhere for clarity

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can do that...I was debating it and got lazy reasoning that format will eventually go away and with it the confusion. But it would make it easier to see what's going on in the interim. I've already tried to get rid of use crate::format so that anything using format will be immediately apparent.

@alamb
Copy link
Contributor

alamb commented Aug 6, 2025

I also updated the subject and description of this PR to make it clear it targets a feature branch

@etseidl
Copy link
Contributor Author

etseidl commented Aug 6, 2025

My main concern with this type of chainge is it will basically be picking up conflicts non stop until we open main for breaking API changes

Thanks for taking a look @alamb. Yes, with a big change like this conflicts are inevitable. I'll try to update the branch frequently to keep code rot to a minimum. We'll see if any conflicts arise that can't easily be dealt with. In the end, the purpose of this exercise is to hone the design while avoid breaking changes in main. If need be when the time comes to release this I don't mind re-implementing from scratch at a more measured pace. Hopefully the month window for breaking changes would be enough time to get things merged.

@etseidl etseidl changed the title [thirft-remodel] Reduce use of parquet::format in the public API [thrift-remodel] Reduce use of parquet::format in the public API Aug 6, 2025
@etseidl
Copy link
Contributor Author

etseidl commented Aug 6, 2025

On to the next step.

@etseidl etseidl merged commit 6d6f4e3 into apache:gh5854_thrift_remodel Aug 6, 2025
16 checks passed
@etseidl etseidl deleted the no_format branch October 10, 2025 14:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api-change Changes to the arrow API parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants