[Parquet] Provide only encrypted column stats in plaintext footer#8305
[Parquet] Provide only encrypted column stats in plaintext footer#8305alamb merged 27 commits intoapache:mainfrom
Conversation
241c532 to
a5a44f4
Compare
3a7de46 to
e2ebc45
Compare
03e52aa to
5459ffc
Compare
078b622 to
686b38f
Compare
adamreeve
left a comment
There was a problem hiding this comment.
Looks good thanks Rok, I think this can just be tidied up a bit to reduce some duplication
parquet/src/file/metadata/writer.rs
Outdated
| let is_footer_encrypted = file_encryptor.properties().encrypt_footer(); | ||
|
|
||
| if !is_footer_encrypted { | ||
| // Temporarily clear crypto_metadata so statistics get included in encrypted blob |
There was a problem hiding this comment.
We don't need this step if we check for encrypted_column_metadata instead of crypto_metadata in serialize_column_meta_data (https://github.com/apache/arrow-rs/pull/8305/files#r2543948275)
|
Thanks for the review @adamreeve and nice test suggestions, more explicitly presented cases help reason about this. |
adamreeve
left a comment
There was a problem hiding this comment.
Nice, thanks Rok! Just one minor comment
|
@etseidl would you have time to give this a quick look before we merge? |
etseidl
left a comment
There was a problem hiding this comment.
Thanks @rok and @adamreeve, this looks good to me. Just one question (because I'm too lazy to read the spec >.<).
| if let Some(page_encoding_stats) = column_chunk.page_encoding_stats() { | ||
| last_field_id = page_encoding_stats.write_thrift_field(w, 13, last_field_id)?; | ||
|
|
||
| if should_write_column_stats(column_chunk) { |
There was a problem hiding this comment.
Is the idea here to only write the required bits of the column meta_data so readers won't complain? If so, then should all of the fields below also be skipped?
There was a problem hiding this comment.
Good point, I agree all the below fields should be skipped too.
I checked what the C++ implementation does and it's actually a bit different between the plaintext footer case and when the footer is encrypted but the column is encrypted with a different key: https://github.com/apache/arrow/blob/cbd36b817fc77812f8df1a15bf24314de3b27f29/cpp/src/parquet/metadata.cc#L1748-L1755.
When there's a plaintext footer, only the statistics and encoding_stats are stripped out of the unencrypted metadata, similar to what we're proposing here. But when the footer is also encrypted, it's assumed that the reader can handle when the whole metadata field isn't set so this is completely skipped, which is what was done before.
It might make sense to match what C++ does and keep the previous behaviour of excluding the full ColumnChunk metadata field when the footer is encrypted, and check that the reader can handle this when it doesn't have the column key. That should have the benefit of not increasing the footer size too much. And then in the plaintext footer case, also exclude the other fields below.
There was a problem hiding this comment.
Good catch @etseidl !
I've moved the other sensitive metadata into the conditional and added tests. (except for geospatial since it requires #[cfg(feature = "geospatial")] and it seems straightforward enough to not cover here?)
There was a problem hiding this comment.
Thanks @rok, looks good now. I'll leave it to @adamreeve if we want to revert to excluding the meta_data entirely when the footer is encrypted.
There was a problem hiding this comment.
Per @adamreeve suggestion this now omits metadata in case encrypted metadata is present and strips out stats if it is not.
parquet/src/file/metadata/writer.rs
Outdated
| // We always encrypt column metadata separately and store in | ||
| // encrypted_column_metadata. This allows skipping the plaintext meta_data | ||
| // field in the writer to reduce footer size. If encrypted_column_metadata | ||
| // were not set, the reader would not be able to read the column metadata. | ||
| Some(file_encryptor.get_footer_encryptor()?) |
There was a problem hiding this comment.
I don't think this is the right approach. If the footer is already encrypted we shouldn't need to double-encrypt the column metadata. This goes against what the spec describes and might be incompatible with some readers, plus it will require extra decryption time when reading.
If it's too complex to wire through the information about whether the footer is a plaintext or not to serialize_column_meta_data, I think we should stick with what you had before, or go back to what we currently do in the main branch where the full column metadata is skipped rather than just the stats. My understanding is that writing the metadata in plaintext without the stats was for backwards compatibility with readers that expect the column metadata to always be present when encryption was first being added. But that might not be such a concern now if readers know that field is optional.
There was a problem hiding this comment.
Sorry, forgot to reply here. I've introduced bool ColumnChunkMeteData.plaintext_footer_mode that now helps writer decide if it should write metadata or not. See change.
a6a3ec2 to
5e195dc
Compare
adamreeve
left a comment
There was a problem hiding this comment.
This all looks good to me now thanks Rok, and it's nice that we match what the C++ library does.
bb283d4 to
2bf1e29
Compare
|
Happy new year @etseidl ! |
|
Merged up to resolve a conflict |
|
Tests required setting |
etseidl
left a comment
There was a problem hiding this comment.
Change to the test is fine. Alternatively you could leave the defaults and change to testing if the encoding stats mask is Some. Not worth waiting any longer, however.
|
Shall we merge this one in? |
|
Yes please :) |
…ache#8305) - Closes apache#8304. See apache#8304 and especially apache#8304 (comment) --------- Co-authored-by: Adam Reeve <[email protected]> Co-authored-by: Andrew Lamb <[email protected]>
…text footer (#8305) (#9310) - Part of #9240 - Related to #8304 This is a backport of the following PR to the 57 line - #8305 from @rok Co-authored-by: Rok Mihevc <[email protected]> Co-authored-by: Adam Reeve <[email protected]>
Rationale for this change
See #8304 and especially #8304 (comment)