Support multi-threaded writing of Parquet files with modular encryption#7818
Support multi-threaded writing of Parquet files with modular encryption#7818rok wants to merge 5 commits intoapache:mainfrom
Conversation
ee38fb5 to
cd5196f
Compare
cd5196f to
c2ac2d9
Compare
There was a problem hiding this comment.
👍 I like the approach of exposing the ArrowRowGroupWriterFactory to handle passing through the file encryption properties. We should probably also update the documentation on ArrowColumnWriter to encourage users to use this API, and/or add documentation to get_column_writers to encourage users to use this approach for compatibility with encryption.
There seem to be a lot of extra things that have been exposed publicly in this PR that don't need to be though, which makes it hard to follow what changes are actually needed. We should minimize how much is public so it's easier for users to follow documentation and see which APIs they should be using, and so that internal details can be changed later.
Co-authored-by: Adam Reeve <[email protected]>
86ddf45 to
b35d078
Compare
b35d078 to
2d42ca0
Compare
|
Thanks for the quick review @adamreeve ! |
bc293de to
4009f42
Compare
4009f42 to
b57a5d4
Compare
adamreeve
left a comment
There was a problem hiding this comment.
This is looking pretty good thanks Rok, I've added a few more comments
|
I've opened a draft PR against datafusion to check if this is usable for multi-threaded writes and it does seem to be. Switching this to ready for review. |
alamb
left a comment
There was a problem hiding this comment.
Thank you @rok and @adamreeve -- this is very exciting.
I have one question about API design but otherwise the basic idea looks great to me
|
|
||
| /// Encodes [`RecordBatch`] to a parquet row group | ||
| struct ArrowRowGroupWriter { | ||
| pub struct ArrowRowGroupWriter { |
There was a problem hiding this comment.
Is there a reason we can't use the existing low level APIs documented here: https://docs.rs/parquet/latest/parquet/arrow/arrow_writer/struct.ArrowColumnWriter.html#example-encoding-two-arrow-arrays-in-parallel
In other words, do we really need to make what has previously been an implementation detail part of the public API?
There was a problem hiding this comment.
Thanks for the quick response @alamb ! :)
Looking at the example - we have to pass a single private FileEncryptor object (it has random state) to all ArrowColumnWriters, SerializedRowGroupWriters and SerializedFileWriter. Perhaps we could hide everything into ArrowWriter. I'll have to do a quick check. If you have a cleaner idea we'd be most interested!
There was a problem hiding this comment.
I do think avoiding exposing ArrowRowGroupWriter and the associated machinery that would be good
The rationale is that @XiangpengHao and @zhuqi-lucas and myself are likely to be reworking it in the next few releases and once it is part of the public API changing it becomes harder as it requires more coordination / backwards compatibility concerns
| impl WriterPropertiesBuilder { | ||
| /// Returns default state of the builder. | ||
| fn with_defaults() -> Self { | ||
| pub fn with_defaults() -> Self { |
There was a problem hiding this comment.
We don't need to make this pub, you can call WriterProperties::builder instead. I think it's best to keep this private rather than exposing two ways to do the same thing.
There was a problem hiding this comment.
Or impl Default for WriterPropertiesBuilder?
|
Closing in favor of #8029. |
Rationale for this change
This is to enable concurrent column writing with encryption downstream (e.g. with datafusion). See #7359 for more.
See https://github.com/apache/arrow-rs/pull/7111/files#r2015196618
What changes are included in this PR?
ArrowRowGroupWriterFactory.create_row_group_writernow passes aFileEncryptortoArrowColumnWriterFactorythat can later be used to write columns concurrently.pub.Are these changes tested?
Yes.
Are there any user-facing changes?
ArrowRowGroupWriterFactory,ArrowRowGroupWriter,ArrowRowGroupWriterFactory::new,ArrowRowGroupWriterFactory.create_row_group_writer, .. are nowpub. Shouldn't be breaking but perhaps needs review.