Conversation
alamb
left a comment
Thank you for this @lilianm -- I think this idea seems reasonable to me
If we want to make this a public API, I think we should add some more documentation -- specifically, can we please add a doc test that shows how a user will use the ArrowRowGroupWriter?
Specifically, I am thinking about something like this https://docs.rs/parquet/latest/parquet/arrow/arrow_writer/struct.ArrowColumnWriter.html#example-encoding-two-arrow-arrays-in-parallel
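To make the request concrete, here is a minimal std-only sketch of roughly the pattern such a doc test would demonstrate: one worker thread per column, fed through a channel, with the finished chunks collected in column order. `ColumnEncoder`, `EncodedChunk` and `encode_row_group` are hypothetical stand-ins invented for illustration, not types from the parquet crate; a real doc test would use the crate's own column / row group writers instead.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for an encoded column chunk (NOT a parquet crate type).
#[derive(Debug, PartialEq)]
struct EncodedChunk {
    column: usize,
    bytes: Vec<u8>,
}

/// Hypothetical stand-in for a per-column writer: it simply serializes
/// i32 values as little-endian bytes into an in-memory buffer.
struct ColumnEncoder {
    column: usize,
    buf: Vec<u8>,
}

impl ColumnEncoder {
    fn new(column: usize) -> Self {
        Self { column, buf: Vec::new() }
    }

    fn write(&mut self, values: &[i32]) {
        for v in values {
            self.buf.extend_from_slice(&v.to_le_bytes());
        }
    }

    fn close(self) -> EncodedChunk {
        EncodedChunk { column: self.column, bytes: self.buf }
    }
}

/// Encode every column of one row group on its own thread, then
/// return the finished chunks in column order.
fn encode_row_group(columns: Vec<Vec<i32>>) -> Vec<EncodedChunk> {
    // Spawn one worker per column, each with its own channel.
    let workers: Vec<_> = (0..columns.len())
        .map(|idx| {
            let (send, recv) = mpsc::channel::<Vec<i32>>();
            let handle = thread::spawn(move || {
                let mut encoder = ColumnEncoder::new(idx);
                // The loop ends once the sender is dropped.
                for batch in recv {
                    encoder.write(&batch);
                }
                encoder.close()
            });
            (handle, send)
        })
        .collect();

    // Hand each column's data to its worker.
    for ((_, send), col) in workers.iter().zip(columns) {
        send.send(col).expect("worker hung up");
    }

    // Drop each sender to signal completion, then join in column order.
    workers
        .into_iter()
        .map(|(handle, send)| {
            drop(send);
            handle.join().expect("worker panicked")
        })
        .collect()
}
```

Calling `encode_row_group(vec![vec![1, 2, 3], vec![10, 20]])` yields two chunks, one per column. Joining the workers in the order they were spawned is what keeps the chunks in schema order, which is the property a row-group-level helper would need to preserve.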
|
Marking as draft as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review. Please mark it as ready for review when it is ready for another look |
|
I came across this PR via #8162. We'd initially wanted to expose And is exposing |
In my opinion the best thing we can do is to write up some examples showing how to write parquet using multiple cores / threads, which will help guide the API design. This is what I was getting at above with asking for doc examples. Basically, if we are going to add new APIs, we should also have examples showing how they are used, which will have the double benefit of
|
After more careful review of #8162, I think it should enable all the necessary APIs for multi-threaded writing of parquet with encryption |
|
@alamb @adamreeve thanks for the feedback. I think #8162 is the better approach. I will review it and add my feedback there. @alamb I agree that we should improve the documentation on multi-core / multi-threaded writing. If everybody agrees, I will close this PR so we can concentrate our effort on #8162. |
|
Sorry, I read #8162 too fast. I think it's a better way to expose 'ArrowRowGroupWriter' and add function And for ticket #8162 expose |
Sorry I put a response to this on a different PR: #8162 (comment) Basically, I am not sure that Given how much effort we go through in arrow to keep the API stable, I am hesitant to add anything more to the API than necessary. I think we can make |
|
For completeness, here's a parallelized (over columns and row groups) encrypted parquet writer in a DataFusion PR. It uses |
Which issue does this PR close?
Rationale for this change
Use the ArrowRowGroupWriter helper to write a row group when using the get_column_writers / append_row_group API in ArrowWriter, implemented in issue
What changes are included in this PR?
Make ArrowRowGroupWriter public and move memory_size, get_estimated_total_bytes and rows_count to it from ArrowWriter
Are these changes tested?
Yes
Are there any user-facing changes?
Yes: ArrowRowGroupWriter is now public and has new functions.