Enable External ArrowColumnWriter Access#4859
Enable External ArrowColumnWriter Access#4859devinjdangelo wants to merge 5 commits intoapache:masterfrom
Conversation
|
|
||
| /// Write the leaves of `array` in depth-first order to `writers` with `levels` | ||
| fn write_leaves<'a, W>( | ||
| pub fn write_leaves<'a, W>( |
There was a problem hiding this comment.
Is there some way we could avoid making this and the accompanying LevelInfo structures public? I'm not really sure we want to make a hard API commitment to not changing them...
There was a problem hiding this comment.
This is a good point. I did some refactoring and created an ArrowColumnWriter::write method that abstracts these low level implementation details for external users.
| /// This allows it to be owned by lower level page writers whilst allowing access via | ||
| /// [`ArrowRowGroupWriter`] on flush, without requiring self-referential borrows | ||
| type SharedColumnChunk = Arc<Mutex<ArrowColumnChunk>>; | ||
| pub type SharedColumnChunk = Arc<Mutex<ArrowColumnChunk>>; |
There was a problem hiding this comment.
This is a bit of a hack, that I stomached largely because it was an internal implementation detail, is there some way we can keep this hidden as an implementation detail?
There was a problem hiding this comment.
I did some additional refactoring to split out the Vec<(SharedColumnChunk, ArrowColumnWriter)> into two separate Vecs. This way I could return ownership of only the ArrowColumnWriters. SharedColumnChunk is private again.
| enum ArrowColumnWriter { | ||
| /// Serializes [ArrayRef]s to [ArrowColumnChunk]s which can be concatenated | ||
| /// to form a parquet row group | ||
| pub enum ArrowColumnWriter { |
There was a problem hiding this comment.
Can we make the enum variants private as well, i.e. something like
pub struct ArrowColumnWriter(ArrowColumnWriterImpl);
enum ArrowColumnWriterImpl {
...
}
| /// Takes ownership of all [ArrowColumnWriter]s from this [ArrowRowGroupWriter] | ||
| /// Caller must restore ownership with give_col_writers before calling close method. | ||
| pub fn take_col_writers(&mut self) -> Vec<ArrowColumnWriter> { | ||
| self.writers.drain(..).collect() | ||
| } | ||
|
|
||
| /// Restores ownership of all [ArrowColumnWriter]s. Caller is responsible for | ||
| /// returning the [Vec] in the same order returned by take_col_writers method. | ||
| pub fn give_col_writers(&mut self, writers: Vec<ArrowColumnWriter>) { | ||
| self.writers = writers; | ||
| } |
There was a problem hiding this comment.
I'm not a massive fan of this API tbh, I'll have a play and see what I can come up with
There was a problem hiding this comment.
I agree. Let me know if you come up with something more elegant!
My original attempt was just to provide a mutable reference to the writers, but it is more challenging to handle mutable references safely vs. passing ownership in parallel async tasks (I considered giving https://docs.rs/async-scoped/latest/async_scoped/ a try but decided against it).
I moved away from entirely deconstructing the ArrowRowGroupWriterin order to keep the SharedColumnChunk private.
| /// Serializes an [ArrayRef] to a [ArrowColumnChunk] for an in progress row group. | ||
| pub fn write(&mut self, array: ArrayRef, field: Arc<Field>) -> Result<()> { | ||
| let mut levels = calculate_array_levels(&array, &field)?.into_iter(); | ||
| let mut writer_iter = std::iter::once(self); |
There was a problem hiding this comment.
This is actually wrong if array is nested, and therefore comprises multiple leaf columns
There was a problem hiding this comment.
Got it, makes sense. Perhaps we could have something like:
pub struct ArrowColumnWriter(Vec<ArrowColumnWriterImpl>);
enum ArrowColumnWriterImpl{
...
}which for a non nested column would contain only one ArrowColumnWriterImpl, but could hold multiple in the case of nested columns?
There was a problem hiding this comment.
I'm having a play with a slightly different API for this, will report back 😄
Which issue does this PR close?
Related to: #1718
Enables: apache/datafusion#7655
Rationale for this change
#4850 enabled external access to
ArrowRowGroupWriterso downstream users could orchestrate serialization of row groups in parallel on threads/tokio tasks as desired. This PR goes one level deeper to makeArrowColumnWriterand associated structs/functions public, so that downstream users can serialize columns in parallel.This PR also adds some utility methods to break apart and reconstruct
ArrowRowGroupWriter. The idea is to do the following:ArrowRowGroupWriterArrowColumnWriters and distribute to threads/tasksArrowRowGroupWriterand finalize the row groupThe above strategy is implemented in apache/datafusion#7655.
What changes are included in this PR?
ArrowColumnWriterand associated sturcts/functions are markedpub. Additional utility methods implemented forArrowRowGroupWriter.Are there any user-facing changes?
Additional structs and functions are public.