#8029 introduced ArrowWriter.get_column_writers to expose the Vec<ArrowColumnWriter> of the "in progress" ArrowRowGroupWriter. This was meant to let downstream libraries write columns and row groups concurrently. However, only one ArrowRowGroupWriter exists at a time, and all of its ArrowColumnWriters must complete before serialization of a new RowGroup can begin. This can be worked around with locking, but that is not ideal. See apache/datafusion#16738 (comment).
We could:
- Have downstream users lock and serialize only one RowGroup at a time.
- Have ArrowWriter keep a Vec<ArrowRowGroupWriter> for all RowGroups currently being serialized.
- Expose the ArrowRowGroupWriterFactory of the active ArrowWriter.
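
To make the factory option more concrete, here is a minimal std-only sketch of the shape it might take. All types below (ColumnWriter, RowGroupWriter, RowGroupWriterFactory) and the function encode_row_groups are hypothetical stand-ins written for illustration; they are not the parquet crate's types and do not mirror its signatures. The assumption is that a factory can hand out an independent writer per row group, so several row groups can be encoded at once without a shared lock and then collected in file order.

```rust
use std::thread;

/// Hypothetical stand-in for a per-column writer (not the parquet crate's type).
struct ColumnWriter {
    buf: Vec<u8>,
}

/// Hypothetical stand-in for the set of column writers of one row group.
struct RowGroupWriter {
    index: usize,
    columns: Vec<ColumnWriter>,
}

/// Hypothetical factory that creates an independent writer per row group,
/// so no single "in progress" slot has to be shared behind a lock.
struct RowGroupWriterFactory {
    num_columns: usize,
}

impl RowGroupWriterFactory {
    fn create(&self, index: usize) -> RowGroupWriter {
        let columns = (0..self.num_columns)
            .map(|_| ColumnWriter { buf: Vec::new() })
            .collect();
        RowGroupWriter { index, columns }
    }
}

/// Encodes `num_row_groups` row groups on separate threads and returns
/// `(row_group_index, encoded_bytes)` pairs in file order.
fn encode_row_groups(
    factory: &RowGroupWriterFactory,
    num_row_groups: usize,
) -> Vec<(usize, usize)> {
    let handles: Vec<_> = (0..num_row_groups)
        .map(|i| {
            // Each row group owns its writers, so the thread needs no lock.
            let mut rg = factory.create(i);
            thread::spawn(move || {
                for (c, col) in rg.columns.iter_mut().enumerate() {
                    // Placeholder "encoding": two bytes per column.
                    col.buf.extend_from_slice(&[i as u8, c as u8]);
                }
                let bytes: usize = rg.columns.iter().map(|c| c.buf.len()).sum();
                (rg.index, bytes)
            })
        })
        .collect();

    // Joining in spawn order preserves row group order for the final file.
    handles
        .into_iter()
        .map(|h| h.join().expect("row group worker panicked"))
        .collect()
}
```

In this sketch, the encoding work runs concurrently while the results are still appended in row group order, which is the property the locking approach gives up by serializing one RowGroup at a time.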
Additionally, we should introduce a write_parquet_with_small_rg_size with encryption to sufficiently test this codepath.