
API to copy an existing RowGroup, including metadata from one parquet file to another #4823

@alamb

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
In DataFusion, @devinjdangelo is using the append_column API to write parquet files in parallel (apache/datafusion#7562).

However, copying an existing RowGroupMetaData through that API, along with any bloom filters, page offsets, or other metadata, is awkward.

Describe the solution you'd like

I would like a way to call the append_column API given a RowGroupMetaData object from the existing file.

Ideally there would be an API that produced a ColumnCloseResult from a RowGroupMetaData, or some convenience API that took a reader plus a RowGroupMetaData from another file and did the necessary copy.

Perhaps something like:

impl SerializedRowGroupWriter {
...
  /// appends an entire RowGroup from the specified reader, including all
  /// metadata, to the in progress parquet file. 
  pub fn append_row_group(&mut self, rg: Box<dyn RowGroupReader>) -> Result<...> { 
   ...
  }
}

https://docs.rs/parquet/latest/parquet/file/writer/struct.SerializedRowGroupWriter.html#method.append_column
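To make the intended behavior concrete, here is a toy, self-contained model of what such a convenience might do. Every type below (`ColumnMeta`, `RowGroupMeta`, `Writer`) is a hypothetical stand-in and not the parquet crate's real `ColumnCloseResult`, `RowGroupMetaData`, or `SerializedRowGroupWriter`; the sketch only illustrates the idea of forwarding each column chunk's bytes together with its metadata, and it assumes page offsets are stored relative to the chunk start.

```rust
// Hypothetical stand-in types: a simplified model, NOT the parquet crate's API.

/// Per-column metadata that should survive the copy.
#[derive(Clone, Debug, PartialEq)]
struct ColumnMeta {
    name: String,
    offset: usize,                 // byte offset of the chunk within its file
    len: usize,                    // byte length of the chunk
    bloom_filter: Option<Vec<u8>>, // opaque bloom filter bytes, if any
    page_offsets: Vec<usize>,      // assumed relative to the chunk start
}

/// Stand-in for a row group's metadata.
#[derive(Clone, Debug, PartialEq)]
struct RowGroupMeta {
    num_rows: u64,
    columns: Vec<ColumnMeta>,
}

/// Stand-in for an in-progress output file.
#[derive(Default)]
struct Writer {
    bytes: Vec<u8>,
    row_groups: Vec<RowGroupMeta>,
}

impl Writer {
    /// Analogue of `append_column`: copies one chunk's raw bytes from `src`
    /// and returns its metadata with the offset rewritten for the new file.
    fn append_column(&mut self, src: &[u8], col: &ColumnMeta) -> Result<ColumnMeta, String> {
        let end = col
            .offset
            .checked_add(col.len)
            .ok_or_else(|| "offset overflow".to_string())?;
        let chunk = src
            .get(col.offset..end)
            .ok_or_else(|| format!("column {} is out of bounds", col.name))?;
        let new_offset = self.bytes.len();
        self.bytes.extend_from_slice(chunk);
        // Bloom filter and page offsets are carried over unchanged.
        Ok(ColumnMeta { offset: new_offset, ..col.clone() })
    }

    /// Hypothetical `append_row_group`: copies every column of `rg`, or
    /// rolls back the copied bytes if any column fails.
    fn append_row_group(&mut self, src: &[u8], rg: &RowGroupMeta) -> Result<(), String> {
        let start = self.bytes.len();
        let mut columns = Vec::with_capacity(rg.columns.len());
        for col in &rg.columns {
            match self.append_column(src, col) {
                Ok(copied) => columns.push(copied),
                Err(e) => {
                    self.bytes.truncate(start);
                    return Err(e);
                }
            }
        }
        self.row_groups.push(RowGroupMeta { num_rows: rg.num_rows, columns });
        Ok(())
    }
}
```

The point of the sketch is that the caller hands over only the source bytes and the row group metadata; the per-column bookkeeping (offset rewriting, carrying bloom filters and page offsets along) happens inside the writer rather than in user code.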

Describe alternatives you've considered

Additional context


Labels: enhancement, parquet
