Closed
Labels
enhancement (any new improvement worthy of an entry in the changelog), parquet (changes to the parquet crate), performance
Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I ran into this while working on the benchmark here
I noticed that a substantial share of the benchmark's time (15% of the overall time) was spent in convert_row_groups:
arrow-rs/parquet/src/file/metadata/thrift_gen.rs
Lines 247 to 257 in b4b4d26
    fn convert_row_groups(
        mut row_groups: Vec<RowGroup>,
        schema_descr: Arc<SchemaDescriptor>,
    ) -> Result<Vec<RowGroupMetaData>> {
        let mut res: Vec<RowGroupMetaData> = Vec::with_capacity(row_groups.len());
        for rg in row_groups.drain(0..) {
            res.push(convert_row_group(rg, schema_descr.clone())?);
        }
        Ok(res)
    }
Describe the solution you'd like
I think that code could likely be optimized.
Describe alternatives you've considered
Two obvious candidates:
- Use the into_iter()/collect pattern to map the results (which is highly optimized in Rust)
- Don't clone the Arc<SchemaDescriptor>; I think it only needs a reference
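To make the two candidates concrete, here is a minimal sketch of what the combined change might look like. The types below are simplified, hypothetical stand-ins for the real parquet structs (the field names and the column-count check are made up for illustration). It also assumes that convert_row_group only needs to read the schema; if the real RowGroupMetaData stores its own Arc, one clone per row group would still be required.

```rust
use std::sync::Arc;

// Hypothetical, simplified stand-ins for the real parquet types.
#[derive(Debug)]
struct SchemaDescriptor {
    num_columns: usize,
}

#[derive(Debug)]
struct RowGroup {
    num_rows: i64,
    num_columns: usize,
}

#[derive(Debug, PartialEq)]
struct RowGroupMetaData {
    num_rows: i64,
}

type Result<T> = std::result::Result<T, String>;

// Assumption: converting one row group only reads the schema,
// so it borrows the descriptor instead of taking an owned Arc.
fn convert_row_group(rg: RowGroup, schema_descr: &SchemaDescriptor) -> Result<RowGroupMetaData> {
    if rg.num_columns != schema_descr.num_columns {
        return Err(format!(
            "column count mismatch: {} != {}",
            rg.num_columns, schema_descr.num_columns
        ));
    }
    Ok(RowGroupMetaData { num_rows: rg.num_rows })
}

fn convert_row_groups(
    row_groups: Vec<RowGroup>,
    schema_descr: &Arc<SchemaDescriptor>,
) -> Result<Vec<RowGroupMetaData>> {
    // Consume the input vector and collect into Result<Vec<_>>;
    // collection stops at the first conversion error.
    row_groups
        .into_iter()
        .map(|rg| convert_row_group(rg, schema_descr))
        .collect()
}
```

Whether this is measurably faster than the drain/push loop would need to be confirmed with the benchmark; the sketch only shows the shape of the change.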
Another option would be to decode directly into RowGroupMetaData somehow (maybe make RowGroupMetaData a view on an inner RowGroup 🤔):
    struct RowGroupMetaData {
        inner: RowGroup,
    }
Additional context
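As a rough illustration of the view idea, the wrapper could expose accessors that read straight from the decoded struct, so no per-field conversion happens up front. The RowGroup below is a hypothetical stand-in for the thrift-decoded type, and the two fields and accessor names are chosen only for the example.

```rust
// Hypothetical stand-in for the thrift-decoded struct.
#[derive(Debug, Default)]
struct RowGroup {
    num_rows: i64,
    total_byte_size: i64,
}

// View type: owns the decoded value and exposes it through accessors,
// rather than copying every field into a separate representation.
#[derive(Debug)]
struct RowGroupMetaData {
    inner: RowGroup,
}

impl RowGroupMetaData {
    fn num_rows(&self) -> i64 {
        self.inner.num_rows
    }

    fn total_byte_size(&self) -> i64 {
        self.inner.total_byte_size
    }
}

impl From<RowGroup> for RowGroupMetaData {
    // Wrapping moves the decoded struct; nothing is converted field by field.
    fn from(inner: RowGroup) -> Self {
        Self { inner }
    }
}
```

With this shape, converting a Vec<RowGroup> would reduce to wrapping each element, though anything the real RowGroupMetaData derives from the schema would still have to be handled somewhere.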