Skip to content

Expose SortingColumn in parquet files#3103

Merged
tustvold merged 9 commits intoapache:masterfrom
askoa:sorting-column
Nov 15, 2022
Merged

Expose SortingColumn in parquet files#3103
tustvold merged 9 commits intoapache:masterfrom
askoa:sorting-column

Conversation

@askoa
Copy link
Contributor

@askoa askoa commented Nov 13, 2022

Which issue does this PR close?

Closes #3090

Reading from file:

The function footer.rs#decode_metadata read sorting_column from file. However the function RowGroupMetaData::from_thrift was not reading the field into RowGroupMetaData. The function was modified to read the sorting_column into RowGroupMetaData

let t_file_metadata: TFileMetaData = TFileMetaData::read_from_in_protocol(&mut prot)
.map_err(|e| ParquetError::General(format!("Could not parse metadata: {}", e)))?;
let schema = types::from_thrift(&t_file_metadata.schema)?;
let schema_descr = Arc::new(SchemaDescriptor::new(schema));
let mut row_groups = Vec::new();
for rg in t_file_metadata.row_groups {
row_groups.push(RowGroupMetaData::from_thrift(schema_descr.clone(), rg)?);
}
let column_orders = parse_column_orders(t_file_metadata.column_orders, &schema_descr);

Writing into file:

The function format.rs#RowGroup#write_to_out_protocol writes sorting_column to file. However the function metadata.rs#RowGroupMetaData#to_thrift was not writing the field to RowGroup. The function was modified to write sorting_column from RowGroupMetaData to RowGroup

arrow-rs/parquet/src/format.rs

Lines 4155 to 4162 in 430eb84

if let Some(ref fld_var) = self.sorting_columns {
o_prot.write_field_begin(&TFieldIdentifier::new("sorting_columns", TType::List, 4))?;
o_prot.write_list_begin(&TListIdentifier::new(TType::Struct, fld_var.len() as i32))?;
for e in fld_var {
e.write_to_out_protocol(o_prot)?;
}
o_prot.write_list_end()?;
o_prot.write_field_end()?

@github-actions github-actions bot added the parquet Changes to the parquet crate label Nov 13, 2022
@askoa askoa marked this pull request as ready for review November 13, 2022 23:57
Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we get an end-to-end test of this, i.e. write a parquet file to a Vec<u8> then read it back and verify the sort column was round-tripped correctly

Comment on lines 383 to 368
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
value: Option<Vec<SortingColumnMetaData>>,
) -> Self {
self.sorting_columns = value;
value: Vec<SortingColumnMetaData>,
) -> Self {
self.sorting_columns = Some(value);

This is consistent with set_page_offset

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got a reverse review before. The reason given by the reviewer was that if we remove Option from signature then the function cannot be used to set None for this field. I am going to keep this as is.

@askoa askoa marked this pull request as draft November 14, 2022 13:57
@askoa askoa marked this pull request as ready for review November 15, 2022 01:26
Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting close 😄

@tustvold tustvold mentioned this pull request Nov 15, 2022
@tustvold tustvold merged commit 371ec57 into apache:master Nov 15, 2022
@tustvold
Copy link
Contributor

Thank you 👍

@ursabot
Copy link

ursabot commented Nov 15, 2022

Benchmark runs are scheduled for baseline = 8bb2917 and contender = 371ec57. 371ec57 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Expose SortingColumn when reading and writing parquet metadata

3 participants