Closed
Labels
- bug
- enhancement: Any new improvement worthy of a entry in the changelog
- next-major-release: the PR has API changes and it waiting on the next major version
- parquet: Changes to the parquet crate
Description
Describe the bug
- As @jonded94 found in "Files containing binary data with >=8_388_855 bytes per row written with arrow-rs can't be read with pyarrow" #7489
- And @etseidl debugged in "Truncate Parquet page data page statistics" #7555
When writing long string values into string columns in Parquet, we expect WriterProperties::max_statistics_truncate_length to be applied and to reduce the size of the statistics that are written.
This property currently correctly truncates statistics written to the ColumnChunkMetadata but NOT the statistics written to the data page headers.
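For readers unfamiliar with statistics truncation, the sketch below illustrates the idea on raw bytes. It is not the parquet crate's implementation: `truncate_min` and `truncate_max` are hypothetical helpers, and the sketch assumes plain byte-wise comparison and ignores UTF-8 character boundaries, which a real writer would need to respect for string columns.

```rust
/// Hypothetical helper: shorten a `min` statistic to at most `max_len` bytes.
/// Any prefix of a byte string sorts at or before the original, so the
/// prefix is still a valid lower bound.
fn truncate_min(value: &[u8], max_len: usize) -> Vec<u8> {
    value[..value.len().min(max_len)].to_vec()
}

/// Hypothetical helper: shorten a `max` statistic to at most `max_len` bytes.
/// A bare prefix would sort before the original, so the last byte that can
/// be incremented is bumped to keep the result an upper bound.
/// Returns `None` when no shorter upper bound exists (every kept byte is 0xFF,
/// or `max_len` is 0).
fn truncate_max(value: &[u8], max_len: usize) -> Option<Vec<u8>> {
    if value.len() <= max_len {
        // Already short enough: the value is its own upper bound.
        return Some(value.to_vec());
    }
    let mut out = value[..max_len].to_vec();
    // Drop trailing 0xFF bytes, which cannot be incremented.
    while let Some(&last) = out.last() {
        if last == 0xFF {
            out.pop();
        } else {
            break;
        }
    }
    // Increment the last remaining byte; bail out if nothing is left.
    let last = out.last_mut()?;
    *last += 1;
    Some(out)
}
```

With the 100000-byte values used in the reproducer and a limit of 64, `truncate_min` would keep the first 64 bytes, and `truncate_max` would keep 63 bytes followed by one incremented byte, so both bounds fit in 64 bytes while still bracketing the original value.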
To Reproduce
use std::io::BufWriter;
use std::sync::Arc;

use arrow::array::{ArrayRef, RecordBatch, StringViewArray};
use parquet::arrow::ArrowWriter;
use parquet::file::properties::WriterProperties;

fn main() {
    let output = std::fs::File::create("output.parquet").unwrap();
    let mut output = BufWriter::new(output);

    let batch = make_batch('a');
    // One row per row group, with statistics limited to 64 bytes.
    let props = WriterProperties::builder()
        .set_max_row_group_size(1)
        .set_statistics_truncate_length(Some(64))
        .build();

    let mut writer = ArrowWriter::try_new(&mut output, batch.schema(), Some(props)).unwrap();
    writer.write(&batch).unwrap();

    // Write ten more single-row batches, each holding one long string value.
    for c in ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'] {
        let batch = make_batch(c);
        writer.write(&batch).unwrap();
    }
    writer.close().unwrap();
}

// Makes a batch with long string values for testing purposes.
fn make_batch(val: char) -> RecordBatch {
    let col = Arc::new(StringViewArray::from_iter_values(
        [val.to_string().repeat(100000)]
    )) as ArrayRef;
    RecordBatch::try_from_iter([("col", col)]).unwrap()
}

The resulting data page headers have statistics that are not truncated.
Expected behavior
I expect the statistics in the data page headers to be truncated to 64 bytes.
Additional context