Fix empty offset index for all null columns (#4459) #4460
tustvold merged 1 commit into apache:master from
Conversation
cc @AdamGS
alamb left a comment
I reviewed the code and test in this PR carefully and they look good to me.
I will also attempt to verify that it fixes our actual data problem, using some internal test data.
writer.close().unwrap();
}
|
|
I verified that without the code change in this PR, this test fails like this:
failures:
---- arrow::arrow_writer::tests::test_writer_all_null stdout ----
thread 'arrow::arrow_writer::tests::test_writer_all_null' panicked at 'assertion failed: `(left == right)`
left: `0`,
right: `1`', parquet/src/arrow/arrow_writer/mod.rs:2637:9
stack backtrace:
0: rust_begin_unwind
at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/std/src/panicking.rs:578:5
1: core::panicking::panic_fmt
at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/panicking.rs:67:14
2: core::panicking::assert_failed_inner
at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/panicking.rs:274:17
3: core::panicking::assert_failed
at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/panicking.rs:228:5
4: parquet::arrow::arrow_writer::tests::test_writer_all_null
at ./src/arrow/arrow_writer/mod.rs:2637:9
5: parquet::arrow::arrow_writer::tests::test_writer_all_null::{{closure}}
at ./src/arrow/arrow_writer/mod.rs:2615:31
6: core::ops::function::FnOnce::call_once
at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/ops/function.rs:250:5
7: core::ops::function::FnOnce::call_once
at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
I have verified that using master and 42.0.0, when I write a parquet file of our internal data, the file errors on read with Parquet error: StructArrayReader out of sync in read_records, expected 0 skipped, got 11:
$ datafusion-cli -c "SELECT xxx, time FROM '/tmp/foo.parquet' WHERE 1684850057953220316 <= time::bigint";
DataFusion CLI v27.0.0
Arrow error: External error: Arrow: Parquet argument error: Parquet error: StructArrayReader out of sync in read_records, expected 0 skipped, got 11
When I use the code in this PR, or 41.0.0, to write out the same data, I can query the resulting parquet file successfully:
$ datafusion-cli -c "SELECT xxx, time FROM '/tmp/foo.parquet' WHERE 1684850057953220316 <= time::bigint";
DataFusion CLI v27.0.0
+-------+---------------------+
| xxx | time |
+-------+---------------------+
(there is data here)
Thus I am quite confident that this is the root cause and fix.
Thank you for the quick turnaround @tustvold
Test Program
use std::fs::File;

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ArrowWriter;

fn main() {
// read data from parquet and write it back
let path = "/path/to/input/data.parquet";
println!("reading from {path}");
let file = File::open(path).unwrap();
let builder = ParquetRecordBatchReaderBuilder::try_new(file).unwrap();
let mut reader = builder.build().unwrap();
// rewrite the data to /tmp/foo.parquet
let path = "/tmp/foo.parquet";
println!("writing to {path}");
let file = File::create(path).unwrap();
let mut writer = ArrowWriter::try_new(file, reader.schema(), None).unwrap();
while let Some(record_batch) = reader.next().transpose().unwrap() {
println!("Read {} records.", record_batch.num_rows());
writer.write(&record_batch).expect("Writing batch");
}
// writer must be closed to write footer
writer.close().unwrap();
}
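To build the test program above as a standalone binary, a Cargo manifest roughly like the following should suffice. The crate version is an assumption matching the 42.0.0 release under discussion; the `parquet` crate's default features include its `arrow` API:

```toml
[package]
name = "parquet-roundtrip"
version = "0.1.0"
edition = "2021"

[dependencies]
# Version is illustrative; use the branch containing this fix to verify it.
parquet = "42.0.0"
```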
Which issue does this PR close?
Closes #4459 (fixes a bug introduced by #4389 in 42.0.0)
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?