Parquet AsyncReader: Don't panic when empty offset_index is Some([])#6582
Parquet AsyncReader: Don't panic when empty offset_index is Some([])#6582tustvold merged 8 commits intoapache:masterfrom
Conversation
| let offset_index = self | ||
| .metadata | ||
| .offset_index() | ||
| .filter(|index| !index.is_empty()) |
There was a problem hiding this comment.
This panic happens when manually setting metadata.set_offset_index(Some(vec![])); as in the test empty_ofset_index_doesnt_panic_in_read_row_group
There was a problem hiding this comment.
It seems the synchronous file reader does a similar filtering step, so this is probably a good addition. It may not have been an issue in the past since MetadataReader will return None rather than Some([]) for missing indexes.
| Some(data) => { | ||
| let page_locations = self | ||
| .offset_index | ||
| .filter(|index| !index.is_empty()) |
There was a problem hiding this comment.
This panic happens when metadata is saved to a file and reloaded
such as in the test empty_offset_index_doesnt_panic_in_column_chunks or in the example external_metadata
| } | ||
|
|
||
| #[tokio::test] | ||
| async fn empty_ofset_index_doesnt_panic_in_read_row_group() { |
There was a problem hiding this comment.
These tests may not be well named. Feel free to suggest better ones
| .unwrap(); | ||
|
|
||
| let result = reader.try_collect::<Vec<_>>().await.unwrap(); | ||
| assert_eq!(result.len(), 8); |
There was a problem hiding this comment.
Not sure if there's a more concise+focused way to write these tests.
Ideally this assertion checks the page_locations are like this
page_locations: Some(
[
PageLocation {
offset: 4,
compressed_page_size: 109,
first_row_index: 0,
},
PageLocation {
offset: 113,
compressed_page_size: 109,
first_row_index: 21,
},
...
])and the other 2 tests assert that page_locations is Emtpy.
| } | ||
|
|
||
| #[tokio::test] | ||
| async fn empty_offset_index_doesnt_panic_in_column_chunks() { |
There was a problem hiding this comment.
This test aims to replicate the writing and reloading of metadata as is done in the external_metadata example.
This feels like a bug, possibly introduced by the recent metadata changes. An empty offset index and no offset index are semantically different things. @etseidl could you possibly take a look? |
|
Yes, as @jroddev found, at least part of this issue is tracked by #6447. The original reader for the offset index returns an empty vec if offset indexes are requested but are not actually present in the file. There is a test somewhere that actually expects this behavior (I'll search later for that). The async MetadataLoader instead leaves the offset index as None in that case.@alamb and I felt we couldn't reconcile the two until 54.0.0. As to I'll have more time this afternoon to dig into this and look over this PR. Edit: arrow-rs/parquet/src/arrow/arrow_reader/mod.rs Lines 3587 to 3589 in dd5a229 |
| } | ||
|
|
||
| #[tokio::test] | ||
| async fn non_empty_ofset_index_doesnt_panic_in_read_row_group() { |
There was a problem hiding this comment.
| async fn non_empty_ofset_index_doesnt_panic_in_read_row_group() { | |
| async fn non_empty_offset_index_doesnt_panic_in_read_row_group() { |
|
Hit submit too soon...the rub here is that we're using the synchronous metadata reader with the asynchronous file reader (which is a scenario we're trying to support). I think the workaround in this PR is correct, but will become unnecessary once the |
Co-authored-by: Ed Seidl <[email protected]>
|
I took the liberty of pushing a commit to this PR to fix CI / merge it up to I also implemented @etseidl 's suggestion in #6582 (comment) while I had it checked out. |
Which issue does this PR close?
Closes #.
Rationale for this change
If
offset_indexis loaded asSome([])then AsyncReader may panic in a couple of places. These panics do not occur whenoffset_indexisNone. Emptyoffset_indexcan end up as either depending on how its loaded.An empty
offset_indexwill switch fromNonetoSome([])after it is written to disk withParquetMetaDataWriterand then read back into memory withParquetMetaDataReader(as is done in the external_metadata example).I believe I also saw this issue on one parquet file without the roundtrip above, but I don't currently have the file to verify.
The panic can be reproduced in the external_metadata example by changing it to load alltypes_plain.parquet instead of using the one in generated in code. You will also need to remove the call to
prepare_metadataas the filesize assertion does not hold with the new data.Possible related to #6447
arrow-rs/parquet/src/file/metadata/reader.rs
Line 261 in 9485897
Sounds similar to
#4761
Came across the issue at the same time as this one #6581
With this change
page_locationswill now equalNonein both emptyoffset_indexscenarios.When
offset_indexis provided thenpage_locationsis filled as expected: