Skip to content

Comments

Parquet AsyncReader: Don't panic when empty offset_index is Some([])#6582

Merged
tustvold merged 8 commits intoapache:masterfrom
jroddev:jroddev/offset_index_panic
Oct 30, 2024
Merged

Parquet AsyncReader: Don't panic when empty offset_index is Some([])#6582
tustvold merged 8 commits intoapache:masterfrom
jroddev:jroddev/offset_index_panic

Conversation

@jroddev
Copy link
Contributor

@jroddev jroddev commented Oct 18, 2024

Which issue does this PR close?

Closes #.

Rationale for this change

If offset_index is loaded as Some([]) then AsyncReader may panic in a couple of places. These panics do not occur when offset_index is None. Empty offset_index can end up as either depending on how its loaded.

thread 'arrow::async_reader::tests::my_test' panicked at parquet/src/arrow/async_reader/mod.rs:832:34:
index out of bounds: the len is 0 but the index is 0
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

An empty offset_index will switch from None to Some([]) after it is written to disk with ParquetMetaDataWriter and then read back into memory with ParquetMetaDataReader (as is done in the external_metadata example).

I believe I also saw this issue on one parquet file without the roundtrip above, but I don't currently have the file to verify.

The panic can be reproduced in the external_metadata example by changing it to load alltypes_plain.parquet instead of using the one in generated in code. You will also need to remove the call to prepare_metadata as the filesize assertion does not hold with the new data.

    - let parquet_path = create_parquet_file(&tempdir);
    + let testdata = arrow::util::test_util::parquet_test_data();
    + let parquet_path = format!("{testdata}/alltypes_plain.parquet");
    ...
    - let metadata = prepare_metadata(metadata);

Possible related to #6447

// FIXME: there are differing implementations in the case where page indexes are missing

Sounds similar to
#4761

Came across the issue at the same time as this one #6581

With this change page_locations will now equal None in both empty offset_index scenarios.
When offset_index is provided then page_locations is filled as expected:

page_locations: Some(
    [
        PageLocation {
            offset: 4,
            compressed_page_size: 109,
            first_row_index: 0,
        },
        PageLocation {
            offset: 113,
            compressed_page_size: 109,
            first_row_index: 21,
        },
       ...
   ])

@github-actions github-actions bot added the parquet Changes to the parquet crate label Oct 18, 2024
let offset_index = self
.metadata
.offset_index()
.filter(|index| !index.is_empty())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This panic happens when manually setting metadata.set_offset_index(Some(vec![])); as in the test empty_ofset_index_doesnt_panic_in_read_row_group

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems the synchronous file reader does a similar filtering step, so this is probably a good addition. It may not have been an issue in the past since MetadataReader will return None rather than Some([]) for missing indexes.

Some(data) => {
let page_locations = self
.offset_index
.filter(|index| !index.is_empty())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This panic happens when metadata is saved to a file and reloaded
such as in the test empty_offset_index_doesnt_panic_in_column_chunks or in the example external_metadata

}

#[tokio::test]
async fn empty_ofset_index_doesnt_panic_in_read_row_group() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests may not be well named. Feel free to suggest better ones

.unwrap();

let result = reader.try_collect::<Vec<_>>().await.unwrap();
assert_eq!(result.len(), 8);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if there's a more concise+focused way to write these tests.
Ideally this assertion checks the page_locations are like this

page_locations: Some(
    [
        PageLocation {
            offset: 4,
            compressed_page_size: 109,
            first_row_index: 0,
        },
        PageLocation {
            offset: 113,
            compressed_page_size: 109,
            first_row_index: 21,
        },
       ...
   ])

and the other 2 tests assert that page_locations is Emtpy.

}

#[tokio::test]
async fn empty_offset_index_doesnt_panic_in_column_chunks() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test aims to replicate the writing and reloading of metadata as is done in the external_metadata example.

@tustvold
Copy link
Contributor

tustvold commented Oct 18, 2024

Empty offset_index can end up as either depending on how its loaded.

This feels like a bug, possibly introduced by the recent metadata changes. An empty offset index and no offset index are semantically different things.

@etseidl could you possibly take a look?

@etseidl
Copy link
Contributor

etseidl commented Oct 18, 2024

Yes, as @jroddev found, at least part of this issue is tracked by #6447. The original reader for the offset index returns an empty vec if offset indexes are requested but are not actually present in the file.

None => return Ok(vec![]),

There is a test somewhere that actually expects this behavior (I'll search later for that). The async MetadataLoader instead leaves the offset index as None in that case.

@alamb and I felt we couldn't reconcile the two until 54.0.0.

As to ParquetMetadataWriter, I'm honestly not sure what happened in the past when the offset index was Some([]), so I'll do some digging there. It's possible there was a behavior change there.

I'll have more time this afternoon to dig into this and look over this PR.

Edit:
Found the relevant test:

// FIXME: this test will fail when metadata parsing returns `None` for missing page
// indexes. https://github.com/apache/arrow-rs/issues/6447
assert!(builder.metadata().offset_index().unwrap()[0].is_empty());

}

#[tokio::test]
async fn non_empty_ofset_index_doesnt_panic_in_read_row_group() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
async fn non_empty_ofset_index_doesnt_panic_in_read_row_group() {
async fn non_empty_offset_index_doesnt_panic_in_read_row_group() {

@etseidl
Copy link
Contributor

etseidl commented Oct 18, 2024

Hit submit too soon...the rub here is that we're using the synchronous metadata reader with the asynchronous file reader (which is a scenario we're trying to support). I think the workaround in this PR is correct, but will become unnecessary once the ParquetMetaDataReader properly returns None rather than Some([]) when page indexes are not present. But it's likely good to have the extra sanity checks this PR provides.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jroddev and @etseidl for the review.

This looks good to me

@alamb
Copy link
Contributor

alamb commented Oct 21, 2024

I took the liberty of pushing a commit to this PR to fix CI / merge it up to master

I also implemented @etseidl 's suggestion in #6582 (comment) while I had it checked out.

@tustvold tustvold merged commit 56d4713 into apache:master Oct 30, 2024
@jroddev jroddev deleted the jroddev/offset_index_panic branch November 4, 2024 09:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants