
Conversation

@etseidl (Contributor) commented Aug 15, 2025

Which issue does this PR close?

Note: this targets a feature branch, not main


Rationale for this change

Speed

What changes are included in this PR?

Still a work in progress, but this begins converting the page index parsing to the new thrift decoder.

Are these changes tested?

This PR uses the new decoder when parsing the page indexes via the existing machinery, so all existing tests involving the page indexes exercise this code.

Are there any user-facing changes?

Yes

@github-actions bot added the parquet (Changes to the parquet crate) label Aug 15, 2025
@etseidl (Contributor, Author) commented Aug 15, 2025

Quick comparison of the old and new decoders:

old
open(page index)        time:   [1.7446 ms 1.7524 ms 1.7639 ms]

new
open(page index)        time:   [698.00 µs 699.75 µs 701.66 µs]

@etseidl (Contributor, Author) commented Aug 15, 2025

I will say that the page indexes are pretty darn expensive to parse, and the file used for the benchmark (parquet-testing/data/all_types_tiny_pages.parquet) is pretty pathological. Looking into where the time goes, the offset index is hobbled by the fact that it's defined as an array of structs, which adds considerable overhead to the parsing. The column index is a struct of arrays that parses very quickly, but then must be transformed into an array of structs after decoding, so that takes a good bit of time.

Copying of the min/max statistics for byte arrays takes the majority of that time (note that the test file does not contain the level histograms...those would be very costly as well if present). We could look into rethinking how we represent the column index. Perhaps saving the bytes read and presenting slices rather than copies will work (at least as far as the histograms in the column index...we may be stuck with min/max value copying).
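To illustrate the difference, a simplified sketch of the two layouts (these are not the actual parquet-rs or thrift-generated definitions):

```rust
// Simplified sketch of the two layouts (not the actual parquet-rs types).

// The offset index is an array of structs: the decoder must run the
// field-id dispatch loop once per page, so per-element overhead multiplies.
struct PageLocation {
    offset: i64,
    compressed_page_size: i32,
    first_row_index: i64,
}
struct OffsetIndex {
    page_locations: Vec<PageLocation>, // one struct decode per page
}

// The column index is a struct of arrays: each field is a single packed
// thrift list, which decodes in one tight pass per field.
struct ColumnIndex {
    null_pages: Vec<bool>,
    min_values: Vec<Vec<u8>>, // currently copied out of the raw buffer
    max_values: Vec<Vec<u8>>,
    null_counts: Option<Vec<i64>>,
}
```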

@alamb, not sure how radical you want to go here 😅

@alamb (Contributor) commented Aug 18, 2025

> I will say that the page indexes are pretty darn expensive to parse, and the file used for the benchmark (parquet-testing/data/all_types_tiny_pages.parquet) is pretty pathological. Looking into where the time goes, the offset index is hobbled by the fact that it's defined as an array of structs, which adds considerable overhead to the parsing. The column index is a struct of arrays that parses very quickly, but then must be transformed into an array of structs after decoding, so that takes a good bit of time.

What drives the need to convert to an array of structs? Is that the representation of the ColumnIndex in Rust, or is it something about how the thrift is encoded?

> Copying of the min/max statistics for byte arrays takes the majority of that time (note that the test file does not contain the level histograms...those would be very costly as well if present). We could look into rethinking how we represent the column index. Perhaps saving the bytes read and presenting slices rather than copies will work (at least as far as the histograms in the column index...we may be stuck with min/max value copying).

As you say, perhaps we could keep around a Bytes with the byte statistics in it, and store an offset there (rather than copying into their own structure).

Maybe we could also contemplate some way to defer decoding/copying the structures out until they were requested.
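A minimal sketch of that idea (hypothetical type and field names; `LazyColumnIndex` is not an existing parquet-rs type):

```rust
use bytes::Bytes;
use std::ops::Range;

// Hypothetical zero-copy column index: keep the raw page-index bytes
// alive and record where each min/max value lives instead of copying.
struct LazyColumnIndex {
    data: Bytes,                   // thrift bytes read from the file
    min_ranges: Vec<Range<usize>>, // one range per page into `data`
    max_ranges: Vec<Range<usize>>,
}

impl LazyColumnIndex {
    // Borrow a min value as a slice; nothing is copied.
    fn min_bytes(&self, page: usize) -> &[u8] {
        &self.data[self.min_ranges[page].clone()]
    }

    // Materialize an owned value only on request -- the deferred
    // decoding/copying described above. `Bytes::slice` is a cheap
    // reference-counted view, not a copy.
    fn min_owned(&self, page: usize) -> Bytes {
        self.data.slice(self.min_ranges[page].clone())
    }
}
```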

> @alamb, not sure how radical you want to go here 😅

I have no preconceived ideas. I have personally always found the ColumnIndex representation in Rust (Vec<Vec<Index>> as I recall) quite complicated to work with, so if we have to change that to improve the performance I would be fully in support of it.

@etseidl (Contributor, Author) commented Aug 18, 2025

> What drives the need to convert to array of structs? Is that the representation of the ColumnIndex in Rust or is it something about how the thrift is encoded?

Yes...parquet-rs takes the existing ColumnIndex, which is a struct of arrays, each num_pages in length, and turns that into num_pages PageIndex objects contained in a NativeIndex, which is then encapsulated in an Index enum variant. While we're remodeling we could blow that up, but I think that would have a pretty big ripple effect downstream.
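Roughly what that pivot looks like (simplified; the real PageIndex/NativeIndex types carry more fields):

```rust
// Simplified sketch of the struct-of-arrays -> array-of-structs pivot.
struct PageIndex<T> {
    min: Option<T>,
    max: Option<T>,
    null_count: Option<i64>,
}

fn to_page_indexes(
    null_pages: &[bool],
    min_values: &[Vec<u8>],
    max_values: &[Vec<u8>],
    null_counts: Option<&[i64]>,
) -> Vec<PageIndex<Vec<u8>>> {
    (0..null_pages.len())
        .map(|i| PageIndex {
            // each byte-array min/max is cloned here -- this copying is
            // what dominates the profile for BYTE_ARRAY columns
            min: (!null_pages[i]).then(|| min_values[i].clone()),
            max: (!null_pages[i]).then(|| max_values[i].clone()),
            null_count: null_counts.map(|nc| nc[i]),
        })
        .collect()
}
```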

> As you say, perhaps we could keep around a Bytes with the byte statistics in it, and store an offset there (rather than copying into their own structure).

I'll try playing around with that and see if it helps.

Also, I think this came up before, but only materializing the column index for columns being filtered on rather than for the entire schema would certainly help. Selectively writing them would be useful as well.

@alamb (Contributor) commented Aug 18, 2025

> While we're remodeling we could blow that up, but I think that would have a pretty big ripple effect downstream.

What we could do is create a new structure that is faster to decode, and then transform it to the Index enum variant if requested (aka keep backwards compatibility logic for a while) 🤔
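Hypothetically, something like this (neither type exists in parquet-rs today):

```rust
// Hypothetical sketch: decode into a cheap borrowed form, and only
// build the legacy representation when a caller asks for it.
struct FastColumnIndex<'a> {
    // slices borrowed straight out of the thrift buffer -- no copies
    min_values: Vec<&'a [u8]>,
    max_values: Vec<&'a [u8]>,
}

struct LegacyEntry {
    min: Option<Vec<u8>>,
    max: Option<Vec<u8>>,
}

impl FastColumnIndex<'_> {
    // Backwards-compatibility path: pay the copy cost only on request.
    fn to_legacy(&self) -> Vec<LegacyEntry> {
        self.min_values
            .iter()
            .zip(&self.max_values)
            .map(|(min, max)| LegacyEntry {
                min: Some(min.to_vec()),
                max: Some(max.to_vec()),
            })
            .collect()
    }
}
```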

@alamb (Contributor) commented Aug 18, 2025

> Also, I think this came up before, but only materializing the column index for columns being filtered on rather than for the entire schema would certainly help. Selectively writing them would be useful as well.

Yes, absolutely. Another really useful thing would be not decoding the page index / column index unless it is needed -- for example if we can prune the entire row group just with the row group statistics, we shouldn't even have to bother to decode the page index for that 🤔

}
};

// turn Option<Vec<i64>> into Vec<Option<i64>>
Review comment (Contributor):

😢

thrift_struct!(
pub(crate) struct ColumnIndex<'a> {
1: required list<bool> null_pages
2: required list<'a><binary> min_values
Review comment (Contributor):

This is so sad -- the structure actually starts out column oriented and then we pivot it to rows (just to have to pivot it back to columns to use in DataFusion)

/// Only defined for BYTE_ARRAY columns.
pub unencoded_byte_array_data_bytes: Option<Vec<i64>>,
/// Vector of [`PageLocation`] objects, one per page in the chunk.
1: required list<PageLocation> page_locations
Review comment (Contributor):

I will admit these macros are quite neat and are growing on me
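For readers unfamiliar with the pattern, a minimal sketch of how such a declarative macro can work (this is not the actual thrift_struct! implementation, which also handles lifetimes, optional fields, and more):

```rust
// Sketch only: turn thrift-style field specs into a plain Rust struct,
// keeping the field ids available for the decoder's dispatch loop.
macro_rules! thrift_struct_sketch {
    (
        $vis:vis struct $name:ident {
            $($id:literal : required list<$t:ty> $field:ident)*
        }
    ) => {
        $vis struct $name {
            $(pub $field: Vec<$t>,)*
        }
        impl $name {
            // Field ids preserved for matching during thrift decoding.
            $vis const FIELD_IDS: &'static [i16] = &[$($id),*];
        }
    };
}

thrift_struct_sketch!(
    pub struct ColumnIndexSketch {
        1: required list<bool> null_pages
        3: required list<i64> null_counts
    }
);
```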

@etseidl (Contributor, Author) commented Aug 20, 2025

Time to merge this one. Next steps include further speeding up the offset index decoding and creating a new column index avatar that avoids the costly conversion to an array of structs.

@etseidl merged commit 3c353e2 into apache:gh5854_thrift_remodel Aug 20, 2025
16 checks passed
@etseidl deleted the redo_page_index branch October 10, 2025