
Conversation

@etseidl (Contributor) commented Aug 15, 2025

Which issue does this PR close?

Note: this targets a feature branch, not main


Rationale for this change

Speed

What changes are included in this PR?

Still a work in progress, but this begins converting the page index parsing to the new thrift decoder.

Are these changes tested?

This PR uses the new decoder when parsing the page indexes via the existing machinery, so all existing tests involving the page indexes exercise this code.

Are there any user-facing changes?

Yes

@github-actions bot added the parquet (Changes to the parquet crate) label Aug 15, 2025
@etseidl (Contributor, Author) commented Aug 15, 2025

Quick comparison of the old and new decoders:

old
open(page index)        time:   [1.7446 ms 1.7524 ms 1.7639 ms]

new
open(page index)        time:   [698.00 µs 699.75 µs 701.66 µs]

@etseidl (Contributor, Author) commented Aug 15, 2025

I will say that the page indexes are pretty darn expensive to parse, and the file used for the benchmark (parquet-testing/data/all_types_tiny_pages.parquet) is pretty pathological. Looking into where the time goes, the offset index is hobbled by the fact that it's defined as an array of structs, which adds considerable overhead to the parsing. The column index is a struct of arrays that parses very quickly, but then must be transformed into an array of structs after decoding, so that takes a good bit of time.

Copying of the min/max statistics for byte arrays takes the majority of that time (note that the test file does not contain the level histograms...those would be very costly as well if present). We could look into rethinking how we represent the column index. Perhaps saving the bytes read and presenting slices rather than copies will work (at least as far as the histograms in the column index...we may be stuck with min/max value copying).
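To illustrate the difference, a simplified sketch of the two layouts (these are not the actual parquet-rs or thrift-generated definitions):

```rust
// Simplified sketch of the two layouts (not the actual parquet-rs types).

// The offset index is an array of structs: the decoder must run the
// field-id dispatch loop once per page, so per-element overhead multiplies.
struct PageLocation {
    offset: i64,
    compressed_page_size: i32,
    first_row_index: i64,
}
struct OffsetIndex {
    page_locations: Vec<PageLocation>, // one struct decode per page
}

// The column index is a struct of arrays: each field is a single packed
// thrift list, which decodes in one tight pass per field.
struct ColumnIndex {
    null_pages: Vec<bool>,
    min_values: Vec<Vec<u8>>, // currently copied out of the raw buffer
    max_values: Vec<Vec<u8>>,
    null_counts: Option<Vec<i64>>,
}
```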

@alamb, not sure how radical you want to go here 😅

@alamb (Contributor) commented Aug 18, 2025

> I will say that the page indexes are pretty darn expensive to parse, and the file used for the benchmark (parquet-testing/data/all_types_tiny_pages.parquet) is pretty pathological. Looking into where the time goes, the offset index is hobbled by the fact that it's defined as an array of structs, which adds considerable overhead to the parsing. The column index is a struct of arrays that parses very quickly, but then must be transformed into an array of structs after decoding, so that takes a good bit of time.

What drives the need to convert to an array of structs? Is that the representation of the ColumnIndex in Rust, or is it something about how the thrift is encoded?

> Copying of the min/max statistics for byte arrays takes the majority of that time (note that the test file does not contain the level histograms...those would be very costly as well if present). We could look into rethinking how we represent the column index. Perhaps saving the bytes read and presenting slices rather than copies will work (at least as far as the histograms in the column index...we may be stuck with min/max value copying).

As you say, perhaps we could keep around a Bytes with the byte statistics in it, and store an offset there (rather than copying into their own structure).

Maybe we could also contemplate some way to defer decoding/copying the structures out until they were requested.
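A minimal sketch of that idea (hypothetical type and field names; `LazyColumnIndex` is not an existing parquet-rs type):

```rust
use bytes::Bytes;
use std::ops::Range;

// Hypothetical zero-copy column index: keep the raw page-index bytes
// alive and record where each min/max value lives instead of copying.
struct LazyColumnIndex {
    data: Bytes,                   // thrift bytes read from the file
    min_ranges: Vec<Range<usize>>, // one range per page into `data`
    max_ranges: Vec<Range<usize>>,
}

impl LazyColumnIndex {
    // Borrow a min value as a slice; nothing is copied.
    fn min_bytes(&self, page: usize) -> &[u8] {
        &self.data[self.min_ranges[page].clone()]
    }

    // Materialize an owned value only on request -- the deferred
    // decoding/copying described above. `Bytes::slice` is a cheap
    // reference-counted view, not a copy.
    fn min_owned(&self, page: usize) -> Bytes {
        self.data.slice(self.min_ranges[page].clone())
    }
}
```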

> @alamb, not sure how radical you want to go here 😅

I have no preconceived ideas. I have personally always found the ColumnIndex representation in Rust (Vec<Vec<Index>> as I recall) quite complicated to work with, so if we have to change that to improve the performance I would be fully in support of it.

@etseidl (Contributor, Author) commented Aug 18, 2025

> What drives the need to convert to array of structs? Is that the representation of the ColumnIndex in Rust or is it something about how the thrift is encoded?

Yes...parquet-rs takes the existing ColumnIndex, which is a struct of arrays, each num_pages in length, and turns that into num_pages PageIndex objects contained in a NativeIndex, which is then encapsulated in an Index enum variant. While we're remodeling we could blow that up, but I think that would have a pretty big ripple effect downstream.
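Roughly what that pivot looks like (simplified; the real PageIndex/NativeIndex types carry more fields):

```rust
// Simplified sketch of the struct-of-arrays -> array-of-structs pivot.
struct PageIndex<T> {
    min: Option<T>,
    max: Option<T>,
    null_count: Option<i64>,
}

fn to_page_indexes(
    null_pages: &[bool],
    min_values: &[Vec<u8>],
    max_values: &[Vec<u8>],
    null_counts: Option<&[i64]>,
) -> Vec<PageIndex<Vec<u8>>> {
    (0..null_pages.len())
        .map(|i| PageIndex {
            // each byte-array min/max is cloned here -- this copying is
            // what dominates the profile for BYTE_ARRAY columns
            min: (!null_pages[i]).then(|| min_values[i].clone()),
            max: (!null_pages[i]).then(|| max_values[i].clone()),
            null_count: null_counts.map(|nc| nc[i]),
        })
        .collect()
}
```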

> As you say, perhaps we could keep around a Bytes with the byte statistics in it, and store an offset there (rather than copying into their own structure).

I'll try playing around with that and see if it helps.

Also, I think this came up before, but only materializing the column index for columns being filtered on rather than for the entire schema would certainly help. Selectively writing them would be useful as well.

@alamb (Contributor) commented Aug 18, 2025

> While we're remodeling we could blow that up, but I think that would have a pretty big ripple effect downstream.

What we could do is create a new structure that is faster to decode, and then transform it to the Index enum variant if requested (aka keep backwards compatibility logic for a while) 🤔
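Hypothetically, something like this (neither type exists in parquet-rs today):

```rust
// Hypothetical sketch: decode into a cheap borrowed form, and only
// build the legacy representation when a caller asks for it.
struct FastColumnIndex<'a> {
    // slices borrowed straight out of the thrift buffer -- no copies
    min_values: Vec<&'a [u8]>,
    max_values: Vec<&'a [u8]>,
}

struct LegacyEntry {
    min: Option<Vec<u8>>,
    max: Option<Vec<u8>>,
}

impl FastColumnIndex<'_> {
    // Backwards-compatibility path: pay the copy cost only on request.
    fn to_legacy(&self) -> Vec<LegacyEntry> {
        self.min_values
            .iter()
            .zip(&self.max_values)
            .map(|(min, max)| LegacyEntry {
                min: Some(min.to_vec()),
                max: Some(max.to_vec()),
            })
            .collect()
    }
}
```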

@alamb (Contributor) commented Aug 18, 2025

> Also, I think this came up before, but only materializing the column index for columns being filtered on rather than for the entire schema would certainly help. Selectively writing them would be useful as well.

Yes, absolutely. Another really useful thing would be not decoding the page index / column index unless it is needed -- for example if we can prune the entire row group just with the row group statistics, we shouldn't even have to bother to decode the page index for that 🤔

}
};

// turn Option<Vec<i64>> into Vec<Option<i64>>
Review comment (Contributor):

😢

thrift_struct!(
pub(crate) struct ColumnIndex<'a> {
1: required list<bool> null_pages
2: required list<'a><binary> min_values
Review comment (Contributor):

This is so sad -- the structure actually starts out column oriented and then we pivot it to rows (just to have to pivot it back to columns to use in DataFusion)

/// Only defined for BYTE_ARRAY columns.
pub unencoded_byte_array_data_bytes: Option<Vec<i64>>,
/// Vector of [`PageLocation`] objects, one per page in the chunk.
1: required list<PageLocation> page_locations
Review comment (Contributor):

I will admit these macros are quite neat and are growing on me
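For readers unfamiliar with the pattern, a minimal sketch of how such a declarative macro can work (this is not the actual thrift_struct! implementation, which also handles lifetimes, optional fields, and more):

```rust
// Sketch only: turn thrift-style field specs into a plain Rust struct,
// keeping the field ids available for the decoder's dispatch loop.
macro_rules! thrift_struct_sketch {
    (
        $vis:vis struct $name:ident {
            $($id:literal : required list<$t:ty> $field:ident)*
        }
    ) => {
        $vis struct $name {
            $(pub $field: Vec<$t>,)*
        }
        impl $name {
            // Field ids preserved for matching during thrift decoding.
            $vis const FIELD_IDS: &'static [i16] = &[$($id),*];
        }
    };
}

thrift_struct_sketch!(
    pub struct ColumnIndexSketch {
        1: required list<bool> null_pages
        3: required list<i64> null_counts
    }
);
```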

@etseidl (Contributor, Author) commented Aug 20, 2025

Time to merge this one. Next steps include further speeding up the offset index decoding and creating a new column index avatar that avoids the costly conversion to an array of structs.

@etseidl merged commit 3c353e2 into apache:gh5854_thrift_remodel Aug 20, 2025
16 checks passed
@etseidl deleted the redo_page_index branch October 10, 2025