Draft
Conversation
Goal: Introduce Vec<Vec<Option<...>>> for column & offset indexes and wire it up just enough so everything compiles and tests pass, without changing semantics more than necessary.
Normalize column index semantics: make “no column index for this column chunk” be expressed as None (not Some(NONE)), update consumers, and adjust tests accordingly.
Contributor
Contributor
joe-ucp
added a commit
to joe-ucp/arrow-rs
that referenced
this pull request
Nov 28, 2025
…y restore
Introduce Vec<Vec<Option<...>>> for page indexes and keep semantics stable.
Revert unrelated edits in arrow/array_reader, arrow/decoder, arrow/schema/extension (unused-mut), and parquet-fromcsv.
Restore From<bool> for PageIndexPolicy to pre-PR behavior (true -> Required, false -> Skip).
Parser: do not enforce Required for column index; retain enforcement for offset index only.
Writer: remove conversion helpers and map(|x| Some(...)) loops; use self.metadata.column_index() / offset_index() directly; collapse to None when all chunks are None.
Update docs for ParquetColumnIndex/ParquetOffsetIndex with Option semantics and idiomatic access examples.
Keep on-disk/footer encoding unchanged; preserve ColumnIndexMetaData::NONE behavior.
Call sites updated to handle Option (.as_ref().expect(...) where legacy invariant held; if let Some(...) otherwise).
Breaking change (58.0.0):
Public aliases now include Option at the leaf:
ParquetColumnIndex = Vec<Vec<Option<ColumnIndexMetaData>>>
ParquetOffsetIndex = Vec<Vec<Option<OffsetIndexMetaData>>>
Migration:
if let Some(idx) = indexes[rg][col].as_ref() { /* use idx */ }
Refs: apache#8818 (part), follow-ups apache#8892, apache#8893
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Page index behavior: handle partial / missing indexes without panics
Goal
Ensure Parquet page indexes behave correctly when some or all columns lack indexes by:
Noneat the leaf ofParquetColumnIndexandParquetOffsetIndexas “no index for this column chunk”Which issue does this PR close?
Closes #8818.
Rationale for this change
With the earlier PRs:
Vec<Vec<Option<...>>>NoneHowever, several call sites and higher-level APIs still effectively assumed that page indexes were present for all requested columns once page index reading was enabled. This could lead to:
unwrap()/as_ref().unwrap()panics when a specific column chunk had no indexThis PR focuses on making the runtime behavior robust for these scenarios and documenting it via tests.
What changes are included in this PR?
Safer handling of per-column
Optionpage indexesUpdate page-iterator, selection, and in-memory row-group code paths to:
NoneinParquetOffsetIndex[row_group][column]orParquetColumnIndex[row_group][column]as “no page index for this column chunk”Where
expect(...)is still used, it is reserved for true internal invariants such as:Row selection and read-plan behavior
RowSelectionand related selection helpers are updated to:&[Option<OffsetIndexMetaData>]where appropriateNoneRowSelectionPolicywhile avoiding panics for missing indexesIn-memory row group and async/arrow readers
InMemoryRowGroupnow:Option<&[Option<OffsetIndexMetaData>]>for offset indexesoffset_index[idx]Async and synchronous readers are updated to:
PageIndexPolicybehaviorWith per-column
Optionmodeling,parse_offset_indexnow:Vec<Vec<Option<OffsetIndexMetaData>>>so missing offset indexes are represented asNoneRequiredpoliciesThe practical effect is:
Nonein a defined, non-panicking wayTests for partial and missing page indexes
parquet/tests/arrow_reader/statistics.rs) that cover:Noneat the page index leafNonefor top-level page indexesAre these changes tested?
Yes.