[Parquet] Refactor InMemoryRowGroup to separate CPU and IO by alamb · Pull Request #8022 · apache/arrow-rs

alamb · 2025-07-29T22:04:45Z

Which issue does this PR close?

Part of Decouple IO and CPU operations in the Parquet Reader (push decoder) #7983

Rationale for this change

Right now the InMemoryRowGroup does both the IO and the CPU work to on a call to fetch

in #7997 I am trying to separate the two, so that we can have more control over
when IO happens and when CPU work happens. I figured I could refactor the
InMemoryRowGroup as a separate PR to make it easier to review and convince myself this refactor is correct.

What changes are included in this PR?

Refactor InMemoryRowGroup::fetch into three functions

Are these changes tested?

By existing CI

If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?

Are there any user-facing changes?

No

alamb · 2025-07-29T22:05:06Z

parquet/src/arrow/async_reader/mod.rs

I just broke up fetch into three functions

# Which issue does this PR close? - Part of apache#8000 - closes apache#7983 # Rationale for this change This PR is the first part of separating IO and decode operations in the rust parquet decoder. Decoupling IO and CPU enables several important usecases: 1. Different IO patterns (e.g. not buffer the entire row group at once) 2. Different IO APIs e.g. use io_uring, or OpenDAL, etc. 3. Deliberate prefetching within a file 4. Avoid code duplication between the `ParquetRecordBatchStreamBuilder` and `ParquetRecordBatchReaderBuilder` # What changes are included in this PR? 1. Add new `ParquetDecoderBuilder`, and `ParquetDecoder` and tests It is effectively an explicit version of the state machine that is used in existing async reader (where the state machine is encoded as Rust `async` / `await` structures) # Are these changes tested? Yes -- there are extensive tests for the new code Note that this PR actually adds a **3rd** path for control flow (when I claim this will remove duplication!) In follow on PRs I will convert the existing readers to use this new pattern, similarly to the sequence I did for the metadata decoder: - apache#8080 - apache#8340 Here is a preview of a PR that consolidates the async reader to use the push decoder internally (and removes one duplicate): - apache#8159 - closes apache#8022 # Are there any user-facing changes? Yes, a new API, but now changes to the existing APIs --------- Co-authored-by: Matthijs Brobbel <[email protected]> Co-authored-by: Adrian Garcia Badaracco <[email protected]>

github-actions bot added the parquet Changes to the parquet crate label Jul 29, 2025

alamb commented Jul 29, 2025

View reviewed changes

parquet/src/arrow/async_reader/mod.rs Outdated

Copy link

Contributor Author

alamb Jul 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just broke up fetch into three functions

[Parquet] Refactor InMemoryRowGroup to separate CPU and IO

c881d34

alamb force-pushed the alamb/split_in_memory_reader branch from 671f698 to c881d34 Compare July 30, 2025 12:22

alamb mentioned this pull request Jul 31, 2025

Implement Push Parquet Decoder #7997

Merged

alamb marked this pull request as draft August 8, 2025 19:16

alamb closed this in #7997 Oct 29, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

[Parquet] Refactor InMemoryRowGroup to separate CPU and IO#8022

[Parquet] Refactor InMemoryRowGroup to separate CPU and IO#8022
alamb wants to merge 1 commit intoapache:mainfrom
alamb:alamb/split_in_memory_reader

alamb commented Jul 29, 2025 •

edited

Loading

Uh oh!

alamb Jul 29, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Comments

Conversation

alamb commented Jul 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

alamb Jul 29, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

alamb commented Jul 29, 2025 •

edited

Loading