
GH-39392: [C++][Parquet] Support page pruning#39393

Open
huberylee wants to merge 3 commits into apache:main from huberylee:support-page-pruning

Conversation

@huberylee
Contributor

@huberylee huberylee commented Dec 29, 2023

Rationale for this change

FileReader supports reading data based on specified RowRanges, providing the most fundamental filter-pushdown capability to various upper-level computing engines. For more details, please refer to #39392.
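As a rough illustration of the concept only (this sketch is not the PR's actual RowRanges class; the type and member names here are invented), a row-range selection can be modeled as an ordered list of disjoint, inclusive [from, to] row intervals:

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Minimal stand-in for the RowRanges concept (illustrative only): an ordered
// list of disjoint inclusive [from, to] row intervals the reader should keep.
struct RowRangesModel {
  std::vector<std::pair<int64_t, int64_t>> ranges;

  // Total number of selected rows across all intervals.
  int64_t RowCount() const {
    int64_t n = 0;
    for (const auto& [from, to] : ranges) n += to - from + 1;
    return n;
  }

  // Whether a given file-wide row index is selected.
  bool Contains(int64_t row) const {
    for (const auto& [from, to] : ranges) {
      if (row >= from && row <= to) return true;
    }
    return false;
  }
};
```

A reader given such a selection only needs to materialize pages that overlap at least one interval, which is what enables the page pruning discussed below.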

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@huberylee huberylee requested a review from wgtmac as a code owner December 29, 2023 18:51
@github-actions

⚠️ GitHub issue #39392 has been automatically assigned in GitHub to PR creator.

@huberylee
Contributor Author

@mapleFU Would you mind taking a look? This merge request fully implements what I commented on in the document https://docs.google.com/document/d/1SeVcYudu6uD9rb9zRAnlLGgdauutaNZlAaS0gVzjkgM

@mapleFU
Member

mapleFU commented Jan 2, 2024

This patch is extremely huge... At first glance it satisfies the interface, but I need time to review it.

/// returned by the Read interface is processed completely before calling the
/// Read interface again. Otherwise, fatal errors may occur due to data in the
/// Buffer being overwritten.
class ARROW_EXPORT ChunkBufferedInputStream : public InputStream {
Member

What's the exact difference between this and CachedFileInputStream? Can we just implement this on top of it?

Contributor Author

Do you mean ReadRangeCache or BufferedInputStream?

Member

Yeah, I mean ReadRangeCache. IMO RangeCache would benefit from the existing code and the Wait interface for async reads.

Contributor Author

If we use ReadRangeCache, we need to differentiate between scenarios where data is read with RowRange and scenarios where data is read without RowRange.

Member

IMO, this seems to replace ::arrow::io::BufferedInputStream. There are two possible cases:

  1. pre_buffer = true: this re-encapsulates an I/O-coalescing layer above ::arrow::io::BufferedInputStream.
  2. pre_buffer = false: it just does single-column I/O.

Member

Hmm, in the current code, are buffered_stream_enabled_ and pre_buffer related? Maybe I forgot this 🤔

Contributor

Hmm, in the current code, are buffered_stream_enabled_ and pre_buffer related? Maybe I forgot this 🤔

Sort of... IIRC the properties_.GetStream() won't be invoked for prebuffered column chunks.

Member

Oh I forgot this:

    if (cached_source_ && prebuffered_column_chunks_bitmap_ != nullptr &&
        ::arrow::bit_util::GetBit(prebuffered_column_chunks_bitmap_->data(), i)) {
      // PARQUET-1698: if read coalescing is enabled, read from pre-buffered
      // segments.
      PARQUET_ASSIGN_OR_THROW(auto buffer, cached_source_->Read(col_range));
      stream = std::make_shared<::arrow::io::BufferReader>(buffer);
    } else {
      stream = properties_.GetStream(source_, col_range.offset, col_range.length);
    }

Thanks for pointing it out!

Member

When pre_buffer = true, ReadRangeCache will be used to perform I/O coalescing between columns; only when pre_buffer = false and buffered_stream_enabled_ = true will BufferedInputStream be used to do single-column I/O. ChunkBufferedInputStream does the same thing as BufferedInputStream, while also providing simple I/O merging capabilities and shielding the upper layer from the discontinuity of the I/O.

Do you have a plan to implement the prebuffer feature when ReadRanges are provided? It seems that we are still incapable of doing this, which I think is worth doing.

Contributor Author

When pre_buffer = true, ReadRangeCache will be used to perform I/O coalescing between columns; only when pre_buffer = false and buffered_stream_enabled_ = true will BufferedInputStream be used to do single-column I/O. ChunkBufferedInputStream does the same thing as BufferedInputStream, while also providing simple I/O merging capabilities and shielding the upper layer from the discontinuity of the I/O.

Do you have a plan to implement the prebuffer feature when ReadRanges are provided? It seems that we are still incapable of doing this, which I think is worth doing.

ReadRangeCache seems to meet the capabilities you mentioned, right?

return Status::OK();
}

read_gaps_.clear();
Member

So this is triggered once?

Contributor Author

It will be triggered every time before actual IO occurs.

Member

@mapleFU mapleFU Jan 2, 2024

Oh, I get it: it maintains a current buffer and does a range read when NextPage is called... and Peek / Read will be triggered in this case.

// set byte_width to the length of "fixed": 10
// todo: find a way to generate test data with more diversity.
BuilderType builder(::arrow::fixed_size_binary(5));
const int byte_width = 10;
Member

Hmm, what's the purpose here?

Contributor Author

Hmm, what's the purpose here?

Just to be consistent with NullableArray for fixed_size_binary.

// The num of rows to skip before reading each row range
std::vector<int64_t> skip_row_nums_;

// The last row index for echo row range, start counting from within the page
Member

What does "echo row range" mean, is it "each row range"?

Contributor Author

What does "echo row range" mean, is it "each row range"?

Yes.

Comment on lines +1009 to +1010
// Skip info for current page
PageSkipInfo* skip_info_{nullptr};
Member

If the page is v1 without a page index, what would this be?

Contributor Author

If the page is v1 without a page index, what would this be?

In our internal system, this cannot happen. It may indeed happen here, but the way to deal with it is very simple: we only need to fall back to the original logic when invoking NextChunk.

Member

IMO a fallback or a check is important. Some cases are likely to read external tables or disable column statistics in some wide tables.

Contributor Author

In the current implementation, we only check it in GetColumnPageReader under debug mode. A fallback and more checks will be added later.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jan 2, 2024
if (read_states != nullptr) {
PageSkipInfo& skip_info = read_states->NextPageSkipInfo();
const int64_t records_to_skip = skip_info.SkipRowNum();
const int64_t skipped_records = SkipRecords(records_to_skip);
Member

IMO, doing SkipRecords in HasNextInternal would be a bit tricky...

Contributor Author

IMO, doing SkipRecords in HasNextInternal would be a bit tricky...

Maybe we can implement the row skipping logic in ReadRecords.

Contributor

Do we need to document clearly how (or if?) the data page filter and the page pruning can work with each other?

Member

@mapleFU mapleFU Jan 3, 2024

Is data_page_filter used in this patch? And how would a page be skipped here?

Contributor

@jp0317 jp0317 Jan 3, 2024

ReadNextPage may use the data page filter. Assuming users set both the filter and row ranges, would SkipRecords(records_to_skip) skip the wrong records?

Member

Seems multiple mechanisms are applied here...

Member

@jp0317 Should we combine ShouldSkipPage with PageIndex? Our current codebase relies on something weird here. 🤔

Contributor Author

Do we need to document clearly how (or if?) the data page filter and the page pruning can work with each other?

data_page_filter is used to filter out non-hit pages. In the current implementation, SkipRecords is used to skip unnecessary rows within the hit pages, so they do not conflict with each other. Using it together with data_page_filter is not currently supported, but such scenarios can be supported with very minor modifications to NextPage.

Contributor

@jp0317 Should we combine ShouldSkipPage with PageIndex? Our current codebase relies on something weird here. 🤔

It seems ShouldSkipPage targets page header statistics... IIUC we might want two versions of page skipping (rather than combining them into one): one based on the PageIndex before actual I/O on that page, and one based on the page header after the I/O is done.

I think we'd need clearer docs on those skip-related APIs, as currently they seem to be intended to be used in a certain way that is not stated in the doc.

return bytes_for_values;
}

// Two parts different from original HasNextInternal:
Member

What would a SkipRecords(k) do? During SkipRecords, HasNext might also be called; would a hidden HasNextInternal() skip unexpected rows?

Contributor Author

@huberylee huberylee Jan 3, 2024

  • Within a page, the hit rows may not be contiguous. For a page that has already been hit, any non-hit rows within it need to be skipped using SkipRecords. It may be clearer to look at this logic in conjunction with the code comments in column_reader.h:164~173:
//
//   | <--------------------------- column chunk -------------------------------> |
//             | <-------------------- page N -----------------------> |
//        first_row_idx                                          last_row_idx
//   |-- ... --|-------------------------------------------------------|--- ... ---|
//                       |---- range0 ----|         |---- range1 ----|
//             |--skip0--|                |--skip1--|
//             |------last_row_index0-----|
//             |-------------------last_row_index1-------------------|
//
  • During the execution of the SkipRecords method, it does indeed call the HasNextInternal method. The current implementation and testing have covered such scenarios, ensuring that no errors occur.
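The skip0/skip1 gaps in the diagram can be sketched with a small helper (the function name and shapes here are illustrative, not this PR's API): given the page's first row index and the sorted, disjoint hit ranges inside it (inclusive [start, end] pairs in file-wide row coordinates), the number of rows to skip before reading each range is:

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper mirroring the diagram: skips[i] is the number of rows
// to skip before reading ranges[i], counting from the previous cursor position.
std::vector<int64_t> SkipRowNums(
    int64_t first_row_idx,
    const std::vector<std::pair<int64_t, int64_t>>& ranges) {
  std::vector<int64_t> skips;
  int64_t cursor = first_row_idx;  // next unread row within the page
  for (const auto& [start, end] : ranges) {
    skips.push_back(start - cursor);  // skip0, skip1, ... in the diagram
    cursor = end + 1;                 // reading the range consumes rows up to `end`
  }
  return skips;
}
```

For example, a page starting at row 100 with hit ranges [110, 119] and [130, 139] yields skips of 10 rows before each range.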

Member

@mapleFU mapleFU Jan 3, 2024

Yeah, I mean it limits the usage of TypedColumnReader and only allows internal skips. An external skip would introduce inconsistent skipping: in the current case, Skip(skip_last) would skip more than skip_last.

Member

Oh, I understand this now. SkipRecords() now means "try to skip the existing records and skip any remaining data"...

Contributor Author

Yeah, I mean it limits the usage of TypedColumnReader and only allows internal skips. An external skip would introduce inconsistent skipping: in the current case, Skip(skip_last) would skip more than skip_last.

First of all, it is strange to execute SkipRecords on the basis of hit rows. Secondly, whether it is consistent depends on how you understand the semantics of SkipRecords. If some rows are skipped on the basis of hit rows, the current implementation can theoretically guarantee consistency, although more tests need to be added for verification; if SkipRecords applies to all rows in the page, then the existing implementation will indeed have problems.

Member

First of all, it is strange to execute SkipRecords on the basis of hit rows.

I think if a Parquet user is doing a dense read with a selection vector, or something like incremental filtering, it's possible to do this. Since SkipRecords is exported, and it's used when skipping the records within a page, I think the interface is a bit weird 🤔

Contributor Author

First of all, it is strange to execute SkipRecords on the basis of hit rows.

I think if a Parquet user is doing a dense read with a selection vector, or something like incremental filtering, it's possible to do this. Since SkipRecords is exported, and it's used when skipping the records within a page, I think the interface is a bit weird 🤔

I get your point. Maybe this is just a code implementation problem; we can perform the row-skipping operation in ReadRecords instead. What do you think?

@emkornfield
Contributor

I agree with @mapleFU that this PR is very large. I'd recommend at least separating the components added to Arrow from the Parquet changes (and possibly splitting out the metrics class, as these seem like core orthogonal concepts that should likely get a closer review).

Contributor

This probably belongs in util?

Contributor

it also doesn't seem like it has tests associated with it?

Contributor Author

it also doesn't seem like it has tests associated with it?

These stats are not relevant to this MR and I will remove them.

return ss.str();
}

static constexpr Range EMPTY_RANGE{std::numeric_limits<int64_t>::max(),
Contributor

Shouldn't this be something like {0, 0}? As it stands now it is an invalid range. Also, I think kEmptyRange is preferred in new code.

Contributor Author

Shouldn't this be something like {0, 0}? As it stands now it is an invalid range.

EMPTY_RANGE represents InvalidRange or EmptyRange in different scenarios. {0, 0} means hitting the first row, so we can't use it to mean an empty range. kEmptyRange is more appropriate; I will modify it later.
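A sketch of the point being made (the type shape and names here are illustrative, not the final code): with inclusive [from, to] semantics, {0, 0} is a legitimate one-row range hitting row 0, so an inverted pair of extreme values can serve as the empty/invalid sentinel instead:

```cpp
#include <cstdint>
#include <limits>

// Illustrative Range with inclusive [from, to] row indices. An inverted pair
// (from > to) can never be a valid selection, which makes it a safe sentinel.
struct Range {
  int64_t from;
  int64_t to;
  bool IsEmpty() const { return from > to; }
};

// The sentinel: guaranteed inverted, so it never collides with a real range
// such as {0, 0} (which selects exactly the first row).
static constexpr Range kEmptyRange{std::numeric_limits<int64_t>::max(),
                                   std::numeric_limits<int64_t>::min()};
```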

// specific language governing permissions and limitations
// under the License.

#pragma once

Member

@wgtmac wgtmac Jan 7, 2024

@huberylee Do you mind taking a look at this patch: #38867? It implements a similar feature but uses a different approach discussed in the doc, without optimizing I/O; it would be good to consolidate that effort with yours.

Contributor Author

Overall this doesn't appear to be updated to reflect the interfaces discussed in https://docs.google.com/document/d/1SeVcYudu6uD9rb9zRAnlLGgdauutaNZlAaS0gVzjkgM/edit ?

I had already started working on it during the documentation discussion, so the implementation in this MR will be a little different from the documentation.

Contributor Author

@huberylee Do you mind taking a look at this patch: #38867? It implements a similar feature but uses a different approach discussed in the doc, without optimizing I/O; it would be good to consolidate that effort with yours.

I'd be pleased to. I will look at patch #38867 later.

@emkornfield
Contributor

I commented on specific sections, but looking over the PR it seems quite different from what was discussed in https://docs.google.com/document/d/1SeVcYudu6uD9rb9zRAnlLGgdauutaNZlAaS0gVzjkgM/edit. I'm not sure I understand the cause of the discrepancies.

@wgtmac
Member

wgtmac commented Jan 6, 2024

Sorry for not chiming in sooner. I agree that it would be good to break down this PR into smaller ones. Otherwise, it would be challenging to get it properly reviewed.

To provide some context, this doc was originally inspired by offline discussion from @mapleFU @baibaichen @binmahone and me and further reviewed by @emkornfield @fatemehp. We have somewhat reached a consensus on the API and need more discussion on concrete PRs to refine the implementation.

@huberylee I have seen your detailed comments and replied to some of them in that doc. It would be good to go through the doc before heading into the implementation; otherwise it might waste the time of you and the people involved in the original discussion.

I will take a look at this PR over the weekend and see if we can be on the same page quickly.

Member

@wgtmac wgtmac left a comment

I haven't looked at the change on the reader side yet, will take another pass this weekend.

/// distance between them is less than io_merge_threshold, and the actual size
/// in one io request will be limited by the buffer_size.
///
/// \attention It is important to note that since the data returned by the Read
Member

This breaks the contract of the InputStream interface.


// At the beginning of a new read range and required bytes is more than
// read range length, we think it's ok because there are some tentative
// read requests when getting next page data.
Member

Is this for checking the record boundary of a repeated column?

Contributor Author

Is this for checking the record boundary of a repeated column?

No. In SerializedPageReader::NextPage, it tries to read 16 KB, 32 KB, 64 KB, ..., until the data read contains the whole page header. In an actual scenario, the 16 KB read in the first attempt may already exceed the size of the entire page.
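The doubling behavior described above can be sketched as follows (a simplified model, not the actual SerializedPageReader code; the 16 KB starting size follows the comment above): the reader does not know the page header length up front, so it keeps doubling its tentative read size until the bytes read can contain the header.

```cpp
#include <cstdint>
#include <vector>

// Returns the sequence of tentative read sizes attempted until `header_size`
// bytes fit, modeling the 16 KB -> 32 KB -> 64 KB -> ... retry loop.
std::vector<int64_t> TentativeReadSizes(int64_t header_size) {
  std::vector<int64_t> attempts;
  int64_t size = 16 * 1024;  // first tentative read: 16 KB
  while (true) {
    attempts.push_back(size);
    if (size >= header_size) break;  // header now fits in the buffered data
    size *= 2;                       // double and retry
  }
  return attempts;
}
```

This also illustrates the concern in the quoted comment: the very first 16 KB attempt may already be larger than the page itself, so a tentative request exceeding the current read range length is expected.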

return Status::OK();
}

if (read_ranges_idx_ != read_ranges_.size()) {
Member

Why not check this before line 230?

Contributor Author

Why not check this before line 230?

We cannot check this before line 230 because there may be more ranges to read.


namespace arrow::io {

struct IoMetrics {
Metric io_time;
Member

We may need extra metrics, like the number of actual I/Os and others, for effectiveness evaluation.
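A sketch of the kind of extension suggested above (all field and type names here are assumptions, not the PR's IoMetrics): alongside the time metric, counting the number of actual I/O requests and the bytes fetched versus requested makes it possible to evaluate how effective the range merging is.

```cpp
#include <cstdint>

// Hypothetical metrics struct for evaluating I/O coalescing effectiveness.
struct IoMetricsSketch {
  int64_t io_time_ns = 0;       // total time spent in reads
  int64_t io_count = 0;         // number of actual I/O requests issued
  int64_t bytes_read = 0;       // bytes fetched, including merge padding
  int64_t bytes_requested = 0;  // bytes the reader actually asked for

  // Ratio > 1.0 indicates over-read introduced by I/O merging.
  double ReadAmplification() const {
    return bytes_requested == 0
               ? 0.0
               : static_cast<double>(bytes_read) / bytes_requested;
  }
};
```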

CheckReadPosition(start, num_bytes, source);

// Prefetch data
PARQUET_THROW_NOT_OK(source->WillNeed(*read_ranges));
Member

What will happen if the read behavior of source->WillNeed(*read_ranges) is different from the IO merging process in the ChunkBufferedInputStream?

std::shared_ptr<ArrowInputStream> GetStream(std::shared_ptr<ArrowInputFile> source,
int64_t start, int64_t num_bytes);
int64_t start, int64_t num_bytes,
const ReadRanges* read_ranges = NULLPTR);
Member

This function now returns two different streams depending on the availability of the read_ranges input. It would be good to add some comments to make this clear.


virtual ::arrow::Status GetRecordBatchReader(
const std::vector<int>& row_group_indices, const std::vector<int>& column_indices,
const std::optional<std::unordered_map<int, RowRanges>>& row_ranges,
std::unique_ptr<::arrow::RecordBatchReader>* out) = 0;
Member

It seems redundant to provide row_group_indices and row_ranges at the same time.

::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
std::shared_ptr<::arrow::RecordBatchReader>* out);
::arrow::Status GetRecordBatchReader(std::shared_ptr<::arrow::RecordBatchReader>* out);
::arrow::Status GetRecordBatchReader(
Member

ditto

RETURN_NOT_OK(BoundsCheck(row_groups, column_indices));

if (reader_properties_.pre_buffer()) {
// When row_ranges has value, only the data of hit pages should be load,
Member

If the I/O coalescing process is predictable, we can still enable prebuffering here with some code changes. If we need to implement this, some logic of ChunkBufferedInputStream should be refactored out to be shared by prebuffering.

Contributor Author

If the I/O coalescing process is predictable, we can still enable prebuffering here with some code changes. If we need to implement this, some logic of ChunkBufferedInputStream should be refactored out to be shared by prebuffering.

Yes. In the current implementation, the use of prebuffering is avoided regardless of whether pre_buffer is enabled or not.

};
}

FileColumnIteratorFactory SomeRowGroupsFactory(std::vector<int> row_groups,
Member

I'd suggest to use interface below:

FileColumnIteratorFactory SomeRowGroupsFactory(std::map<int, RowRanges> ranges);

In this way, GetFieldReader and GetFieldReaders can be changed without adding extra optional parameter.

  // RowGroups can be either std::vector<int> or std::map<int, RowRanges>
  template <typename RowGroups>
  Status GetFieldReader(int i,
                        const std::shared_ptr<std::unordered_set<int>>& included_leaves,
                        const RowGroups& row_groups,
                        std::unique_ptr<ColumnReaderImpl>* out)

  template <typename RowGroups>
  Status GetFieldReaders(const std::vector<int>& column_indices,
                         const RowGroups& row_groups,
                         std::vector<std::shared_ptr<ColumnReaderImpl>>* out,
                         std::shared_ptr<::arrow::Schema>* out_schema)

Contributor Author

I'd suggest to use interface below:

FileColumnIteratorFactory SomeRowGroupsFactory(std::map<int, RowRanges> ranges);

In this way, GetFieldReader and GetFieldReaders can be changed without adding extra optional parameter.

  // RowGroups can be either std::vector<int> or std::map<int, RowRanges>
  template <typename RowGroups>
  Status GetFieldReader(int i,
                        const std::shared_ptr<std::unordered_set<int>>& included_leaves,
                        const RowGroups& row_groups,
                        std::unique_ptr<ColumnReaderImpl>* out)

  template <typename RowGroups>
  Status GetFieldReaders(const std::vector<int>& column_indices,
                         const RowGroups& row_groups,
                         std::vector<std::shared_ptr<ColumnReaderImpl>>* out,
                         std::shared_ptr<::arrow::Schema>* out_schema)

If the interface requires all row groups to provide range information, we can make this change.

row_groups_(row_groups.begin(), row_groups.end()) {}

explicit FileColumnIterator(int column_index, ParquetFileReader* reader,
std::vector<int> row_groups, RowRangesOpt row_ranges)
Member

Consider combining row_groups and row_ranges.

Contributor Author

Consider combining row_groups and row_ranges.

Splitting row_groups and row_ranges into two different args is meaningful, because some row groups only hit some rows, while other row groups hit all rows in the row group.

auto row_ranges_map = row_ranges_.value();
auto it = row_ranges_map.find(row_group_ordinal);
if (it != row_ranges_map.end()) {
auto index_reader = reader_->GetPageIndexReader()->RowGroup(row_group_ordinal);
Member

Error handling is required here when index_reader is not available.

Contributor Author

Error handling is required here when index_reader is not available.

Yes, it should fall back to the original logic when index_reader is not available.

@huberylee huberylee closed this Jan 9, 2024
@huberylee huberylee reopened this Jan 9, 2024
@huberylee
Contributor Author

I commented on specific sections, but looking over the PR it seems quite different from what was discussed in https://docs.google.com/document/d/1SeVcYudu6uD9rb9zRAnlLGgdauutaNZlAaS0gVzjkgM/edit. I'm not sure I understand the cause of the discrepancies.

The current implementation in this MR is completely different from the one in https://docs.google.com/document/d/1SeVcYudu6uD9rb9zRAnlLGgdauutaNZlAaS0gVzjkgM. It provides a new implementation idea. If the community finds it meaningful, let us work together to integrate these two different implementations.

}

if (cached_source_ && prebuffered_column_chunks_bitmap_ != nullptr &&
::arrow::bit_util::GetBit(prebuffered_column_chunks_bitmap_->data(), i)) {


Why do we still have to read all the data here?

@github-actions github-actions bot added the Status: stale-warning Issues and PRs flagged as stale which are due to be closed if no indication otherwise label Nov 18, 2025
@thisisnic
Member

Thank you for your contribution. Unfortunately, this pull request has been marked as stale because it has had no activity in the past 365 days. Please remove the stale label or comment below, or this PR will be closed in 14 days. Feel free to re-open this if it has been closed in error. If you do not have repository permissions to reopen the PR, please tag a maintainer.

@github-actions github-actions bot removed the Status: stale-warning Issues and PRs flagged as stale which are due to be closed if no indication otherwise label Dec 31, 2025


Development

Successfully merging this pull request may close these issues.

[C++][Parquet] Support read by row ranges

7 participants