ARROW-18420: [C++][Parquet] Introduce ColumnIndex & OffsetIndex by wgtmac · Pull Request #14803 · apache/arrow

wgtmac · 2022-12-01T14:55:33Z

Basically this patch has defined the interface of ColumnIndex and OffsetIndex. Implementation classes are also provided to deserialize byte stream, wrap thrift message and provide access to their attributes.

BTW, the naming style throughout the code base looks confusing to me. I have tried to follow what I have understood from the parquet sub-directory. Please correct me if anything is incorrect.

github-actions · 2022-12-01T14:55:57Z

https://issues.apache.org/jira/browse/ARROW-18420

wgtmac · 2022-12-01T15:06:39Z

@pitrou @emkornfield Can you please take a look at the proposed interface and implementation? Once they are finalized, I will add unit tests. Thank you very much!

pitrou

Some comments on the API, I didn't look at the implementation.

cpp/src/parquet/page_index.h

pitrou · 2022-12-01T15:26:34Z

cpp/src/parquet/page_index.h

Is it useful to have this in addition from null_pages?

This makes it easy to know which page_id is valid when the return values from line 92 are used.

cpp/src/parquet/page_index.h

cpp/src/parquet/page_index.cc

wgtmac · 2022-12-02T15:36:43Z

In the last commit, I have removed all the per-page accessors. They are now actually thin wrappers to the corresponding thrift class. How about this? @emkornfield @pitrou

pitrou

Thanks for this @wgtmac . Here are some more comments.

cpp/src/parquet/page_index.h

cpp/src/parquet/page_index.cc

pitrou · 2022-12-05T13:57:41Z

cpp/src/parquet/page_index.cc

Can use arrow::Unreachable() from arrow/util/unreachable.h.

Instead can we remove the default from the switch? or at least throw an exception, crashing seems like the wrong choice if a new enum is added.

Instead can we remove the default from the switch? or at least throw an exception, crashing seems like the wrong choice if a new enum is added.

Agreed. In this way, the developer gets noticed there is something to fix at compile time instead of run time.

cpp/src/parquet/page_index.cc

cpp/src/parquet/page_index.h

cpp/src/parquet/page_index.cc

pitrou · 2022-12-06T10:06:31Z

I just realized: in this API, encoded_min_values and min_values use different conventions? (one is dense, the other is sparse).

wgtmac · 2022-12-06T10:26:06Z

I just realized: in this API, encoded_min_values and min_values use different conventions? (one is dense, the other is sparse).

Yes, the encoded values are directly populated from the thrift ColumnIndex object, with null pages filled by an empty string. The decoded values have excluded null pages so it is much easier to do binary search if the boundary order is ASC or DESC without considering null pages. This is especially effective to implement predicate push down.

wgtmac · 2022-12-06T15:40:31Z

I just realized: in this API, encoded_min_values and min_values use different conventions? (one is dense, the other is sparse).

After checking the potential use cases of ColumnIndex, I think it makes sense to follow same convention for encoded_min_values() and min_values() to have same number of pages. For predicate push down, the binary search for sorted column index can be simply done by iterating page_ids returned from non_null_page_indices(). The latest commit reflects this change. Please take a look, thanks! @pitrou

pitrou · 2022-12-06T17:00:17Z

Thanks @wgtmac . Is it possible to add some tests for this? One option is to rely on data files from https://github.com/apache/parquet-testing/tree/master/data (you might also upload your own there if desired)

cpp/src/parquet/page_index.cc

wgtmac · 2022-12-07T07:55:54Z

Thanks @wgtmac . Is it possible to add some tests for this? One option is to rely on data files from https://github.com/apache/parquet-testing/tree/master/data (you might also upload your own there if desired)

I have added page_index_test.cc to test deserialization of page index objects. Please take a look, thanks!

cpp/src/parquet/page_index_test.cc

pitrou · 2022-12-12T14:15:27Z

cpp/src/parquet/page_index_test.cc

+  const std::vector<size_t> page_indices = {0, 1};
+  const std::vector<bool> null_pages = {true, true};
+  // It seems that the null_counts are malformed.
+  const bool has_null_counts = true;
+  const std::vector<int64_t> null_counts = {-1, -1};
+  const std::vector<int32_t> min_values = {};
+  const std::vector<int32_t> max_values = {};


Seems like a bad example to be honest. Can you find or generate a test case which has both null and non-null pages in a single column?

There isn't. Most of the files are even created without page index. Sigh :(

But why not generate a test file yourself? There's probably an implementation that would allow to do it?

pitrou · 2022-12-12T14:16:22Z

cpp/src/parquet/page_index_test.cc

+  const std::vector<int32_t> max_values = {};
+
+  TestReadTypedColumnIndex<Int32Type>(
+      "datapage_v1-uncompressed-checksum.parquet", column_id, num_pages, boundary_order,


If you read this file, you'll see that all data is actually non-null. So the column index in this file may be bogus?
(or you're reading it wrong?)

I think the page index is corrupted. The parquet-cli has provided same reading:

➜ parquet-cli column-index datapage_v1-uncompressed-checksum.parquet row-group 0: column index for column a: Boundary order: ASCENDING null count min max page-0 -1 <none> <none> page-1 -1 <none> <none> offset index for column a: offset compressed size first row index page-0 4 10268 0 page-1 10272 10268 2560 column index for column b: Boundary order: ASCENDING null count min max page-0 -1 <none> <none> page-1 -1 <none> <none> offset index for column b: offset compressed size first row index page-0 20540 10268 0 page-1 30808 10268 2560

pitrou

I pushed a couple of minor changes, but it seems the tests are still lacking a bit. See comments.

cpp/src/parquet/page_index.cc

pitrou · 2022-12-13T15:38:07Z

Thanks a lot! I think this is good to go now, will just wait for CI.

emkornfield · 2022-12-14T05:55:52Z

CC @fatemehp

ursabot · 2022-12-14T18:10:16Z

Benchmark runs are scheduled for baseline = fe5c9fe and contender = 56cf063. 56cf063 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.03% ⬆️0.0%] test-mac-arm
[Finished ⬇️2.99% ⬆️5.71%] ursa-i9-9960x
[Finished ⬇️1.33% ⬆️1.57%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 56cf063b ec2-t3-xlarge-us-east-2
[Finished] 56cf063b test-mac-arm
[Finished] 56cf063b ursa-i9-9960x
[Finished] 56cf063b ursa-thinkcentre-m75q
[Finished] fe5c9fe3 ec2-t3-xlarge-us-east-2
[Finished] fe5c9fe3 test-mac-arm
[Finished] fe5c9fe3 ursa-i9-9960x
[Finished] fe5c9fe3 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot · 2022-12-14T18:12:49Z

['Python', 'R'] benchmarks have high level of regressions.
ursa-i9-9960x

github-actions bot added Component: C++ Component: Parquet labels Dec 1, 2022

pitrou reviewed Dec 1, 2022

View reviewed changes

wgtmac requested a review from pitrou December 2, 2022 03:03