ARROW-18420: [C++][Parquet] Introduce ColumnIndex & OffsetIndex#14803
ARROW-18420: [C++][Parquet] Introduce ColumnIndex & OffsetIndex#14803pitrou merged 19 commits intoapache:masterfrom
Conversation
|
@pitrou @emkornfield Can you please take a look at the proposed interface and implementation? Once they are finalized, I will add unit tests. Thank you very much! |
pitrou
left a comment
There was a problem hiding this comment.
Some comments on the API, I didn't look at the implementation.
cpp/src/parquet/page_index.h
Outdated
There was a problem hiding this comment.
Is it useful to have this in addition from null_pages?
There was a problem hiding this comment.
This makes it easy to know which page_id is valid when the return values from line 92 are used.
|
In the last commit, I have removed all the per-page accessors. They are now actually thin wrappers to the corresponding thrift class. How about this? @emkornfield @pitrou |
cpp/src/parquet/page_index.cc
Outdated
There was a problem hiding this comment.
Can use arrow::Unreachable() from arrow/util/unreachable.h.
There was a problem hiding this comment.
Instead can we remove the default from the switch? or at least throw an exception, crashing seems like the wrong choice if a new enum is added.
There was a problem hiding this comment.
Instead can we remove the default from the switch? or at least throw an exception, crashing seems like the wrong choice if a new enum is added.
Agreed. In this way, the developer gets noticed there is something to fix at compile time instead of run time.
|
I just realized: in this API, |
Yes, the encoded values are directly populated from the thrift ColumnIndex object, with null pages filled by an empty string. The decoded values have excluded null pages so it is much easier to do binary search if the boundary order is ASC or DESC without considering null pages. This is especially effective to implement predicate push down. |
After checking the potential use cases of |
|
Thanks @wgtmac . Is it possible to add some tests for this? One option is to rely on data files from https://github.com/apache/parquet-testing/tree/master/data (you might also upload your own there if desired) |
I have added |
cpp/src/parquet/page_index_test.cc
Outdated
| const std::vector<size_t> page_indices = {0, 1}; | ||
| const std::vector<bool> null_pages = {true, true}; | ||
| // It seems that the null_counts are malformed. | ||
| const bool has_null_counts = true; | ||
| const std::vector<int64_t> null_counts = {-1, -1}; | ||
| const std::vector<int32_t> min_values = {}; | ||
| const std::vector<int32_t> max_values = {}; |
There was a problem hiding this comment.
Seems like a bad example to be honest. Can you find or generate a test case which has both null and non-null pages in a single column?
There was a problem hiding this comment.
There isn't. Most of the files are even created without page index. Sigh :(
There was a problem hiding this comment.
But why not generate a test file yourself? There's probably an implementation that would allow to do it?
cpp/src/parquet/page_index_test.cc
Outdated
| const std::vector<int32_t> max_values = {}; | ||
|
|
||
| TestReadTypedColumnIndex<Int32Type>( | ||
| "datapage_v1-uncompressed-checksum.parquet", column_id, num_pages, boundary_order, |
There was a problem hiding this comment.
If you read this file, you'll see that all data is actually non-null. So the column index in this file may be bogus?
(or you're reading it wrong?)
There was a problem hiding this comment.
I think the page index is corrupted. The parquet-cli has provided same reading:
➜ parquet-cli column-index datapage_v1-uncompressed-checksum.parquet
row-group 0:
column index for column a:
Boundary order: ASCENDING
null count min max
page-0 -1 <none> <none>
page-1 -1 <none> <none>
offset index for column a:
offset compressed size first row index
page-0 4 10268 0
page-1 10272 10268 2560
column index for column b:
Boundary order: ASCENDING
null count min max
page-0 -1 <none> <none>
page-1 -1 <none> <none>
offset index for column b:
offset compressed size first row index
page-0 20540 10268 0
page-1 30808 10268 2560
pitrou
left a comment
There was a problem hiding this comment.
I pushed a couple of minor changes, but it seems the tests are still lacking a bit. See comments.
|
Thanks a lot! I think this is good to go now, will just wait for CI. |
|
CC @fatemehp |
|
Benchmark runs are scheduled for baseline = fe5c9fe and contender = 56cf063. 56cf063 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
|
['Python', 'R'] benchmarks have high level of regressions. |
Basically this patch has defined the interface of
ColumnIndexandOffsetIndex. Implementation classes are also provided to deserialize byte stream, wrap thrift message and provide access to their attributes.BTW, the naming style throughout the code base looks confusing to me. I have tried to follow what I have understood from the parquet sub-directory. Please correct me if anything is incorrect.