GH-33115: [C++] Parquet Implement crc in reading and writing Page for DATA_PAGE (v1)#14351
GH-33115: [C++] Parquet Implement crc in reading and writing Page for DATA_PAGE (v1)#14351wjones127 merged 49 commits intoapache:masterfrom
Conversation
|
|
|
Thanks for the PR @mapleFU . I haven't looked in detail yet. IMHO The testing approach should be to add reference files in https://github.com/apache/parquet-testing/tree/master/data .
This would probably be spread over two files, one with correct CRCs, one with incorrect CRCs. |
This actually seems more complicated to implement (you have to replicate details of the file format in the tests). |
Sure, let me generate some data files in parquet-mr following |
|
Since the code is simple and the algorithm well-know, my vote is to own it and keep it in |
|
I like |
|
@mapleFU Is this still WIP? |
# Conflicts: # docs/source/cpp/parquet.rst
|
Some comment fixed, test would be added later |
|
As I said before:
|
|
And, yes, one or two test vectors is enough. But make sure to include bytes >= 128 (to check for signedness issues). |
5d71c7d to
300b680
Compare
|
@pitrou I decide to add boost crc32 for testing, though we don't introduce it in our code, but I think using it to test is a goot idea. |
Seems windows cannot use random uint8_t... |
2f167bb to
17e13bd
Compare
|
Oh, seems #34038 cause the error... Let's waiting it to be merged... |
wjones127
left a comment
There was a problem hiding this comment.
This looks good. We use boost for testing in other places, so I think that's reasonable.
This patch add crc in writing and reading DATA_PAGE. And crc for dictionary, DATA_PAGE_V2 will be added in comming patches.
parquet-mrAnd there is some questions, I found that in thirdparty, arrow imports
crc32c, which is extracted from leveldb's crc library. But seems that our standard uses crc32, which has a different magic number. So I vendor implementions mentioned in https://issues.apache.org/jira/browse/ARROW-17904 .The default config of
enable crcin parquet-mr for writer is true, but here I usefalse, because set it true may slow down writer.