Add support for extended (chunked) arrays for Parquet format #40485
Conversation
Earlier this morning I managed to compare the top 50% of the original file with the one processed by ClickHouse using Spark & pyspark. They match! I didn't compare the other half because of limitations in the pyspark API for grabbing the last N elements.

I can't use the test file I have to implement a test because it's confidential data. I just managed to generate a file with the schema presented in the issue, but it doesn't raise the exception. The test file that was provided contains many more columns than what was in the SELECT statement; maybe it has something to do with that. Will investigate further.
I have validated the implementation by doing the following:
To be sure the assertion was doing its job, I dropped the first row in the original DataFrame and the assertion failed.

I have spent the last week trying to write a test, but failed to do so, mainly because I could not generate a file that raises the exception. I have tried several combinations using VERY large strings with LOW and HIGH cardinality. None of them caused the issue. AFAIK, the data gets internally chunked when the chunk memory limit is reached within a row group. As of now, it's set to 2^32 - 1. Since I validated it works in that case, I'll set this PR as ready for review. I am open to discussion.
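The chunking behaviour described above can be sketched as follows. This is a simplified pure-Python model of the idea, not Arrow's actual implementation; the function name and the tiny byte limit are illustrative only:

```python
def split_into_chunks(values, limit):
    """Greedily pack string values into chunks whose total byte size
    stays at or under `limit` (a simplified model of a writer starting
    a new chunk once a per-chunk memory limit is reached)."""
    chunks, current, current_bytes = [], [], 0
    for v in values:
        size = len(v.encode("utf-8"))
        if current and current_bytes + size > limit:
            chunks.append(current)
            current, current_bytes = [], 0
        current.append(v)
        current_bytes += size
    if current:
        chunks.append(current)
    return chunks

# With a tiny limit the column splits into several chunks, which is the
# situation that triggers the "chunked array" code path.
print(split_into_chunks(["aa", "bb", "cc"], limit=4))  # → [['aa', 'bb'], ['cc']]
```

In the real reader the limit is far too large to hit with hand-written test data, which is consistent with the difficulty of reproducing the exception described above.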
Marked it as draft again. While the initial case is working properly, the one below isn't:
The Arrow lib contains two variants of String: String and LargeString. The changes in this PR seem to fix the original case by avoiding that code path. They don't solve the latter, though. Based on that, I am re-opening this PR as ready for review.
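For context on the two variants: Arrow's String type addresses its character data with 32-bit offsets, so a single String array tops out at 2^31 - 1 bytes, while LargeString uses 64-bit offsets. A rough sketch of that distinction (my own illustration, not Arrow code):

```python
INT32_MAX = 2**31 - 1  # largest byte offset a 32-bit-offset String array can address

def needs_large_string(total_bytes):
    """Return True when the column's character data no longer fits a
    String array's 32-bit offsets and would need LargeString (or
    chunking into several smaller String arrays)."""
    return total_bytes > INT32_MAX

print(needs_large_string(100))    # small column: plain String is fine → False
print(needs_large_string(2**31))  # over the limit: LargeString territory → True
```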
@Avogar kind ping :)
Backport ClickHouse#40485 to 22.3: fix parquet chunked arrays
…nked-array-deserialization Add support for extended (chunked) arrays for Parquet format
WTF?

Actually it's alright.
…array_40485 22.8 Backport of ClickHouse#40485 parquet chunked array support
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
ClickHouse was using `parquet::FileReader::ReadAll` to parse Parquet. This code path leads to `Nested data conversions not implemented for chunked array outputs` when the input ends up building a chunked array internally. According to arrow-upstream folks, using `FileReader::GetRecordBatchReader` would result in a different code path that could work. The SELECT statement succeeds when using the latter, but I couldn't verify whether the data is correct. I believe it is. In order to verify, I need to find a way to compare the original Parquet file with the one processed by ClickHouse. The test file I have been using is big, and ClickHouse fails to export it to Parquet (this is a different problem, not in the scope of this PR).
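The difference between the two code paths can be sketched in plain Python (illustrative only; the real APIs are Arrow C++'s `FileReader::ReadAll` and `FileReader::GetRecordBatchReader`): reading everything at once materialises one big table, which may end up as a chunked array, while a record-batch reader yields bounded batches that the consumer processes one at a time:

```python
def read_all(batches):
    """ReadAll-style: concatenate every batch into one big table.
    An oversized result is what forces Arrow into a chunked array."""
    table = []
    for batch in batches:
        table.extend(batch)
    return table

def record_batch_reader(batches):
    """GetRecordBatchReader-style: yield one bounded batch at a time,
    so no single oversized array is ever built."""
    for batch in batches:
        yield batch

data = [[1, 2], [3, 4], [5]]
print(read_all(data))                          # → [1, 2, 3, 4, 5]
print(list(record_batch_reader(data)))         # → [[1, 2], [3, 4], [5]]
```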
I have tried some combinations of transformations (JSON & Parquet) using Python, Spark & ClickHouse to find a way to validate the data; all of them failed for a variety of reasons:
- `pyarrow` throws the very same exception ClickHouse was throwing when it tries to read the original file. Can't load it into memory.
- `fastparquet` fails to read the original file with a weird exception.
- `arrow` hits an internal memory limitation and throws an Exception.

I'll continue investigating ways to validate the impl and possibly implement a test.
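One memory-friendly way to validate, sketched with plain iterators (a hypothetical helper; in practice the batches would come from a streaming Parquet reader rather than in-memory lists): compare the two files batch by batch instead of loading either one fully.

```python
from itertools import zip_longest

def streams_equal(stream_a, stream_b):
    """Compare two row streams batch by batch, never holding more than
    one batch of each in memory (a sketch of validating files too large
    for the full in-memory loads that failed above)."""
    sentinel = object()
    for a, b in zip_longest(stream_a, stream_b, fillvalue=sentinel):
        if a is sentinel or b is sentinel or a != b:
            return False
    return True

print(streams_equal(iter([[1, 2], [3]]), iter([[1, 2], [3]])))  # → True
print(streams_equal(iter([[1, 2]]), iter([[1, 2], [3]])))       # → False
```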
Closes #39944