
Add support for extended (chunked) arrays for Parquet format#40485

Merged
Avogar merged 1 commit into ClickHouse:master from arthurpassos:fix-parquet-chunked-array-deserialization
Sep 1, 2022

Conversation

@arthurpassos
Contributor

@arthurpassos arthurpassos commented Aug 22, 2022

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

ClickHouse was using `parquet::FileReader::ReadAll` to parse Parquet files. This code path leads to `Nested data conversions not implemented for chunked array outputs` when the input ends up building a chunked array internally. According to arrow upstream folks, using `FileReader::GetRecordBatchReader` takes a different code path that could work.

The SELECT statement succeeds when using the latter, but I couldn't verify whether the data is correct. I believe it is. To verify, I need a way to compare the original Parquet file with the one processed by ClickHouse. The test file I have been using is big, and ClickHouse fails to export it to Parquet (a different problem, out of scope for this PR).

I have tried some combinations of transformations (JSON & Parquet) using Python, Spark & ClickHouse to find a way to validate the data, all of them failed because of a variety of reasons like:

  1. Python's pyarrow throws the very same exception ClickHouse was throwing when it tries to read the original file; it can't load it into memory.
  2. Python's fastparquet fails to read the original file with a weird exception.
  3. ClickHouse can't export the SELECT statement result to Parquet because of an arrow internal memory limitation; it throws an exception.
  4. Splitting the file into chunks and exporting failed because of formatting and encoding issues. In JSON, for instance, some escape characters were added by ClickHouse that were not added by Spark.

I'll continue investigating ways to validate the impl and possibly implement a test.

Closes #39944

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/

@robot-clickhouse robot-clickhouse added the `pr-improvement` (Pull request with some product improvements) label Aug 22, 2022
@alexey-milovidov alexey-milovidov added the `can be tested` (Allows running workflows for external contributors) label Aug 22, 2022
@Avogar Avogar self-assigned this Aug 22, 2022
@arthurpassos
Contributor Author

Earlier this morning I managed to compare the top 50% of the original file with the one processed by ClickHouse using Spark & pyspark. They match! I didn't compare the other half because the only pyspark API available to grab the last N elements is `tail`, and it was taking far more memory than `limit`.

I can't use the test file I have to implement a test because it contains confidential data. I managed to generate a file with the schema presented in the issue, but it doesn't raise the exception. The test file that was provided contains many more columns than the select statement references; maybe that has something to do with it. Will investigate further.

@arthurpassos
Contributor Author

I have validated the implementation by doing the following:

  1. Export the file processed by ClickHouse into two Parquet files with the command `clickhouse-client --query "select id, _acc_fields_map from s3(<file_path>, 'Parquet', 'id Int64, _acc_fields_map Map(String, String)') order by id limit <lower_limit>, <upper_limit> FORMAT Parquet" > parquet validation/processed<part_number>.parquet`, with `lower_limit`, `upper_limit` and `part_number` being 0/80692, 80692/161384 and 1/2 for part 1 and part 2 respectively.
  2. Using PySpark, concatenate both parts processed by ClickHouse into a single DataFrame. Then assert the equality of the original DataFrame (read from the original file) with the processed one. Python script:

```python
from pyspark.sql import functions as F

original_df = spark.read.parquet("original-strippedd.parquet").orderBy("id")

processed_df1 = spark.read.parquet("processed1.parquet")
processed_df2 = spark.read.parquet("processed2.parquet")

# Concatenate both exported parts and normalize the map column's type
# before comparing row-by-row against the original.
processed_concatenated = processed_df1.union(processed_df2).withColumn(
    "fields_map", F.col("fields_map").cast("map<string,string>"))

assert original_df.collect() == processed_concatenated.collect()
```

To be sure the assertion was doing its job, I dropped the first row of the original DataFrame with `original_df_filtered = original_df.where(original_df.id > <first_id>)` and compared against that instead; the assertion then failed, as expected.

I have spent the last week trying to write a test, but failed to do so, mainly because I could not generate a file that raises the exception. I tried several combinations using VERY large strings with LOW and HIGH cardinality; none of them triggered the issue. AFAIK, the data gets internally chunked when the chunk memory limit is reached within a row group. As of now, it's set to 2^32 - 1.

Since I validated that it works in that case, I'll set this PR as ready for review. I am open to discussion.

@arthurpassos arthurpassos marked this pull request as ready for review August 29, 2022 14:21
@arthurpassos arthurpassos marked this pull request as draft August 30, 2022 12:37
@arthurpassos
Contributor Author

Marked it as draft again. While the initial case is working properly, the below isn't:

```python
import pyarrow as pa
import pyarrow.parquet as pq

arr = pa.array([[("a" * 2**30, 1)]], type=pa.map_(pa.string(), pa.int32()))
arr = pa.chunked_array([arr, arr])
tab = pa.table({"arr": arr})

pq.write_table(tab, "test.parquet")

pq.read_table("test.parquet")
```

@arthurpassos
Contributor Author

The Arrow lib contains two variants of String: `String` and `LargeString`. After some hackish changes to the arrow lib, I managed to force the use of `LargeString`, and the data no longer gets chunked. Using `LargeString` by default certainly has its implications and does use more resources. The suggestion from arrow upstream folks is to add a setting to the library API that allows conversion from non-large types to large types. Since it's a 3rd party library and a rather complicated change, it could take a long time to get this right & merged. More info on ARROW-17459.

The changes in this PR seem to fix the original case by avoiding that code path. They don't solve the latter case, though. This approach was a suggestion from arrow upstream folks in an ARROW-17459 comment.

Based on that, I am re-opening this PR as ready for review.

@arthurpassos arthurpassos marked this pull request as ready for review September 1, 2022 17:29
@arthurpassos
Contributor Author

@Avogar kind ping :)

@Avogar Avogar merged commit f53aa86 into ClickHouse:master Sep 1, 2022
Enmk added a commit to Altinity/ClickHouse that referenced this pull request Sep 19, 2022
Backport ClickHouse#40485 to 22.3: fix parquet chunked arrays
Enmk pushed a commit to Altinity/ClickHouse that referenced this pull request Sep 21, 2022
…nked-array-deserialization

Add support for extended (chunked) arrays for Parquet format
Enmk added a commit to Altinity/ClickHouse that referenced this pull request Sep 21, 2022
Backport ClickHouse#40485 to 22.3: fix parquet chunked arrays
@alexey-milovidov
Member

```cpp
arrow::Status read_status = rbr->ReadAll(&table);
```

WTF?
Revert?

@alexey-milovidov
Member

Actually it's alright.

Enmk pushed a commit to Altinity/ClickHouse that referenced this pull request Feb 8, 2023
…nked-array-deserialization

Add support for extended (chunked) arrays for Parquet format
Enmk added a commit to Altinity/ClickHouse that referenced this pull request Feb 9, 2023
…array_40485

22.8 Backport of ClickHouse#40485 parquet chunked array support

Labels

`can be tested` (Allows running workflows for external contributors), `pr-improvement` (Pull request with some product improvements)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add support for extended (chunked) arrays for Parquet format in ClickHouse please

4 participants