Add support for extended (chunked) arrays for Parquet format in ClickHouse please #39944

@hodgesrm

Description

Problem

The S3 function fails with an exception when reading from a Parquet file with large map data.

Version

22.3.6.5

Reproduction

Here is the SELECT that triggers the problem.

SELECT id, fields_map
FROM s3(
    'my_parquet_file.parquet',
    'Parquet',
    'id Int64, fields_map Map(String, String)');

Here is the schema and data size that trigger the problem. (Collected with parquet-tools.)

  optional group fields_map (MAP) = 217 {
    repeated group key_value {
      required binary key (STRING) = 218;
      optional binary value (STRING) = 219;
    }
  }

fields_map.key_value.value -> Size In Bytes: 13243589  Size In Ratio: 0.20541047
fields_map.key_value.key   -> Size In Bytes: 3008860   Size In Ratio: 0.046667963
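Some back-of-the-envelope arithmetic on those numbers (assuming "Size In Ratio" is the column's share of the total file size, as parquet-tools reports it) shows why this column is a plausible trigger: it is ~12.6 MiB compressed, so its decompressed string data can easily cross a 16 MiB per-array cap even though the on-disk size does not:

```python
# Sketch only: interpreting the parquet-tools numbers from this report.
value_bytes = 13_243_589        # fields_map.key_value.value, on disk
value_ratio = 0.20541047        # assumed: column bytes / total file bytes

total_file = value_bytes / value_ratio
print(f"whole file: ~{total_file / 2**20:.0f} MiB")

old_arrow_cap = 16 * 2**20      # 16 MiB limit referenced in ARROW-4688
print(f"value column on disk: {value_bytes / 2**20:.1f} MiB "
      f"(under the cap compressed, but decompressed size is what counts)")
```

With a total file around 60 MiB and the value column a fifth of it, even modest compression on string data would push the decompressed column past 16 MiB.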

And here is the resulting error.

Code: 33. DB::Exception: Received from localhost:9000. DB::Exception: Error while reading Parquet data: NotImplemented: Nested data conversions not implemented for chunked array outputs: While executing ParquetBlockInputFormat: While executing S3. (CANNOT_READ_ALL_DATA)
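For context on "chunked array outputs": when a column's data exceeds a per-array size cap, the Arrow reader returns it as several size-capped chunks instead of one contiguous array, and every consumer must handle that list. A minimal stdlib-Python sketch of the idea (illustrative only, not ClickHouse or Arrow code; names and the 16 MiB cap are assumptions based on ARROW-4688):

```python
LIMIT = 16 * 1024 * 1024  # assumed per-array cap, 16 MiB

def chunk_values(values, limit=LIMIT):
    """Greedily pack variable-length byte values into size-capped chunks."""
    chunks, current, size = [], [], 0
    for v in values:
        if current and size + len(v) > limit:
            chunks.append(current)   # cap reached: emit a chunk
            current, size = [], 0
        current.append(v)
        size += len(v)
    chunks.append(current)
    return chunks

# ~20 MiB of 1 MiB values cannot fit in a single 16 MiB array:
values = [b"x" * (1024 * 1024)] * 20
print(len(chunk_values(values)))  # -> 2 chunks
```

The error suggests ClickHouse's Parquet reader handles the single-array case for nested types (like Map) but raises NotImplemented when the reader hands back more than one chunk.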

The full stack trace is as follows:

chi-ch-39-ch-39-0-0-0 ch 2022.08.06 18:24:41.451777 [ 2347 ] {a7552683-f8dc-4ad4-a838-4555dc944e28} <Error> TCPHandler: Code: 33. DB::Exception: Error while reading Parquet data: NotImplemented: Nested data conversions not implemented for chunked array outputs: While executing ParquetBlockInputFormat: While executing S3. (CANNOT_READ_ALL_DATA), Stack trace (when copying this message, always include the lines below):
chi-ch-39-ch-39-0-0-0 ch
chi-ch-39-ch-39-0-0-0 ch 0. DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, bool) @ 0xb37173a in /usr/bin/clickhouse
chi-ch-39-ch-39-0-0-0 ch 1. DB::ParquetBlockInputFormat::generate() @ 0x169f6702 in /usr/bin/clickhouse
chi-ch-39-ch-39-0-0-0 ch 2. DB::ISource::tryGenerate() @ 0x168fc395 in /usr/bin/clickhouse
chi-ch-39-ch-39-0-0-0 ch 3. DB::ISource::work() @ 0x168fbf5a in /usr/bin/clickhouse
chi-ch-39-ch-39-0-0-0 ch 4. DB::ExecutionThreadContext::executeTask() @ 0x1691c6e3 in /usr/bin/clickhouse
chi-ch-39-ch-39-0-0-0 ch 5. DB::PipelineExecutor::executeStepImpl(unsigned long, std::__1::atomic<bool>*) @ 0x1691013e in /usr/bin/clickhouse
chi-ch-39-ch-39-0-0-0 ch 6. DB::PipelineExecutor::executeStep(std::__1::atomic<bool>*) @ 0x1690f960 in /usr/bin/clickhouse
chi-ch-39-ch-39-0-0-0 ch 7. DB::PullingPipelineExecutor::pull(DB::Chunk&) @ 0x1692120e in /usr/bin/clickhouse
chi-ch-39-ch-39-0-0-0 ch 8. DB::StorageS3Source::generate() @ 0x160f5a6c in /usr/bin/clickhouse
chi-ch-39-ch-39-0-0-0 ch 9. DB::ISource::tryGenerate() @ 0x168fc395 in /usr/bin/clickhouse
chi-ch-39-ch-39-0-0-0 ch 10. DB::ISource::work() @ 0x168fbf5a in /usr/bin/clickhouse
chi-ch-39-ch-39-0-0-0 ch 11. DB::SourceWithProgress::work() @ 0x16b53862 in /usr/bin/clickhouse
chi-ch-39-ch-39-0-0-0 ch 12. DB::ExecutionThreadContext::executeTask() @ 0x1691c6e3 in /usr/bin/clickhouse
chi-ch-39-ch-39-0-0-0 ch 13. DB::PipelineExecutor::executeStepImpl(unsigned long, std::__1::atomic<bool>*) @ 0x1691013e in /usr/bin/clickhouse
chi-ch-39-ch-39-0-0-0 ch 14. ? @ 0x16911aa4 in /usr/bin/clickhouse
chi-ch-39-ch-39-0-0-0 ch 15. ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>) @ 0xb418b97 in /usr/bin/clickhouse
chi-ch-39-ch-39-0-0-0 ch 16. ? @ 0xb41c71d in /usr/bin/clickhouse
chi-ch-39-ch-39-0-0-0 ch 17. ? @ 0x7f7e70c2e609 in ?
chi-ch-39-ch-39-0-0-0 ch 18. __clone @ 0x7f7e70b53163 in ?

Workaround?

There does not appear to be a workaround for this problem.

Additional Information

A similar-looking problem was fixed in the Arrow libraries in 2019: https://issues.apache.org/jira/browse/ARROW-4688. That fix raised the relevant limit from 16 MB to 2 GB. Other tools read these Parquet files without problems, so perhaps the Arrow library bundled with ClickHouse is out of date?
