A native parquet reader for primitive types #60361
al13n321 merged 15 commits into ClickHouse:master
Conversation
This is an automatic comment. The PR description does not match the template. Please edit it accordingly. The error is: More than one changelog category specified: 'Improvement', 'Performance Improvement'

This is an automated comment for commit 3d7befe with a description of existing statuses. It's updated for the latest CI run: ❌
Force-pushed from 5012933 to 5166d8b.
Well, I think these broken tests are not caused by my commit. UnitTestsAsan and ASTFuzzerTestAsan succeeded in the previous run, while the integration test failed because of a logical error. Hoping for further suggestions.
Yes, we need to start the investigation, and then check and fix every failure one by one.

@alexey-milovidov
@copperybean, please resolve conflicts. |
al13n321 left a comment:
Cool, thanks for working on this!
Lots of review comments, mostly unimportant. The only blockers are:
- Seems that TIMESTAMP types will be incorrectly interpreted as seconds, please test.
- The block_missing_values_ptr thing, please test. If it doesn't work, at least add a TODO so we don't forget.
- Pass file metadata to ParquetFileReader::Open to avoid re-reading it for each row group.
    /// If defaults_for_omitted_fields is true, calculate the default values from default expression for omitted fields.
    /// Otherwise fill the missing columns with zero values of its type.
    BlockMissingValues * block_missing_values_ptr = format_settings.defaults_for_omitted_fields ? &res.block_missing_values : nullptr;
    row_group_batch.arrow_column_to_ch_column->arrowTableToCHChunk(res.chunk, *tmp_table, (*tmp_table)->num_rows(), block_missing_values_ptr);
As this comment says, block_missing_values_ptr is used for applying default expressions to omitted fields (here "omitted" means NULLs, I think).
Native reader doesn't populate block_missing_values_ptr, so it'll probably break default expressions. Please test this:
create table a (x Int64, y String default 'asd') engine Memory
insert into function file('t.parquet') select 1 as x, null::Nullable(String) as y
insert into a select * from file('t.parquet')
select * from a
┌─x─┬─y───┐
1. │ 1 │ asd │
└───┴─────┘
If it doesn't work, either populate block_missing_values_ptr based on null masks or at least add a TODO comment about that.
Sorry, I missed this feature. I would like to add a TODO comment, because the block-missing-values logic will be modified again when supporting nested types, so it can be implemented later for all types.
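For context, the TODO amounts to deriving "missing value" positions from each column's null mask so that default expressions can later fill them. A minimal stand-alone sketch of that idea (hypothetical types: `BlockMissingValuesSketch` stands in for ClickHouse's `BlockMissingValues`, whose real API differs):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for ClickHouse's BlockMissingValues:
// one bitmap per column, 1 = value was missing (null) in that row.
struct BlockMissingValuesSketch
{
    std::vector<std::vector<uint8_t>> masks;

    void setBit(size_t column_idx, size_t row_idx)
    {
        if (masks.size() <= column_idx)
            masks.resize(column_idx + 1);
        auto & mask = masks[column_idx];
        if (mask.size() <= row_idx)
            mask.resize(row_idx + 1, 0);
        mask[row_idx] = 1;
    }
};

// Mark every null row of one column as "missing", so default
// expressions can be applied to those rows afterwards.
void fillFromNullMask(BlockMissingValuesSketch & mv, size_t column_idx,
                      const std::vector<uint8_t> & null_mask)  // 1 = null
{
    for (size_t row = 0; row < null_mask.size(); ++row)
        if (null_mask[row])
            mv.setBit(column_idx, row);
}
```

With the `y` column from the test above (one null out of one row), the resulting bitmap would flag that row, letting the `default 'asd'` expression replace it.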
src/Processors/Formats/Impl/Parquet/ParquetDataValuesReader.cpp (outdated, resolved)
    IndividualVisitor && individual_visitor,
    RepeatedVisitor && repeated_visitor);
A simpler way to handle nulls is to read all non-null values into a column, then use IColumn::expand to insert default values according to null mask. A little slower because of copying.
I don't mind the current approach, just mentioning the alternative in case it turns out better for the general case of arbitrarily nested arrays, nullables, and tuples.
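For illustration, the read-dense-then-expand alternative could look like this sketch (plain `std::vector` stands in for `IColumn`; this is not ClickHouse's actual `IColumn::expand`, just the same idea):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Expand-style null handling: all non-null values are first read
// contiguously, then spread out to full size according to the null
// mask, with the type's default value (0) filling the null positions.
// This is the "little slower because of copying" variant.
std::vector<int64_t> expandByNullMask(
    const std::vector<int64_t> & dense,      // non-null values, in row order
    const std::vector<uint8_t> & null_mask)  // 1 = this row is null
{
    std::vector<int64_t> full(null_mask.size(), 0);  // defaults for nulls
    size_t src = 0;
    for (size_t i = 0; i < null_mask.size(); ++i)
        if (!null_mask[i])
            full[i] = dense[src++];
    return full;
}
```

The appeal for nested types is that the same expansion applies uniformly at each nesting level, instead of interleaving null checks with the value-decoding loop.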
OK, I'd prefer to keep the current implementation for now; maybe I will change the logic as you suggest when supporting nested columns.
    visitColStrIndexType(dictionary->size(), [&]<typename TColVec>(TColVec *)
    {
        const TColVec & col_src = *static_cast<const TColVec *>(col_existing.get());
        reserveColumnStrRows(column, reading_rows_num);

        col_dest.getOffsets().resize(col_src.size());
        for (size_t i = 0; i < col_src.size(); i++)
        {
            auto src_idx = col_src.getData()[i];
            if (0 == src_idx)
            {
                null_map->setNull(i);
            }
            auto dict_chars_cursor = col_dict_str.getOffsets()[src_idx - 1];
            auto str_len = col_dict_str.getOffsets()[src_idx] - dict_chars_cursor;
            auto dst_chars_cursor = col_dest.getChars().size();
            col_dest.getChars().resize(dst_chars_cursor + str_len);

            memcpySmallAllowReadWriteOverflow15(
                &col_dest.getChars()[dst_chars_cursor], &col_dict_str.getChars()[dict_chars_cursor], str_len);
            col_dest.getOffsets()[i] = col_dest.getChars().size();
        }
    });
Would it be easier to create a ColumnLowCardinality and call convertToFullColumn() on it?
Well, it would be easier, but the current implementation is more efficient, if only slightly: the column generated by convertToFullColumn would be smaller than reading_rows_num, so it would have to be resized to reading_rows_num again.
By the way, I may implement the logic as you suggest next time. For the current code, I'd prefer not to change it right now; instead, I can add some comments.
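For readers unfamiliar with the quoted loop, here is a simplified stand-alone sketch of the same dictionary decoding (`std::string` and `std::vector` stand in for ClickHouse's column types; index 0 marking a null row mirrors the quoted code):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Materialize a full string column from a dictionary plus an index
// column, recording nulls. Index 0 is reserved as the null marker,
// so dict[0] is a placeholder entry.
std::vector<std::string> decodeDictionary(
    const std::vector<std::string> & dict,
    const std::vector<uint32_t> & indexes,
    std::vector<uint8_t> & null_mask_out)  // filled with 1 where null
{
    std::vector<std::string> result(indexes.size());
    null_mask_out.assign(indexes.size(), 0);
    for (size_t i = 0; i < indexes.size(); ++i)
    {
        if (indexes[i] == 0)
            null_mask_out[i] = 1;  // null row, leave the empty default
        else
            result[i] = dict[indexes[i]];
    }
    return result;
}
```

The real code avoids per-row `std::string` allocations by copying into one contiguous chars buffer with an offsets array, which is where the efficiency argument above comes from.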
Thanks for the comments, I will fix them this weekend.
This reverts commit 5df94b7f8955b541ae37e4bbdc13a1fec9ddbbd9.
Force-pushed from 5166d8b to 718fa1c.
Fix for the error message:
    case parquet::Type::BYTE_ARRAY:
        return fromByteArray();
    case parquet::Type::FIXED_LEN_BYTE_ARRAY:
        return fromFLBA();
It appears that the Bool and FLBA (fixed string) types are not currently supported. Are there any plans to add support for these types in the future?

Bool types are distinct due to their 1-bit width, while ClickHouse uses a UInt8 column vector.

With the current framework, supporting these types is not difficult. My plan is to support nested types after this PR, then support all parquet types and enable the 02735_parquet_encoder test.
@copperybean Any news on boolean support? I was looking at the code and it looks like ParquetPlainValuesReader needs to be extended for the boolean type. The problem is that ParquetPlainValuesReader is tightly coupled to ParquetDataBuffer, which does not allow single-bit reads.
Moreover, ParquetPlainValuesReader seems to be templated on the output ClickHouse column type. Isn't that a problem? If two distinct parquet types result in the same output column type, how would one differentiate? For instance: uint8_t and booleans.
I might be missing something, though. I only had a very superficial read of the code.
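One way to resolve the ambiguity raised here is to key the decoding on the parquet physical type rather than on the output column type alone. A minimal sketch of the distinction (enum and function names are illustrative, not the parquet library's API):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// BOOLEAN and INT32 can both land in a ClickHouse UInt8 column, but
// their wire encodings differ: BOOLEAN is bit-packed (one value per
// bit), while plain-encoded INT32 occupies 4 little-endian bytes.
enum class PhysicalType { Boolean, Int32 };

uint8_t readValueAt(PhysicalType type, const uint8_t * data, size_t row)
{
    switch (type)
    {
        case PhysicalType::Boolean:
            // bit-packed: select the row-th bit
            return (data[row / 8] >> (row % 8)) & 1;
        case PhysicalType::Int32:
            // plain-encoded int32: the low byte carries the UInt8 value
            return data[row * 4];
    }
    return 0;
}
```

So a reader templated only on the output column type cannot tell these apart; the physical type (or a separate reader specialization per parquet type) has to participate in the dispatch.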
Force-pushed from ddaf172 to ad5f6f2.
Force-pushed from dd95beb to b253ca3.
It seems that the ClickHouse special build check failed because of OOM?

It's broken in master; waiting for #64204 (and maybe something else, as that one only fixes one of the two OOMing builds).
    {"cross_join_min_bytes_to_compress", 0, 1_GiB, "A new setting."},
    {"http_max_chunk_size", 0, 0, "Internal limitation"},
    {"prefer_external_sort_block_bytes", 0, DEFAULT_BLOCK_SIZE * 256, "Prefer maximum block bytes for external sort, reduce the memory usage during merging."},
    {"input_format_parquet_use_native_reader", false, false, "When reading Parquet files, to use native reader instead of arrow reader."},
What a great feature! @copperybean Will you support complex types? @al13n321 Do you have plans to support complex types to make the native parquet reader production-ready?
@copperybean It's a cool feature. Do you have plans to support the page index? In that case, we need to skip reading some rows.
I tried a few parquet files from example datasets and found that none worked.
I want to highlight this feature on the release webinar, but so far it looks not ready for use.
Don't be obsessed with TPC-DS or TPC-H - they are irrelevant in practice. Only academics use these tests. |
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Add a native parquet reader, which can read parquet binary to ClickHouse Columns directly. It's controlled by the setting `input_format_parquet_use_native_reader` (disabled by default).

Currently, a parquet file is read by the arrow library: it is read into an arrow table first, and then the arrow table is copied to ClickHouse Columns. There are some shortcomings in performance. For example, the definition and repetition levels are decoded into an `int16_t` array first; then the nullability-related `valid_bits_` is generated. Meanwhile, the `null_map` and `offsets` (not included this time) can be generated directly when reading definition and repetition levels.

This feature was first implemented in the product BMR of Baidu AI Cloud, where it has been fully tested.
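To illustrate the point about generating the null map directly: for a flat nullable column the maximum definition level is 1, and any row whose definition level is below the maximum is null. A minimal sketch of that shortcut (not the PR's actual code, which works on encoded level runs rather than a decoded array):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Build a ClickHouse-style null map straight from parquet definition
// levels, skipping the intermediate valid_bits_ bitmap that the arrow
// path produces.
std::vector<uint8_t> nullMapFromDefLevels(
    const std::vector<int16_t> & def_levels, int16_t max_def_level)
{
    std::vector<uint8_t> null_map(def_levels.size());
    for (size_t i = 0; i < def_levels.size(); ++i)
        null_map[i] = def_levels[i] < max_def_level;  // 1 = null
    return null_map;
}
```

The saving is one fewer intermediate representation per column chunk: levels feed the null map directly instead of going through arrow's validity bitmap and a later conversion.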
Performance Test
As a result, the performance of the current implementation is improved noticeably. To generate the test data, a parquet file `src.parquet` of the TPCDS table store_sales with scale 5000 is used; there are 35207247 rows in this file. Next, the test data is generated with the following query. Then the performance is tested with the following command.
For each field, two kinds of tests are run with different `input_format_parquet_use_native_reader` settings, and a single thread is used. The parquet reading duration is taken from the commit-log duration recorded while reading parquet. The CPU model used by this test is Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz. The test results are detailed in the following table.
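The exact benchmark command is not preserved above; a query of this general shape (the aggregated column is illustrative) toggles the reader under test:

```sql
-- set input_format_parquet_use_native_reader = 0 for the arrow-based baseline
SELECT sum(ss_ticket_number)
FROM file('src.parquet')
SETTINGS input_format_parquet_use_native_reader = 1, max_threads = 1;
```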
Documentation entry for user-facing changes