Conversation
This is an automated comment for commit 5ee71bd with a description of the existing statuses. It is updated as the latest CI runs.
The generateRandom table function should be enough to generate them.
(force-pushed from baf4741 to d6f82fa)
(force-pushed from cc32954 to c56fab3)
In combination with
Would require #48625 too. (Otherwise Array(Nullable()) breaks without a temporary table (like arrays_out_02735), but wide integers (Int128 etc.) break with a temporary table, because parquet reinterprets them as FixedString, which the INSERT then tries to parse as text.)
hello @al13n321 , I'm testing single thread + custom encoding from this PR, but only to find that custom encoding turns out to be a little slower in the single-thread case. Is it expected? Here are my steps. First, I guarantee a single thread by commenting out the following code:

    if (format_settings.parquet.parallel_encoding && format_settings.max_threads > 1)
        pool = std::make_unique<ThreadPool>(
            CurrentMetrics::ParquetEncoderThreads, CurrentMetrics::ParquetEncoderThreadsActive,
            format_settings.max_threads);

Then I use clickhouse-local to test the custom encoding and arrow encoding respectively:

ws2 :) INSERT INTO TABLE Function file('/tmp/test_write8.parquet') select * from file('/data0/tpch100_zhichao/parquet/customer/part-00000-f36cbfed-be1f-4add-8033-cdf6240fb67a-c000.snappy.parquet') settings output_format_parquet_use_custom_encoder=1
INSERT INTO FUNCTION file('/tmp/test_write8.parquet')
SETTINGS output_format_parquet_use_custom_encoder = 1
SELECT *
FROM file('/data0/tpch100_zhichao/parquet/customer/part-00000-f36cbfed-be1f-4add-8033-cdf6240fb67a-c000.snappy.parquet')
SETTINGS output_format_parquet_use_custom_encoder = 1
Query id: e94b6ae5-b19b-40d5-9671-868bd88f3461
Ok.
0 rows in set. Elapsed: 1.019 sec. Processed 1.50 million rows, 324.30 MB (1.47 million rows/s., 318.13 MB/s.)
ws2 :) INSERT INTO TABLE Function file('/tmp/test_write9.parquet') select * from file('/data0/tpch100_zhichao/parquet/customer/part-00000-f36cbfed-be1f-4add-8033-cdf6240fb67a-c000.snappy.parquet') settings output_format_parquet_use_custom_encoder=0
INSERT INTO FUNCTION file('/tmp/test_write9.parquet')
SETTINGS output_format_parquet_use_custom_encoder = 0
SELECT *
FROM file('/data0/tpch100_zhichao/parquet/customer/part-00000-f36cbfed-be1f-4add-8033-cdf6240fb67a-c000.snappy.parquet')
SETTINGS output_format_parquet_use_custom_encoder = 0
Query id: 8ed9ecce-3935-47a4-9dc0-e343719e1d7f
Ok.
0 rows in set. Elapsed: 0.969 sec. Processed 1.50 million rows, 324.30 MB (1.55 million rows/s., 334.61 MB/s.)
The dataset I'm using is the customer table from TPCH 100.
(force-pushed from 320177b to 9b4deda)
Thanks for reporting! Reproduced it, got a ~10% difference as well. Interestingly, the file size was also ~7% bigger! Turns out there was a difference in dictionary-size-limit logic between arrow and this PR: arrow re-checks the dictionary size every 1024 rows, while this PR only re-checked on every data page (~1 MB), so it would dictionary-encode more data before falling back to non-dictionary encoding. Fixed. Now I'm getting a ~10% speed difference the other way around (custom faster than arrow), and file size is nearly identical.

If this makes a 7% difference in file size, would it get even better if the dictionary -> non-dictionary fallback discarded the dictionary-encoded data and started over, instead of writing it out? I tried it and got a 1.3% smaller file here, and a 3.7% smaller hits.parquet, with little speed difference. Made the change.

Btw, CPU profiles: arrow encoder http://ec2-35-167-7-127.us-west-2.compute.amazonaws.com:8080/pr49367-arrow-encoder.svg , custom encoder http://ec2-35-167-7-127.us-west-2.compute.amazonaws.com:8080/pr49367-custom-encoder.svg
Thank you @al13n321 for the quick and detailed response! Can we conclude that, by eliminating the extra cost of arrow conversion, we get a ~10% performance improvement (in the single-thread scenario)? Can we expect a more significant improvement in the final version? And do you believe the performance gain will be bigger for "read parquet by custom decoder"?
I didn't particularly investigate where the savings come from and how big they are. But yeah, my guess is that the 10% is mostly from avoiding copying/conversion. Another suspect is statistics calculation: the arrow code for that looks less efficient than this PR's. There's probably some insight to be gained from the CPU profiles in the previous comment, if we were to optimize it more. (Also, the 10%-14% is for total query time, some of which is spent on decoding the input file. On the TPCH file, decoding takes 1/3 of the total time, so the encoding speed difference is more like 15%. Decoding is so slow there because the input file has only 2 row groups.)
I don't plan to optimize this further, unless it turns out someone actually needs it to go faster. I'd guess it's possible to gain maybe another 20%.
Good question. Here's a CPU profile for
So it seems there's > 30% to be gained there! (Here are some more CPU profiles, in case anyone just likes looking at them or something: http://ec2-35-167-7-127.us-west-2.compute.amazonaws.com:8080/pr49367-read-hits-1thread.svg , http://ec2-35-167-7-127.us-west-2.compute.amazonaws.com:8080/pr49367-read-tpch.svg , http://ec2-35-167-7-127.us-west-2.compute.amazonaws.com:8080/pr49367-read-tpch-1thread.svg )
Thanks for the reply, it really helps a lot. We're working on "read parquet by custom decoder" and will come back to the community once we have some progress to share with you.
@liuneng1994 said the same in #49539 (comment) . You guys are working together, right? If not, make sure to coordinate :) |
(force-pushed from b5b423f to af1c9ee)
We do work together 👍
Compiler crash in CI; it doesn't reproduce locally (although maybe my flags are slightly off, I have to try some more). Oh boy, this is going to take some investigating.
This is a very good change, we need it! |
Sent an LLVM bug report: llvm/llvm-project#63630
The compiler does not crash anymore, but the fuzzer has found bugs. |
(force-pushed from b680870 to dc23d5c)
The compiler continued crashing. Let's work around it.
From the LLVM IR dump (which people in llvm/llvm-project#63630 taught me to obtain), it looks like the problem is in the vectorization of this loop (for the Int8 type):
Added a pragma to disable vectorization for it if MSAN is enabled.
Avogar left a comment:
Looks great, let's merge! Although I didn't get into all the internals (that would require a lot of time for me to understand).
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Writing parquet files is 10x faster; it's multi-threaded now. Almost the same speed as reading.