Fix join squashing with sparse columns#81886
Merged
Conversation
Contributor
Workflow [PR], commit [0ade907] Summary: ❌
novikd (Member) approved these changes on Jun 26, 2025 and left a comment:
LGTM, but I see grace_hash_join became slower in the performance tests. Is it related?
Author (Member):
It looks so. I haven't found any specific problem I could address, so I just decided to ignore it.
baibaichen pushed commits referencing this pull request:
- to Kyligence/gluten on Jul 3, 2025
- to Kyligence/gluten on Jul 4, 2025
- to Kyligence/gluten on Jul 5, 2025
- to apache/gluten on Jul 6, 2025, with the following commit message:
[GLUTEN-1632][CH] Daily Update Clickhouse Version (20250705)
* Fix benchmark build
* Fix Benchmark build due to ClickHouse/ClickHouse#79417
* Revert "Fix Build due to ClickHouse/ClickHouse#80931" (reverts commit 02d12f6)
* Fix Build due to ClickHouse/ClickHouse#81886
* Fix Link issue due to ClickHouse/ClickHouse#83121
* Fix Build due to ClickHouse/ClickHouse#82604
* Fix Build due to ClickHouse/ClickHouse#82945
* Fix Build due to ClickHouse/ClickHouse#83214

Co-authored-by: kyligence-git <[email protected]>
Co-authored-by: Chang chen <[email protected]>
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Add new setting `min_joined_block_size_rows` (analogous to `min_joined_block_size_bytes`; defaults to 65409) to control the minimum block size (in rows) for JOIN input and output blocks (if the join algorithm supports it). Small blocks will be squashed.

The motivation for introducing squashing around join transforms for parallel hash was the following. Previously, we physically split blocks to distribute them among parallel hash join "shards". All input blocks automatically became `max_threads` times smaller after splitting. Since not all rows usually have a match during joining, output blocks were even smaller. On top of that, multiple joins might be chained on each other. Because of that, we saw significant slowdowns with parallel hash on some TPC-H queries. At that time, I decided to use only the byte threshold to configure squashing. The problem with `Sparse` columns is that they can have a compression factor close to the number of rows. For such cases, it makes sense to also have a number-of-rows threshold: since our goal is to avoid passing too-small blocks along the pipeline, we shouldn't worry about blocks that are bigger than `DEFAULT_BLOCK_SIZE`.
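The two-threshold squashing described above can be sketched roughly as follows. This is a minimal illustrative model, not the actual ClickHouse implementation: the `Block` and `Squasher` names, the flush policy, and the `min_bytes` default shown here are assumptions. The point it demonstrates is why bytes alone are insufficient: a `Sparse` column can make a block tiny in bytes even when it holds many rows, so an OR over both thresholds is needed.

```python
# Illustrative sketch of joined-block squashing with both a row and a
# byte threshold. Not the real ClickHouse code; names and defaults are
# hypothetical.

from dataclasses import dataclass


@dataclass
class Block:
    rows: int
    bytes: int


class Squasher:
    """Buffers undersized blocks and emits a merged block once the
    accumulated size crosses either the row or the byte threshold."""

    def __init__(self, min_rows: int = 65409, min_bytes: int = 512 * 1024):
        self.min_rows = min_rows    # analogous to min_joined_block_size_rows
        self.min_bytes = min_bytes  # analogous to min_joined_block_size_bytes
        self.buf_rows = 0
        self.buf_bytes = 0

    def add(self, block: Block):
        # With a byte-only threshold, a Sparse column keeps `bytes` tiny
        # even for many rows, so blocks would be buffered far past
        # DEFAULT_BLOCK_SIZE rows. The row threshold bounds that.
        self.buf_rows += block.rows
        self.buf_bytes += block.bytes
        if self.buf_rows >= self.min_rows or self.buf_bytes >= self.min_bytes:
            merged = Block(self.buf_rows, self.buf_bytes)
            self.buf_rows = self.buf_bytes = 0
            return merged
        return None  # still too small: keep buffering

    def flush(self):
        # Emit whatever is left at end of stream.
        if self.buf_rows == 0:
            return None
        merged = Block(self.buf_rows, self.buf_bytes)
        self.buf_rows = self.buf_bytes = 0
        return merged
```

With a row threshold of 100, a block of 105 sparse rows weighing only 90 bytes is still emitted promptly instead of being held until half a megabyte accumulates.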