Save marks for each substream in Compact part to be able to read individual subcolumns #77940

Merged
Avogar merged 24 commits into ClickHouse:master from Avogar:compact-part-subcolumns on Apr 23, 2025

Conversation

Member

@Avogar Avogar commented Mar 19, 2025

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Change the Compact part format to save marks for each substream, so that individual subcolumns can be read. The old Compact format is still supported for reads, and the MergeTree setting write_marks_for_substreams_in_compact_parts controls whether writes use the new format. The setting is disabled by default for safer upgrades, since it changes how compact parts are stored. It will be enabled by default in one of the next releases.
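
To illustrate why marks for each substream help, here is a toy C++ model. It is only a sketch under stated assumptions, not ClickHouse's reader or on-disk layout: the types `Substream` and `Column` and all three functions are hypothetical, and it assumes that with a single mark for each column the reader has to go through the whole column in a granule to get one subcolumn.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy model only: hypothetical names, not ClickHouse's actual structures.
struct Substream
{
    std::string name;   // e.g. "t.a" for element `a` of a Tuple column `t`
    std::size_t bytes;  // serialized size of this substream within one granule
};

struct Column
{
    std::string name;
    std::vector<Substream> substreams;
};

// Assumption: with one mark for each column, the reader can only seek to the
// start of the column, so it processes every substream of that column.
std::size_t bytesWithColumnMarks(const Column & column)
{
    std::size_t total = 0;
    for (const auto & s : column.substreams)
        total += s.bytes;
    return total;
}

// With one mark for each substream, the reader can seek straight to the
// requested substream. Returns 0 if the substream does not exist.
std::size_t bytesWithSubstreamMarks(const Column & column, const std::string & target)
{
    for (const auto & s : column.substreams)
        if (s.name == target)
            return s.bytes;
    return 0;
}

// Bytes that no longer need to be read for one granule.
std::size_t bytesSaved(const Column & column, const std::string & target)
{
    const std::size_t whole = bytesWithColumnMarks(column);
    const std::size_t direct = bytesWithSubstreamMarks(column, target);
    assert(direct <= whole); // a substream is never larger than its column
    return whole - direct;
}
```

In this model, a column with substreams of 100, 4000 and 50 bytes costs 4150 bytes per granule to read `t.a` with column-level marks and 100 bytes with substream-level marks.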

Closes #76141

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Contributor

clickhouse-gh bot commented Mar 19, 2025

Workflow [PR], commit [e07f330]

@clickhouse-gh clickhouse-gh bot added the pr-performance Pull request with some performance improvements label Mar 19, 2025
@alexey-milovidov alexey-milovidov changed the title Save marks for each substream in Compact part to be alble to read individual subcolumns Save marks for each substream in Compact part to be able to read individual subcolumns Mar 26, 2025
@Avogar Avogar marked this pull request as ready for review April 4, 2025 19:24
@CurtizJ CurtizJ self-assigned this Apr 4, 2025
Member Author

Avogar commented Apr 8, 2025

[Image: Screenshot 2025-04-08 at 17 19 26] 🔥

Member Author

Avogar commented Apr 11, 2025

Let's try to include it in 25.4 release if possible

Member

@CurtizJ CurtizJ left a comment

In general LGTM.

This mode allows to use significantly less memory for storing discriminators
in parts when there is mostly one variant or a lot of NULL values.
)", 0) \
DECLARE(Bool, write_marks_for_substreams_in_compact_parts, true, R"(
Member

Maybe set to false by default in the first supported release to allow rollback from it.

Member Author

Won't the compatibility setting be enough? For services with an old compatibility setting, this will be disabled by default.

Member Author

Avogar commented Apr 15, 2025

@CurtizJ did you have a chance to check changes in the sync as well?

Member

CurtizJ commented Apr 15, 2025

Yes, I checked the changes in sync. They are ok as well.

@Avogar Avogar added this pull request to the merge queue Apr 23, 2025
Merged via the queue into ClickHouse:master with commit de8d34d Apr 23, 2025
116 of 121 checks passed
@Avogar Avogar deleted the compact-part-subcolumns branch April 23, 2025 22:52
@robot-clickhouse-ci-1 robot-clickhouse-ci-1 added the pr-synced-to-cloud The PR is synced to the cloud repo label Apr 23, 2025
baibaichen added a commit to Kyligence/gluten that referenced this pull request Apr 24, 2025
It introduces columns_substreams.txt for MergeTree's compact mode, causing test failures as both increased file sizes and additional file count alter compaction patterns compared to prior implementations.

Changes:
- Updated file counting logic to exclude the new "columns_substreams.txt"
- Updated comments with correct file sizes and improved clarity
- Updated hardcoded config strings to use DeltaSQLConf constants
baibaichen added a commit to apache/gluten that referenced this pull request Apr 24, 2025
* [GLUTEN-1632][CH]Daily Update Clickhouse Version (20250424)

* Fix ut due to ClickHouse/ClickHouse#77940

It introduces columns_substreams.txt for MergeTree's compact mode, causing test failures as both increased file sizes and additional file count alter compaction patterns compared to prior implementations.

Changes:
- Updated file counting logic to exclude the new "columns_substreams.txt"
- Updated comments with correct file sizes and improved clarity
- Updated hardcoded config strings to use DeltaSQLConf constants

---------

Co-authored-by: kyligence-git <[email protected]>
Co-authored-by: Chang chen <[email protected]>
@Avogar Avogar mentioned this pull request May 12, 2025
56 tasks

Labels

pr-performance: Pull request with some performance improvements
pr-synced-to-cloud: The PR is synced to the cloud repo

Development

Successfully merging this pull request may close these issues.

Improve performance of subcolumns reading from compact parts

3 participants