Parallelize object storage output #99548

Merged
alexey-milovidov merged 8 commits into master from parallelize-object-storage-output
Mar 18, 2026

Conversation

@alexey-milovidov
Member

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Improve the performance of data lakes. In previous versions, reading from object storage didn't resize the pipeline to the number of processing threads.

alexey-milovidov and others added 2 commits March 16, 2026 07:35
`StorageObjectStorage` (S3, Azure, GCS) with a single file created only
one pipeline source without calling `pipe.resize()`. This meant the
entire query ran through a single pipeline thread — downstream
processors like `AggregatingTransform` could not run in parallel.

`StorageFile` already does this resize (line 1913 in StorageFile.cpp),
enabling parallel aggregation even with a single input file. This
commit adds the same `pipe.resize(max_threads)` to
`ReadFromObjectStorageStep::initializePipeline`.

The effect on the ClickBench datalake benchmark is dramatic:
- Q28 (`REGEXP_REPLACE` + `GROUP BY`): 79s → 4.8s (was 79× slower
  than local Parquet, now 1.1×)
- Q16 (heavy `GROUP BY`): 21× → 1.2×
- Q13: 12× → 1.3×

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Checks that EXPLAIN PIPELINE shows `Resize 1 → N` after
`ReadFromObjectStorage`, proving the single source output is distributed
across multiple pipeline threads.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@clickhouse-gh
Contributor

clickhouse-gh bot commented Mar 16, 2026

Workflow [PR], commit [aee11a3]

Summary:


AI Review

Summary

This PR parallelizes output in ReadFromObjectStorageStep by adding a Resize stage (guarded by parallelize_output_from_storages and format capability checks), similar to existing behavior in StorageFile and StorageURL. It also adds a stateless test validating Resize 1 → 4 for S3. After reviewing the diff and full touched files, I did not find new blocker/major issues that are both high-confidence and not already covered by existing inline discussion.

Missing context

  • ⚠️ No CI report/log links were provided in the review request, so this review is limited to static code/test analysis.

ClickHouse Rules

Item Status Notes
Deletion logging
Serialization versioning
Core-area scrutiny
No test removal
Experimental gate
No magic constants
Backward compatibility
SettingsChangesHistory.cpp
Safe rollout
Compilation time

Final Verdict

  • Status: ✅ Approve

@clickhouse-gh clickhouse-gh bot added the pr-performance Pull request with some performance improvements label Mar 16, 2026
Comment on lines +7 to +8
# Without this, queries on a single S3/data-lake Parquet file run
# entirely single-threaded (e.g. Q28 in ClickBench: 79× slower).
Member

I do not think this is true; we have ParallelReadBuffer (which, AFAIK, should work with all formats apart from the new Parquet reader), with parallelization via readBigAt. So if execution is single-threaded, then either readBigAt is not supported or the file is simply not big enough for us to decide to read it in parallel. See

std::unique_ptr<ParallelReadBuffer> wrapInParallelReadBufferIfSupported(
    ReadBuffer & buf, ThreadPoolCallbackRunnerUnsafe<void> schedule, size_t max_working_readers,
    size_t range_step, size_t file_size)
{
    auto * seekable = dynamic_cast<SeekableReadBuffer *>(&buf);
    if (!seekable || !seekable->supportsReadAt())
        return nullptr;
    return std::make_unique<ParallelReadBuffer>(
        *seekable, schedule, max_working_readers, range_step, file_size);
}

Member


So the non-datalake read is much faster than the same data in data lake format?
As we have absolutely the same read-related code for both, the performance difference must be data-lake-specific, while this PR affects both the non-datalake and datalake implementations. E.g., if plain Parquet reading is already fast enough, then I think the issue must be fixed differently from what this PR does. Such a big performance difference looks extremely strange (assuming the data lake has the same single Parquet data file, not data split into multiple small files). This needs to be investigated.

Member Author


Yes.

Note: The data lake mode in ClickBench actually means reading Parquet files from S3 (not related to any data lake metadata formats).

Member Author


And the investigation pointed to the difference between StorageFile and StorageObjectStorage.

Member

@kssenii Mar 16, 2026


> Note: The data lake mode in ClickBench actually means reading Parquet files from S3 (not related to any data lake metadata formats).

Did not know this, thought it was iceberg s3 parquet vs s3 parquet...

> And the investigation pointed to the difference between StorageFile and StorageObjectStorage.

Ok, looks like it makes sense indeed.

Member


But also what is the version on which that benchmark was run for data lake mode? We had 2 fixes for readBigAt in latest versions.

Member Author


It's always master.

@alexey-milovidov
Member Author

alexey-milovidov commented Mar 16, 2026

I also double-checked the PR binary against the master binary manually, just in case... and it is phenomenally faster.

alexey-milovidov and others added 5 commits March 16, 2026 23:29
Preserve the original requested stream count in `max_num_streams`,
matching the pattern used by `StorageFile` and `StorageURL`. This
ensures that `pipe.resize` respects the stream cap set by
`max_streams_for_files_processing_in_cluster_functions` for
distributed processing, instead of bypassing it with raw `max_threads`.

#99548 (comment)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…ouse/ClickHouse into parallelize-object-storage-output
@clickhouse-gh
Contributor

clickhouse-gh bot commented Mar 18, 2026

LLVM Coverage Report

Metric    | Baseline | Current | Δ
Lines     | 83.70%   | 83.80%  | +0.10%
Functions | 23.90%   | 23.90%  | +0.00%
Branches  | 76.30%   | 76.30%  | +0.00%

PR changed-lines coverage: 94.12% (16/17, 0 noise lines excluded)
Diff coverage report
Uncovered code

@alexey-milovidov alexey-milovidov added this pull request to the merge queue Mar 18, 2026
Merged via the queue into master with commit 52da74a Mar 18, 2026
163 checks passed
@alexey-milovidov alexey-milovidov deleted the parallelize-object-storage-output branch March 18, 2026 09:23
@robot-ch-test-poll1 robot-ch-test-poll1 added the pr-synced-to-cloud The PR is synced to the cloud repo label Mar 18, 2026
mkmkme pushed a commit to Altinity/ClickHouse that referenced this pull request Mar 26, 2026
…t-storage-output

Parallelize object storage output
zvonand added a commit to Altinity/ClickHouse that referenced this pull request Mar 31, 2026
Antalya 26.1 Backport of ClickHouse#99548 - Parallelize object storage output

Labels

pr-performance (Pull request with some performance improvements), pr-synced-to-cloud (The PR is synced to the cloud repo)


5 participants