Parallelize object storage output #99548
Conversation
`StorageObjectStorage` (S3, Azure, GCS) with a single file created only one pipeline source without calling `pipe.resize()`. This meant the entire query ran through a single pipeline thread — downstream processors like `AggregatingTransform` could not run in parallel. `StorageFile` already does this resize (line 1913 in StorageFile.cpp), enabling parallel aggregation even with a single input file.

This commit adds the same `pipe.resize(max_threads)` to `ReadFromObjectStorageStep::initializePipeline`. The effect on the ClickBench datalake benchmark is dramatic:

- Q28 (`REGEXP_REPLACE` + `GROUP BY`): 79s → 4.8s (was 79× slower than local Parquet, now 1.1×)
- Q16 (heavy `GROUP BY`): 21× → 1.2×
- Q13: 12× → 1.3×

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Checks that EXPLAIN PIPELINE shows `Resize 1 → N` after `ReadFromObjectStorage`, proving the single source output is distributed across multiple pipeline threads. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Workflow [PR], commit [aee11a3]: ✅ AI Review Summary
> # Without this, queries on a single S3/data-lake Parquet file run
> # entirely single-threaded (e.g. Q28 in ClickBench: 79× slower).
I do not think this is true: we have `ParallelReadBuffer` (which should work with all formats except, AFAIK, the new Parquet reader), with parallelization via `readBigAt`. So if execution is single-threaded, then either `readBigAt` is not supported or the file is simply not big enough for a parallel read to be chosen. See
ClickHouse/src/IO/ParallelReadBuffer.cpp
Lines 296 to 306 in ca135b4
So non-datalake read is much faster than same data as datalake format?
As we have absolutely the same read-related code for both, the performance difference must be data-lake specific, while this PR affects both the non-datalake and datalake implementations. If plain Parquet reading is already fast enough, then I think the issue must be fixed differently from what this PR does. Such a big performance difference looks extremely strange (assuming the datalake run uses the same single Parquet data file, not one split into multiple small files). This needs to be investigated.
Yes.
Note: The data lake mode in ClickBench actually means reading Parquet files from S3 (not related to any data lake metadata formats).
And the investigation pointed to the difference between StorageFile and StorageObjectStorage.
> Note: The data lake mode in ClickBench actually means reading Parquet files from S3 (not related to any data lake metadata formats).
Did not know this, thought it was iceberg s3 parquet vs s3 parquet...
> And the investigation pointed to the difference between StorageFile and StorageObjectStorage.
Ok, looks like it makes sense indeed.
But also, what is the version on which that benchmark was run in data lake mode? We had two fixes for readBigAt in the latest versions.
It's always master.
I also double checked the PR binary vs. the master binary manually, just in case... and it is phenomenally faster.
Preserve the original requested stream count in `max_num_streams`, matching the pattern used by `StorageFile` and `StorageURL`. This ensures that `pipe.resize` respects the stream cap set by `max_streams_for_files_processing_in_cluster_functions` for distributed processing, instead of bypassing it with raw `max_threads`. #99548 (comment) Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…ouse/ClickHouse into parallelize-object-storage-output
LLVM Coverage Report
PR changed-lines coverage: 94.12% (16/17, 0 noise lines excluded)
…t-storage-output Parallelize object storage output
Antalya 26.1 Backport of ClickHouse#99548 - Parallelize object storage output
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Improve the performance of data lakes. In previous versions, reading from object storage didn't resize the pipeline to the number of processing threads.