[#67770]: Parallel replicas support for Merge tables #95128
matt-metivier wants to merge 3 commits into ClickHouse:master
Workflow [PR], commit [2b38544] Summary: ❌
…rge() table function

Enable parallel replicas for StorageMerge and merge() table function. When enabled via parallel_replicas_allow_merge_tables setting, queries to Merge tables or merge() function distribute across replicas.

Key changes:
- Add parallel_replicas_allow_merge_tables setting (disabled by default)
- Allow TABLE_FUNCTION nodes with StorageMerge in parallel replicas eligibility check
- Use empty StorageID for table functions to skip table existence check on replicas
- Disable parallel replicas coordination for child tables on followers to prevent duplicate announcement errors from the coordinator

Limitation: Only works when Merge table/function maps to a single underlying MergeTree table per replica. Multiple matched tables cause coordinator errors.
@devcrafter I'm planning to spend this weekend working on this, hopefully, but if you have any suggestions, please let me know.
@matt-metivier Here are my considerations, with some explanations that may help. Parallel replicas is a mechanism to parallelize query execution for a replicated MergeTree table among cluster nodes. So, the reading from a table will be parallelized among nodes, and the read data will be processed to some mergeable stage on each node. The parallelization is done by having different nodes read different ranges of parts. I can imagine the following ... Please let me know if you have other considerations or questions. @KochetovNicolai Please comment if you have something to add.
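To make "reading different ranges of parts by different nodes" concrete, here is a minimal toy sketch. It assumes a fixed list of mark ranges and a static round-robin split; `MarkRange` and `assignRanges` are hypothetical names for illustration only, not ClickHouse code, and the static split is not meant to describe how the coordinator actually distributes work.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model: a contiguous range of marks [begin, end) inside one data part.
struct MarkRange
{
    std::size_t part_index;
    std::size_t begin;
    std::size_t end;
};

// Assign ranges to replicas round-robin so that every range is read by exactly one replica.
// This static split is an illustration only; it is not the real coordinator logic.
std::vector<std::vector<MarkRange>> assignRanges(
    const std::vector<MarkRange> & ranges, std::size_t replica_count)
{
    assert(replica_count > 0);
    std::vector<std::vector<MarkRange>> per_replica(replica_count);
    for (std::size_t i = 0; i < ranges.size(); ++i)
        per_replica[i % replica_count].push_back(ranges[i]);
    return per_replica;
}
```

Each replica would then read only its own ranges and process them up to a mergeable stage, and the initiator would combine the partial results.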
I did not read the current implementation, but I find it difficult to implement parallel replicas for Merge tables with the current infrastructure. I think the best we can do now, as @devcrafter mentioned, is to treat StorageMerge as a UNION and enable parallel replicas for each branch independently (each branch will have its own connection, coordinator, tasks, etc.). This way, we could parallelize only the reading. Offloading other steps, for example aggregation, will require changes in both the planner and the protocol (to support multiplexing between many storages).
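The UNION-per-branch idea can be pictured with a small toy model. It is only a sketch under assumptions: `BranchPlan` and `planMergeAsUnion` are hypothetical names, and the integer `coordinator_id` merely stands in for "each branch has its own connection, coordinator, tasks"; none of this reflects the actual planner code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical per-branch state: one underlying table matched by the Merge table.
struct BranchPlan
{
    std::string table;
    std::size_t coordinator_id; // stands in for a dedicated coordinator, never shared
    std::size_t replica_count;
};

// Treat the Merge table as a UNION: build one independent branch per matched table.
std::vector<BranchPlan> planMergeAsUnion(
    const std::vector<std::string> & matched_tables, std::size_t replica_count)
{
    assert(replica_count > 0);
    std::vector<BranchPlan> branches;
    branches.reserve(matched_tables.size());
    for (std::size_t i = 0; i < matched_tables.size(); ++i)
        branches.push_back(BranchPlan{matched_tables[i], i, replica_count});
    return branches;
}
```

Because no state is shared between branches in this model, only the reading is parallelized; steps above the union (such as aggregation) would still run on the initiator.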
…UNION-per-table approach
…plicas-merge-tables

# Conflicts:
#	src/Core/Settings.cpp
#	src/Planner/PlannerJoinTree.cpp
Changes
- Added `parallel_replicas_allow_merge_tables` setting (default true) to allow Merge tables to use parallel replicas.
- Updated `findParallelReplicasQuery` to recognize `StorageMerge` and `TABLE_FUNCTION` nodes as valid candidates for parallel execution.
- Updated `PlannerJoinTree` and `InterpreterSelectQuery` to respect the new setting and enable parallel replicas for Merge tables.
- Added test `03725_parallel_replicas_merge_table` to verify behavior.

Closes #67770
Changelog category (leave one):
New Feature
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Added support for parallel replicas with Merge tables.