DDLWorker fails task execution with on cluster (regression in release tag v25.8.10.7-lts) #89693
Description
Company or project name
Cisco Systems Inc
Describe what's wrong
With release tag v25.8.10.7-lts, all DDL queries with the ON CLUSTER directive fail on first-time installation with the following error -
Received exception from server (version 25.8.10): Code: 159. DB::Exception: Received from localhost:9000. DB::Exception: Distributed DDL task /clickhouse/cc/task_queue/ddl/query-0000000000 is not finished on 2 of 2 hosts (0 of them are currently executing the task, 0 are inactive). They are going to execute the query in background. Was waiting for 180.940991669 seconds, which is longer than distributed_ddl_task_timeout. (TIMEOUT_EXCEEDED)
The task never completes, and the clickhouse-server debug logs report the following -
2025.11.05 23:25:52.084983 [ 757 ] {} <Debug> DDLWorker: Will not execute task query-0000000002: There is no a local address in host list
The workaround is to restart the clickhouse-server workload, after which all DDLWorker task queries succeed as expected.
The regression was introduced by commit be89d268ad35c7c593a00c3489dbd72865af880a (be89d26).
PR - https://github.com/ClickHouse/ClickHouse/pull/88153/files?diff=split&w=0
Does it reproduce on the most recent release?
Yes
How to reproduce
On a setup deployed with docker image clickhouse/clickhouse-keeper:25.8.10.7 on kubernetes cluster with
3-node clickhouse-keeper
2-node clickhouse-server
After the clickhouse-server launches, use clickhouse-client to execute a DDL query that creates a database or a table:
ClickHouse client version 25.8.10.7 (official build).
Connecting to localhost:9000 as user chuser.
Connected to ClickHouse server version 25.8.10.
:) create database testdb on cluster 'cdr'
CREATE DATABASE testdb ON CLUSTER cdr
Query id: 23e7b9f6-9909-46fa-b385-41c473cf7c6c
Elapsed: 180.970 sec.
Received exception from server (version 25.8.10):
Code: 159. DB::Exception: Received from localhost:9000. DB::Exception: Distributed DDL task /clickhouse/cc/task_queue/ddl/query-0000000000 is not finished on 2 of 2 hosts (0 of them are currently executing the task, 0 are inactive). They are going to execute the query in background. Was waiting for 180.940991669 seconds, which is longer than distributed_ddl_task_timeout. (TIMEOUT_EXCEEDED)
Debug logs -
2025.11.07 03:40:28.872318 [ 750 ] {} <Debug> DDLWorker: Scheduling tasks
2025.11.07 03:40:28.872331 [ 753 ] {} <Debug> DDLWorker: Cleaning queue
2025.11.07 03:40:28.873266 [ 750 ] {} <Debug> DDLWorker: Will schedule 1 tasks starting from query-0000000000
2025.11.07 03:40:28.874895 [ 750 ] {} <Debug> DDLWorker: Will not execute task query-0000000000: There is no a local address in host list
2025.11.07 03:40:28.874912 [ 750 ] {} <Debug> DDLWorker: Waiting for queue updates
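The "There is no a local address in host list" skip can be illustrated with a simplified sketch (this is not ClickHouse's actual C++ implementation, just an assumed model of the check): the worker walks the task's host list from ZooKeeper, resolves each entry, and only executes the task if some entry resolves to one of the node's own addresses. Host entries in the queue are percent-encoded, so they must be decoded before resolution.

```python
import socket
from urllib.parse import unquote

def find_local_host(task_hosts, local_addresses):
    """Return the first task host that resolves to one of this node's
    addresses, or None. Roughly models the check DDLWorker performs
    before logging "There is no a local address in host list" and
    skipping the task. Entries in the ZooKeeper queue are
    percent-encoded, e.g. 'chi%2Dcc%2Dcdr%2D0%2D0:9000'."""
    for entry in task_hosts:
        host, _, _port = unquote(entry).rpartition(':')
        try:
            resolved = {info[4][0] for info in socket.getaddrinfo(host, None)}
        except socket.gaierror:
            continue  # unresolvable hostname: cannot be this node
        if resolved & local_addresses:
            return host
    return None
```

Under this model, the task is skipped on every host whenever the names the initiator wrote into the `hosts` field fail to resolve back to any local address on the executing nodes, which matches the behavior observed above.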
Zookeeper task queue details -
SELECT *
FROM system.zookeeper
WHERE path = '/clickhouse/cc/task_queue/ddl'
Query id: 751f0c29-d9c9-49e4-9fad-499dd629eadf
┌─name─────────────┬─value──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─path──────────────────────────┐
1. │ query-0000000000 │ version: 5 ↴│ /clickhouse/cc/task_queue/ddl │
│ │↳query: CREATE DATABASE testdb UUID \'524fd5ae-c174-4bd6-9908-33eb0a26d62c\' ON CLUSTER cdr ↴│ │
│ │↳hosts: ['chi%2Dcc%2Dcdr%2D0%2D0:9000','chi%2Dcc%2Dcdr%2D0%2D1:9000'] ↴│ │
│ │↳initiator: chi%2Dcc%2Dcdr%2D0%2D0%2D0%2Echi%2Dcc%2Dcdr%2D0%2D0%2Eclickhouse%2Esvc%2Ecluster%2Elocal:9000 ↴│ │
│ │↳settings: connect_timeout_with_failover_ms = 1000, load_balancing = 'nearest_hostname', distributed_aggregation_memory_efficient = true, allow_experimental_time_time64_type = true, do_not_merge_across_partitions_select_final = true, os_thread_priority = 2, log_queries = false, insert_deduplicate = true, final = false, prefer_localhost_replica = false, parallel_view_processing = true, date_time_output_format = 'iso'↴│ │
│ │↳tracing: 00000000-0000-0000-0000-000000000000 ↴│ │
│ │↳0 ↴│ │
│ │↳ ↴│ │
│ │↳0 ↴│ │
│ │↳initial_query_id: 23e7b9f6-9909-46fa-b385-41c473cf7c6c ↴│ │
└──────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴───────────────────────────────┘
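The `hosts` and `initiator` fields in the task node are percent-encoded; standard URL decoding (e.g. Python's `urllib.parse.unquote`) shows the hostnames the initiator recorded:

```python
from urllib.parse import unquote

# Values copied from the ZooKeeper task node above
hosts = ['chi%2Dcc%2Dcdr%2D0%2D0:9000', 'chi%2Dcc%2Dcdr%2D0%2D1:9000']
initiator = ('chi%2Dcc%2Dcdr%2D0%2D0%2D0%2Echi%2Dcc%2Dcdr%2D0%2D0'
             '%2Eclickhouse%2Esvc%2Ecluster%2Elocal:9000')

print([unquote(h) for h in hosts])
# ['chi-cc-cdr-0-0:9000', 'chi-cc-cdr-0-1:9000']
print(unquote(initiator))
# chi-cc-cdr-0-0-0.chi-cc-cdr-0-0.clickhouse.svc.cluster.local:9000
```

Note that the decoded `hosts` entries are short cluster names while the `initiator` is a full Kubernetes service FQDN; whether that difference matters to the new matching logic is not confirmed here, but it is the list the DDLWorker on each replica must match itself against.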
Have verified that all hostname FQDNs match the settings in the <remote_servers> config.
There has been no configuration change since this breakage.
The last known working version before this regression was 25.8.9.
Have confirmed the introduction of this regression by reverting the offending commit be89d268ad35c7c593a00c3489dbd72865af880a on a private branch off of v25.8.10.7-lts.
The breakage is also present on the master branch and in all tags between v25.8.10.7-lts and master.
There has been an attempt to fix the code via commit 741f47b; however, the issue still persists as described above.
Expected behavior
The database should be created right away with the ON CLUSTER directive, completing successfully without the DDL task query timing out.
Error message and/or stacktrace
Received exception from server (version 25.8.10):
Code: 159. DB::Exception: Received from localhost:9000. DB::Exception: Distributed DDL task /clickhouse/cc/task_queue/ddl/query-0000000000 is not finished on 2 of 2 hosts (0 of them are currently executing the task, 0 are inactive). They are going to execute the query in background. Was waiting for 180.940991669 seconds, which is longer than distributed_ddl_task_timeout. (TIMEOUT_EXCEEDED)
Since this regression has been carried forward into all tags starting from `v25.8.10.7-lts`, a critical feature remains broken for the foreseeable future.
Since the offending commit does not fix anything of significance and was intended as an optimization, the request is to either fix this properly or revert the commit altogether.