DDLWorker fails task execution with on cluster (regression in release tag v25.8.10.7-lts) #89693
Description
Company or project name
Cisco Systems Inc
Describe what's wrong
With release tag v25.8.10.7-lts, all DDL queries with the ON CLUSTER directive fail on first-time installation with the following error -
Received exception from server (version 25.8.10): Code: 159. DB::Exception: Received from localhost:9000. DB::Exception: Distributed DDL task /clickhouse/cc/task_queue/ddl/query-0000000000 is not finished on 2 of 2 hosts (0 of them are currently executing the task, 0 are inactive). They are going to execute the query in background. Was waiting for 180.940991669 seconds, which is longer than distributed_ddl_task_timeout. (TIMEOUT_EXCEEDED)
The task never completes, and the clickhouse-server debug logs report the following -
2025.11.05 23:25:52.084983 [ 757 ] {} <Debug> DDLWorker: Will not execute task query-0000000002: There is no a local address in host list
The workaround is to restart the clickhouse-server workload, after which all DDLWorker task queries succeed as expected.
The regression was introduced by commit be89d268ad35c7c593a00c3489dbd72865af880a (be89d26).
PR - https://github.com/ClickHouse/ClickHouse/pull/88153/files?diff=split&w=0
Does it reproduce on the most recent release?
Yes
How to reproduce
On a setup deployed with docker image clickhouse/clickhouse-keeper:25.8.10.7 on kubernetes cluster with
3-node clickhouse-keeper
2-node clickhouse-server
After the clickhouse-server launches, use clickhouse-client to execute a DDL query that creates a database or a table:
ClickHouse client version 25.8.10.7 (official build).
Connecting to localhost:9000 as user chuser.
Connected to ClickHouse server version 25.8.10.
:) create database testdb on cluster 'cdr'
CREATE DATABASE testdb ON CLUSTER cdr
Query id: 23e7b9f6-9909-46fa-b385-41c473cf7c6c
Elapsed: 180.970 sec.
Received exception from server (version 25.8.10):
Code: 159. DB::Exception: Received from localhost:9000. DB::Exception: Distributed DDL task /clickhouse/cc/task_queue/ddl/query-0000000000 is not finished on 2 of 2 hosts (0 of them are currently executing the task, 0 are inactive). They are going to execute the query in background. Was waiting for 180.940991669 seconds, which is longer than distributed_ddl_task_timeout. (TIMEOUT_EXCEEDED)
Debug logs -
2025.11.07 03:40:28.872318 [ 750 ] {} <Debug> DDLWorker: Scheduling tasks
2025.11.07 03:40:28.872331 [ 753 ] {} <Debug> DDLWorker: Cleaning queue
2025.11.07 03:40:28.873266 [ 750 ] {} <Debug> DDLWorker: Will schedule 1 tasks starting from query-0000000000
2025.11.07 03:40:28.874895 [ 750 ] {} <Debug> DDLWorker: Will not execute task query-0000000000: There is no a local address in host list
2025.11.07 03:40:28.874912 [ 750 ] {} <Debug> DDLWorker: Waiting for queue updates
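The "There is no a local address in host list" skip can be illustrated with a simplified sketch (this is not ClickHouse's actual C++ implementation, just an assumed model of the check): the worker walks the task's host list from ZooKeeper, resolves each entry, and only executes the task if some entry resolves to one of the node's own addresses. Host entries in the queue are percent-encoded, so they must be decoded before resolution.

```python
import socket
from urllib.parse import unquote

def find_local_host(task_hosts, local_addresses):
    """Return the first task host that resolves to one of this node's
    addresses, or None. Roughly models the check DDLWorker performs
    before logging "There is no a local address in host list" and
    skipping the task. Entries in the ZooKeeper queue are
    percent-encoded, e.g. 'chi%2Dcc%2Dcdr%2D0%2D0:9000'."""
    for entry in task_hosts:
        host, _, _port = unquote(entry).rpartition(':')
        try:
            resolved = {info[4][0] for info in socket.getaddrinfo(host, None)}
        except socket.gaierror:
            continue  # unresolvable hostname: cannot be this node
        if resolved & local_addresses:
            return host
    return None
```

Under this model, the task is skipped on every host whenever the names the initiator wrote into the `hosts` field fail to resolve back to any local address on the executing nodes, which matches the behavior observed above.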
Zookeeper task queue details -
SELECT *
FROM system.zookeeper
WHERE path = '/clickhouse/cc/task_queue/ddl'
Query id: 751f0c29-d9c9-49e4-9fad-499dd629eadf
┌─name─────────────┬─value──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─path──────────────────────────┐
1. │ query-0000000000 │ version: 5 ↴│ /clickhouse/cc/task_queue/ddl │
│ │↳query: CREATE DATABASE testdb UUID \'524fd5ae-c174-4bd6-9908-33eb0a26d62c\' ON CLUSTER cdr ↴│ │
│ │↳hosts: ['chi%2Dcc%2Dcdr%2D0%2D0:9000','chi%2Dcc%2Dcdr%2D0%2D1:9000'] ↴│ │
│ │↳initiator: chi%2Dcc%2Dcdr%2D0%2D0%2D0%2Echi%2Dcc%2Dcdr%2D0%2D0%2Eclickhouse%2Esvc%2Ecluster%2Elocal:9000 ↴│ │
│ │↳settings: connect_timeout_with_failover_ms = 1000, load_balancing = 'nearest_hostname', distributed_aggregation_memory_efficient = true, allow_experimental_time_time64_type = true, do_not_merge_across_partitions_select_final = true, os_thread_priority = 2, log_queries = false, insert_deduplicate = true, final = false, prefer_localhost_replica = false, parallel_view_processing = true, date_time_output_format = 'iso'↴│ │
│ │↳tracing: 00000000-0000-0000-0000-000000000000 ↴│ │
│ │↳0 ↴│ │
│ │↳ ↴│ │
│ │↳0 ↴│ │
│ │↳initial_query_id: 23e7b9f6-9909-46fa-b385-41c473cf7c6c ↴│ │
└──────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴───────────────────────────────┘
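The `hosts` and `initiator` fields in the task node are percent-encoded; standard URL decoding (e.g. Python's `urllib.parse.unquote`) shows the hostnames the initiator recorded:

```python
from urllib.parse import unquote

# Values copied from the ZooKeeper task node above
hosts = ['chi%2Dcc%2Dcdr%2D0%2D0:9000', 'chi%2Dcc%2Dcdr%2D0%2D1:9000']
initiator = ('chi%2Dcc%2Dcdr%2D0%2D0%2D0%2Echi%2Dcc%2Dcdr%2D0%2D0'
             '%2Eclickhouse%2Esvc%2Ecluster%2Elocal:9000')

print([unquote(h) for h in hosts])
# ['chi-cc-cdr-0-0:9000', 'chi-cc-cdr-0-1:9000']
print(unquote(initiator))
# chi-cc-cdr-0-0-0.chi-cc-cdr-0-0.clickhouse.svc.cluster.local:9000
```

Note that the decoded `hosts` entries are short cluster names while the `initiator` is a full Kubernetes service FQDN; whether that difference matters to the new matching logic is not confirmed here, but it is the list the DDLWorker on each replica must match itself against.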
Have verified that all hostname FQDNs match the settings in the <remote_servers> config.
There has been no configuration change since this breakage.
The last known working version before this regression was 25.8.9.
Have confirmed the introduction of this regression by reverting the offending commit be89d268ad35c7c593a00c3489dbd72865af880a on a private branch off of v25.8.10.7-lts.
The breakage is also present on the master branch and in all tags between v25.8.10.7-lts and master.
There has been an attempt to fix the code via commit 741f47b; however, the issue still persists as described above.
Expected behavior
The database should be created right away with the ON CLUSTER directive, completing successfully without the DDL task query timing out.
Error message and/or stacktrace
Received exception from server (version 25.8.10):
Code: 159. DB::Exception: Received from localhost:9000. DB::Exception: Distributed DDL task /clickhouse/cc/task_queue/ddl/query-0000000000 is not finished on 2 of 2 hosts (0 of them are currently executing the task, 0 are inactive). They are going to execute the query in background. Was waiting for 180.940991669 seconds, which is longer than distributed_ddl_task_timeout. (TIMEOUT_EXCEEDED)
Since this regression has been carried forward into all tags starting from `v25.8.10.7-lts`, a critical feature remains broken for the foreseeable future.
Since the offending commit does not fix anything of significance and was intended as an optimization, the request is to either fix this properly or revert the commit altogether.