Fix flaky test test_backup_restore_on_cluster#47577
Conversation
assert_eq_with_retry(
    node0,
    "SELECT count() from clusterAllReplicas('cluster', system.tables) where table='tbl' and database='default'",
    TSV([num_nodes]),
)
for i in range(num_nodes):
    nodes[i].query(f"INSERT INTO tbl VALUES ({i})")
CREATE ... ON CLUSTER is synchronous by default: it waits until the query has been executed by all replicas, or throws an exception. So the table must already exist at that point; we don't need extra checks.
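Since ON CLUSTER DDL blocks until every replica has executed it (or raises), the retry check can simply be dropped. A minimal sketch of the simplified test body, assuming hypothetical `nodes`/`num_nodes` fixtures that mirror the ones in the test:

```python
# Hypothetical sketch: node objects expose .query(), as in the test harness.
def create_and_fill_table(nodes, num_nodes):
    # ON CLUSTER DDL is synchronous by default: it either succeeds on all
    # replicas or raises, so no assert_eq_with_retry is needed afterwards.
    nodes[0].query(
        "CREATE TABLE tbl ON CLUSTER 'cluster' (x UInt64) "
        "ENGINE=ReplicatedMergeTree('/clickhouse/tables/tbl/', '{replica}') "
        "ORDER BY x"
    )
    # The table is guaranteed to exist on every replica at this point.
    for i in range(num_nodes):
        nodes[i].query(f"INSERT INTO tbl VALUES ({i})")
```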
There's something interesting in the text log of node4:
2023.03.13 11:13:09.843597 [ 266 ] {} <Trace> DDLWorker: Too early to clean queue, will do it later.
2023.03.13 11:13:09.843607 [ 275 ] {} <Debug> DDLWorker: Scheduling tasks
2023.03.13 11:13:09.844023 [ 275 ] {} <Trace> DDLWorker: scheduleTasks: initialized=true, size_before_filtering=2, queue_size=2, entries=query-0000000000..query-0000000001, first_failed_task_name=none, current_tasks_size=0, last_current_task=none, last_skipped_entry_name=query-0000000000
2023.03.13 11:13:09.844073 [ 275 ] {} <Debug> DDLWorker: Will schedule 1 tasks starting from query-0000000001
2023.03.13 11:13:09.844108 [ 275 ] {} <Trace> DDLWorker: Checking task query-0000000001
2023.03.13 11:13:09.847102 [ 275 ] {} <Debug> DDLWorker: Will not execute task query-0000000001: There is no a local address in host list
2023.03.13 11:13:09.847134 [ 275 ] {} <Debug> DDLWorker: Waiting for queue updates
2023.03.13 11:13:10.000231 [ 261 ] {} <Trace> AsynchronousMetrics: MemoryTracking: was 760.47 MiB, peak 762.47 MiB, free memory in arenas 0.00 B, will set to 780.64 MiB (RSS), difference: 20.16 MiB
2023.03.13 11:13:10.163435 [ 266 ] {} <Trace> DDLWorker: Too early to clean queue, will do it later.
2023.03.13 11:13:10.163435 [ 275 ] {} <Debug> DDLWorker: Scheduling tasks
2023.03.13 11:13:10.165800 [ 275 ] {} <Trace> DDLWorker: scheduleTasks: initialized=true, size_before_filtering=3, queue_size=3, entries=query-0000000000..query-0000000002, first_failed_task_name=none, current_tasks_size=0, last_current_task=none, last_skipped_entry_name=query-0000000001
2023.03.13 11:13:10.165837 [ 275 ] {} <Debug> DDLWorker: Will schedule 1 tasks starting from query-0000000002
2023.03.13 11:13:10.165869 [ 275 ] {} <Trace> DDLWorker: Checking task query-0000000002
2023.03.13 11:13:10.166897 [ 275 ] {} <Debug> DDLWorker: Will not execute task query-0000000002: There is no a local address in host list
2023.03.13 11:13:10.166945 [ 275 ] {} <Debug> DDLWorker: Waiting for queue updates
2023.03.13 11:13:10.470069 [ 266 ] {} <Trace> DDLWorker: Too early to clean queue, will do it later.
2023.03.13 11:13:10.470074 [ 275 ] {} <Debug> DDLWorker: Scheduling tasks
2023.03.13 11:13:10.472361 [ 275 ] {} <Trace> DDLWorker: scheduleTasks: initialized=true, size_before_filtering=4, queue_size=4, entries=query-0000000000..query-0000000003, first_failed_task_name=none, current_tasks_size=0, last_current_task=none, last_skipped_entry_name=query-0000000002
2023.03.13 11:13:10.472405 [ 275 ] {} <Debug> DDLWorker: Will schedule 1 tasks starting from query-0000000003
2023.03.13 11:13:10.472431 [ 275 ] {} <Trace> DDLWorker: Checking task query-0000000003
2023.03.13 11:13:10.473239 [ 275 ] {} <Debug> DDLWorker: Will not execute task query-0000000003: There is no a local address in host list
2023.03.13 11:13:10.473264 [ 275 ] {} <Debug> DDLWorker: Waiting for queue updates
The most interesting lines are "Will not execute task ...: There is no a local address in host list". So something is wrong with the cluster config and/or the hostnames.
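A replica only picks up a DDL task if its own hostname appears in the task's host list, which is why node4 keeps skipping tasks. A rough, hypothetical illustration of that membership check (not ClickHouse's actual implementation), run against a remote_servers config like the one generated for the test:

```python
import xml.etree.ElementTree as ET

def hosts_in_cluster(config_xml: str, cluster: str) -> list:
    """Collect every <host> listed under the given cluster."""
    root = ET.fromstring(config_xml)
    path = f"./remote_servers/{cluster}/shard/replica/host"
    return [h.text for h in root.findall(path)]

def would_execute(config_xml: str, cluster: str, local_host: str) -> bool:
    # Mirrors the idea behind "There is no a local address in host list":
    # the replica runs a task only if its hostname is in the cluster config.
    return local_host in hosts_in_cluster(config_xml, cluster)
```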
For some reason test_backup_restore_on_cluster/_instances_concurrency_0/node4/configs/config.d/cluster_for_concurrency_test.xml looks like this:
<clickhouse>
<remote_servers>
<cluster>
<shard>
<replica>
<host>node0</host>
<port>9000</port>
</replica>
<replica>
<host>node1</host>
<port>9000</port>
</replica>
<replica>
<host>node2</host>
<port>9000</port>
</replica>
<replica>
<host>node3</host>
<port>9000</port>
</replica>
</shard>
</cluster>
</remote_servers>
</clickhouse>
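That config lists only node0..node3, so node4 never finds itself in the host list. One plausible way such a config ends up that way (a hypothetical generator, not the actual test helper) is a loop over `range(num_nodes)` with num_nodes=4:

```python
def generate_cluster_config(num_nodes: int) -> str:
    # With num_nodes=4 this emits node0..node3 only -- a leftover node4
    # instance from another test will not be in the host list.
    replicas = "\n".join(
        "            <replica>\n"
        f"                <host>node{i}</host>\n"
        "                <port>9000</port>\n"
        "            </replica>"
        for i in range(num_nodes)
    )
    return (
        "<clickhouse>\n"
        "    <remote_servers>\n"
        "        <cluster>\n"
        "            <shard>\n"
        f"{replicas}\n"
        "            </shard>\n"
        "        </cluster>\n"
        "    </remote_servers>\n"
        "</clickhouse>"
    )
```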
And that's because num_nodes in test_disallow_concurrency.py was changed to 4 in #47216. So the flaky check failure was actually related to the changes.
It seems to be caused by that change:
https://github.com/ClickHouse/ClickHouse/pull/47216/files#diff-e0f79cbbfa03684382a2be3d7da44cc585c6c548bc634067b411af295cfc9854R12
Both tests test_disallow_concurrency.py and test_concurrency.py generate cluster config in the same temporary file _gen/cluster_for_concurrency_test.xml.
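Because both tests write to the same path, whichever test generates its config last determines what every instance sees. One way to avoid such a collision (a sketch of the idea, not the actual fix in this PR) is to derive the generated file name from the test:

```python
import os

def cluster_config_path(gen_dir: str, test_name: str) -> str:
    # Deriving the file name from the test keeps two tests from clobbering
    # each other's generated cluster config in the shared _gen directory.
    return os.path.join(gen_dir, f"cluster_for_{test_name}.xml")
```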
Changelog category (leave one):
Fix error
Fixed the error "Table default.tbl doesn't exist" which could sometimes occur in this test.