
Fix flaky test test_backup_restore_on_cluster #47577

Closed
vitlibar wants to merge 1 commit into master from vitlibar-fix-flaky-test_backup_restore_on_cluster

Conversation

@vitlibar
Member

Changelog category (leave one):

  • Not for changelog (changelog entry is not required)

Fixes the error "Table default.tbl doesn't exist", which could sometimes occur in this test.

@robot-ch-test-poll4 robot-ch-test-poll4 added the pr-not-for-changelog This PR should not be mentioned in the changelog label Mar 14, 2023
@tavplubix tavplubix self-assigned this Mar 14, 2023
Comment on lines +84 to 90
assert_eq_with_retry(
node0,
"SELECT count() from clusterAllReplicas('cluster', system.tables) where table='tbl' and database='default'",
TSV([num_nodes]),
)
for i in range(num_nodes):
nodes[i].query(f"INSERT INTO tbl VALUES ({i})")
Member

@tavplubix tavplubix Mar 14, 2023


CREATE ... ON CLUSTER is synchronous by default: it waits for the query to be executed by all replicas or throws an exception. So the table must already exist at that point, and we don't need extra checks.
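To make the argument concrete, here is a hedged, self-contained sketch (stub objects, not the real integration-test harness; all names below are illustrative) of why a retry check after the DDL is redundant when ON CLUSTER DDL is synchronous:

```python
# Stub model of synchronous CREATE ... ON CLUSTER semantics: the query
# returns only after every replica has executed the DDL (or it throws on
# timeout), so the table exists everywhere before the next statement runs.

class StubNode:
    """Minimal stand-in for the integration-test node object."""

    def __init__(self, name, tables):
        self.name = name
        self.tables = tables  # shared dict: table name -> set of node names

    def query(self, sql):
        if sql.startswith("CREATE TABLE") and "ON CLUSTER" in sql:
            # Synchronous ON CLUSTER: applied on every replica before returning.
            for i in range(4):
                self.tables.setdefault("tbl", set()).add(f"node{i}")
        elif sql.startswith("INSERT INTO tbl"):
            # Would fail with "Table default.tbl doesn't exist" otherwise.
            assert self.name in self.tables.get("tbl", set())

tables = {}
nodes = [StubNode(f"node{i}", tables) for i in range(4)]
nodes[0].query("CREATE TABLE tbl ON CLUSTER 'cluster' (x Int32) ORDER BY x")
for i, node in enumerate(nodes):
    node.query(f"INSERT INTO tbl VALUES ({i})")  # no assert_eq_with_retry needed

print(sorted(tables["tbl"]))  # ['node0', 'node1', 'node2', 'node3']
```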

Member

There's something interesting in the text log of node4:

2023.03.13 11:13:09.843597 [ 266 ] {} <Trace> DDLWorker: Too early to clean queue, will do it later.
2023.03.13 11:13:09.843607 [ 275 ] {} <Debug> DDLWorker: Scheduling tasks
2023.03.13 11:13:09.844023 [ 275 ] {} <Trace> DDLWorker: scheduleTasks: initialized=true, size_before_filtering=2, queue_size=2, entries=query-0000000000..query-0000000001, first_failed_task_name=none, current_tasks_size=0, last_current_task=none, last_skipped_entry_name=query-0000000000
2023.03.13 11:13:09.844073 [ 275 ] {} <Debug> DDLWorker: Will schedule 1 tasks starting from query-0000000001
2023.03.13 11:13:09.844108 [ 275 ] {} <Trace> DDLWorker: Checking task query-0000000001
2023.03.13 11:13:09.847102 [ 275 ] {} <Debug> DDLWorker: Will not execute task query-0000000001: There is no a local address in host list
2023.03.13 11:13:09.847134 [ 275 ] {} <Debug> DDLWorker: Waiting for queue updates
2023.03.13 11:13:10.000231 [ 261 ] {} <Trace> AsynchronousMetrics: MemoryTracking: was 760.47 MiB, peak 762.47 MiB, free memory in arenas 0.00 B, will set to 780.64 MiB (RSS), difference: 20.16 MiB
2023.03.13 11:13:10.163435 [ 266 ] {} <Trace> DDLWorker: Too early to clean queue, will do it later.
2023.03.13 11:13:10.163435 [ 275 ] {} <Debug> DDLWorker: Scheduling tasks
2023.03.13 11:13:10.165800 [ 275 ] {} <Trace> DDLWorker: scheduleTasks: initialized=true, size_before_filtering=3, queue_size=3, entries=query-0000000000..query-0000000002, first_failed_task_name=none, current_tasks_size=0, last_current_task=none, last_skipped_entry_name=query-0000000001
2023.03.13 11:13:10.165837 [ 275 ] {} <Debug> DDLWorker: Will schedule 1 tasks starting from query-0000000002
2023.03.13 11:13:10.165869 [ 275 ] {} <Trace> DDLWorker: Checking task query-0000000002
2023.03.13 11:13:10.166897 [ 275 ] {} <Debug> DDLWorker: Will not execute task query-0000000002: There is no a local address in host list
2023.03.13 11:13:10.166945 [ 275 ] {} <Debug> DDLWorker: Waiting for queue updates
2023.03.13 11:13:10.470069 [ 266 ] {} <Trace> DDLWorker: Too early to clean queue, will do it later.
2023.03.13 11:13:10.470074 [ 275 ] {} <Debug> DDLWorker: Scheduling tasks
2023.03.13 11:13:10.472361 [ 275 ] {} <Trace> DDLWorker: scheduleTasks: initialized=true, size_before_filtering=4, queue_size=4, entries=query-0000000000..query-0000000003, first_failed_task_name=none, current_tasks_size=0, last_current_task=none, last_skipped_entry_name=query-0000000002
2023.03.13 11:13:10.472405 [ 275 ] {} <Debug> DDLWorker: Will schedule 1 tasks starting from query-0000000003
2023.03.13 11:13:10.472431 [ 275 ] {} <Trace> DDLWorker: Checking task query-0000000003
2023.03.13 11:13:10.473239 [ 275 ] {} <Debug> DDLWorker: Will not execute task query-0000000003: There is no a local address in host list
2023.03.13 11:13:10.473264 [ 275 ] {} <Debug> DDLWorker: Waiting for queue updates

The most interesting lines are "Will not execute task ...: There is no a local address in host list". So something is wrong with the cluster config and/or the hostnames.
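For intuition, the skip decision can be sketched like this (a hypothetical Python model of the DDLWorker check, not the actual C++ code):

```python
def should_execute(task_hosts, local_addresses):
    # DDLWorker runs a queued task only if one of the task's hosts matches
    # a local address; otherwise it logs
    # "Will not execute task ...: There is no a local address in host list".
    return any(h in local_addresses for h in task_hosts)

# The hosts recorded in a DDL queue entry come from the cluster config,
# which here lists only node0..node3 -- so node4 never matches:
task_hosts = ["node0:9000", "node1:9000", "node2:9000", "node3:9000"]
print(should_execute(task_hosts, {"node4:9000"}))  # False -> task skipped
print(should_execute(task_hosts, {"node0:9000"}))  # True  -> task executed
```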

Member

For some reason test_backup_restore_on_cluster/_instances_concurrency_0/node4/configs/config.d/cluster_for_concurrency_test.xml looks like this:


<clickhouse>
    <remote_servers>
        <cluster>
            <shard>
                <replica>
                    <host>node0</host>
                    <port>9000</port>
                </replica>
                <replica>
                    <host>node1</host>
                    <port>9000</port>
                </replica>
                <replica>
                    <host>node2</host>
                    <port>9000</port>
                </replica>
                <replica>
                    <host>node3</host>
                    <port>9000</port>
                </replica>
            </shard>
        </cluster>
    </remote_servers>
</clickhouse>
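A quick way to confirm the problem from the file above (reproducing the XML inline; standard library only):

```python
import xml.etree.ElementTree as ET

# node4's generated cluster config, as shown above: it lists only node0..node3.
config = """\
<clickhouse>
    <remote_servers>
        <cluster>
            <shard>
                <replica><host>node0</host><port>9000</port></replica>
                <replica><host>node1</host><port>9000</port></replica>
                <replica><host>node2</host><port>9000</port></replica>
                <replica><host>node3</host><port>9000</port></replica>
            </shard>
        </cluster>
    </remote_servers>
</clickhouse>
"""

hosts = [h.text for h in ET.fromstring(config).iter("host")]
print(hosts)              # ['node0', 'node1', 'node2', 'node3']
print("node4" in hosts)   # False: node4 is missing from its own cluster config,
                          # matching "There is no a local address in host list"
```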

Member

And that's because num_nodes in test_disallow_concurrency.py was changed to 4 in #47216. So the flaky check failure was actually related to the changes.

Member Author

@vitlibar vitlibar Mar 14, 2023

It seems it's because of this change:
https://github.com/ClickHouse/ClickHouse/pull/47216/files#diff-e0f79cbbfa03684382a2be3d7da44cc585c6c548bc634067b411af295cfc9854R12

Both tests, test_disallow_concurrency.py and test_concurrency.py, generate their cluster config in the same temporary file _gen/cluster_for_concurrency_test.xml.
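A hedged sketch of the collision (file layout and generator heavily simplified, and assuming test_concurrency.py uses 5 nodes while test_disallow_concurrency.py uses 4 after #47216): whichever test writes the shared file last wins, so a 5-node test can end up with a 4-node cluster definition:

```python
import os
import tempfile

# Both test modules write their cluster definition to the same relative path.
GEN_PATH = "_gen/cluster_for_concurrency_test.xml"

def write_cluster_config(workdir, num_nodes):
    # Simplified stand-in for the config generation in both test modules.
    replicas = "".join(
        f"<replica><host>node{i}</host><port>9000</port></replica>"
        for i in range(num_nodes)
    )
    path = os.path.join(workdir, GEN_PATH)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write(
            f"<clickhouse><remote_servers><cluster><shard>{replicas}"
            "</shard></cluster></remote_servers></clickhouse>"
        )

with tempfile.TemporaryDirectory() as d:
    write_cluster_config(d, 5)  # test_concurrency.py: node0..node4
    write_cluster_config(d, 4)  # test_disallow_concurrency.py overwrites with 4
    with open(os.path.join(d, GEN_PATH)) as f:
        content = f.read()

print("node4" in content)  # False -- node4 silently dropped from the shared file
```

Generating per-test filenames (or per-test temp directories) would avoid the overwrite.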

@tavplubix tavplubix marked this pull request as draft March 14, 2023 15:51
@vitlibar vitlibar closed this Mar 14, 2023
@vitlibar
Copy link
Copy Markdown
Member Author

vitlibar commented Mar 14, 2023

And that's because num_nodes in test_disallow_concurrency.py was changed to 4 in #47216. So the flaky check failure was actually related to the changes.

Ok, you're right.

I closed this PR because the fix should be around #47216, not here.

@vitlibar vitlibar deleted the vitlibar-fix-flaky-test_backup_restore_on_cluster branch March 14, 2023 16:13

Labels

pr-not-for-changelog This PR should not be mentioned in the changelog


3 participants