Fix flaky test_delayed_replica_failover #67541
This is an automated comment for commit 8e6096d with a description of the existing statuses. It is updated for the latest CI run. ✅ Successful checks
@serxa it failed one out of five times.
Okay, now it takes 12.67 s instead of 600 s, so this is a different problem. I'll try to fix this one too.
Four runs succeed and the fifth one fails. The flaky check does not restart the server or wipe the data between runs: it simply runs the same test five times on the same server and data. Not every test is ready for that approach.

TL;DR, root cause: a part from the previous run was fetched (22:32:01) before we disconnected node_2_1 from node_2_2 and inserted the new data (22:32:02).

Test logic:

```python
# Excerpt from the test: pm, node_1_1 ... node_2_2 and time are set up elsewhere in the test module.
# Hinder replication between replicas of the same shard, but leave the possibility of distributed connection.
# 9009 is the interserver HTTP port that replicas use to fetch parts.
pm.partition_instances(node_1_1, node_1_2, port=9009)
pm.partition_instances(node_2_1, node_2_2, port=9009)

node_1_2.query("INSERT INTO replicated VALUES ('2017-05-08', 1)")
node_2_2.query("INSERT INTO replicated VALUES ('2017-05-08', 2)")

time.sleep(1)  # accrue replica delay

assert node_1_1.query("SELECT sum(x) FROM replicated").strip() == "0"
assert node_1_2.query("SELECT sum(x) FROM replicated").strip() == "1"
assert node_2_1.query("SELECT sum(x) FROM replicated").strip() == "0"
assert node_2_2.query("SELECT sum(x) FROM replicated").strip() == "2"
```

The logs cover iptables, the data insertion, and node_2_1 fetching a part that it should not have fetched: the connection between node_2_1 and node_2_2 should already be broken, and the data had not even been inserted yet, so it looks like node_2_1 reads a part from a previous run.
Since a previous commit, the ZK client retries with backoff. Because the test queries `system.zookeeper` to check that Keeper is no longer available, it now always runs for ~10 minutes and sometimes hits the integration test timeout.
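A rough way to see why a check that waits for a Keeper request to fail gets slow is to add up the retry schedule. The sketch below assumes capped exponential backoff and a fixed per-attempt timeout; the function `total_retry_time_s` and every parameter value are hypothetical, not the ZK client's actual retry settings.

```python
def total_retry_time_s(attempts, attempt_timeout_s, initial_backoff_s, max_backoff_s):
    """Worst-case wall time of one request that fails on every attempt.

    Hypothetical model: each attempt waits attempt_timeout_s before failing,
    and between attempts the client sleeps with capped exponential backoff
    (initial, 2x, 4x, ..., never more than max_backoff_s).
    """
    total = 0.0
    backoff = initial_backoff_s
    for attempt in range(attempts):
        total += attempt_timeout_s  # time spent on the failed attempt itself
        if attempt < attempts - 1:  # no sleep after the last attempt
            total += backoff
            backoff = min(backoff * 2, max_backoff_s)
    return total
```

For example, with a hypothetical 20 attempts, no per-attempt timeout, a 0.1 s initial backoff and a 10 s cap, `total_retry_time_s(20, 0.0, 0.1, 10.0)` comes to a little over two minutes for a single failing request, so a test that repeats such a check can quickly approach the integration test timeout.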
This closes #54502.