Fix flaky test_delayed_replica_failover#67541

Merged
alexey-milovidov merged 6 commits into master from fix-test_delayed_replica_failover on Aug 15, 2024

Conversation

@serxa
Member

@serxa serxa commented Jul 31, 2024

The ZooKeeper client now retries with backoff (after a recent commit). Because the test uses system.zookeeper to check that Keeper is no longer available, it always runs for ~10 minutes and sometimes hits the integration test timeout.
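For intuition, exponential-backoff delays add up quickly when every request must exhaust its retries before failing. This is a hypothetical sketch; the parameter names and values are illustrative and not the actual ZooKeeper client settings:

```python
def total_backoff_seconds(initial_ms=100, multiplier=2.0, max_ms=5000, retries=10):
    """Sum the delays an exponential-backoff policy sleeps through before giving up."""
    total, delay = 0.0, float(initial_ms)
    for _ in range(retries):
        total += delay
        delay = min(delay * multiplier, max_ms)
    return total / 1000.0

# Each query against an unavailable Keeper pays the full retry budget,
# so a test that issues many such queries easily stretches to minutes.
print(total_backoff_seconds())  # → 26.3
```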

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

This closes #54502.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/

CI Settings (Only check the boxes if you know what you are doing):

  • Allow: All Required Checks
  • Allow: Stateless tests
  • Allow: Stateful tests
  • Allow: Integration Tests
  • Allow: Performance tests
  • Allow: All Builds
  • Allow: batch 1, 2 for multi-batch jobs
  • Allow: batch 3, 4, 5, 6 for multi-batch jobs

  • Exclude: Style check
  • Exclude: Fast test
  • Exclude: All with ASAN
  • Exclude: All with TSAN, MSAN, UBSAN, Coverage
  • Exclude: All with aarch64, release, debug

  • Do not test
  • Woolen Wolfdog
  • Upload binaries for special builds
  • Disable merge-commit
  • Disable CI cache

@serxa serxa added the 🍃 green ci 🌿 Fixing flaky tests in CI label Jul 31, 2024
@evillique evillique self-assigned this Jul 31, 2024
@robot-ch-test-poll
Contributor

robot-ch-test-poll commented Jul 31, 2024

This is an automated comment for commit 8e6096d with a description of existing statuses. It is updated for the latest CI run.

✅ Click here to open a full report in a separate page

Successful checks
| Check name | Description | Status |
| --- | --- | --- |
| Builds | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| ClickBench | Runs [ClickBench](https://github.com/ClickHouse/ClickBench/) with instant-attach table | ✅ success |
| Compatibility check | Checks that the clickhouse binary runs on distributions with old libc versions. If it fails, ask a maintainer for help | ✅ success |
| Docker keeper image | The check to build and optionally push the mentioned image to Docker Hub | ✅ success |
| Docs check | Builds and tests the documentation | ✅ success |
| Fast test | Normally this is the first check that is run for a PR. It builds ClickHouse and runs most of the stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here | ✅ success |
| Flaky tests | Checks if newly added or modified tests are flaky by running them repeatedly, in parallel, with more randomization. Functional tests are run 100 times with address sanitizer and additional randomization of thread scheduling. Integration tests are run up to 10 times. If a new test failed at least once, or ran too long, this check will be red. We don't allow flaky tests, read the doc | ✅ success |
| Integration tests | The integration tests report. In parentheses the package type is given, and in square brackets the optional part/total tests | ✅ success |
| Style check | Runs a set of checks to keep the code style clean. If some of the tests failed, see the related log from the report | ✅ success |
| Unit tests | Runs the unit tests for different release types | ✅ success |
| Upgrade check | Runs stress tests on the server version from the last release and then tries to upgrade it to the version from the PR. It checks whether the new server can start up without any errors, crashes, or sanitizer asserts | ✅ success |

@alexey-milovidov
Member

@serxa it failed one out of five times.

@serxa
Member Author

serxa commented Aug 5, 2024

Okay, now it takes 12.67 seconds instead of 600. So this is a different problem now; I'll try to fix this one as well.

@serxa
Member Author

serxa commented Aug 13, 2024

Integration tests flaky check (asan) — Failed tests found: error: 0, passed: 4, fail: 1

We see that four runs were successful and the final, fifth run failed.

The way the flaky check works: it does not restart the server or wipe the data between runs. Instead, it simply runs the same test 5 times on the same server and data. It looks like not all tests are ready for this approach.
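The effect can be sketched in a few lines (a hypothetical harness; the real one lives in the CI scripts): a test that leaves state behind passes once and then fails on every repeat.

```python
def run_flaky_check(test_fn, runs=5):
    """Run the same test repeatedly on shared state, counting failures.
    Mirrors the flaky check: no server restart, no data wipe between runs."""
    failures = 0
    for _ in range(runs):
        try:
            test_fn()
        except AssertionError:
            failures += 1
    return failures

# A test that is not idempotent: leftover state from run 1 breaks runs 2..5.
state = {"rows": 0}
def leaky_test():
    state["rows"] += 1         # e.g. an INSERT without cleanup
    assert state["rows"] == 1  # only holds on a fresh run

print(run_flaky_check(leaky_test))  # → 4
```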

TL;DR, root cause: a part from the previous run was fetched (22:32:01) before we disconnected node_2_1 from node_2_2 and inserted the new data (22:32:02) that was actually supposed to be fetched.

Test logic:

# Hinder replication between replicas of the same shard, but leave the possibility of distributed connection.
pm.partition_instances(node_1_1, node_1_2, port=9009)
pm.partition_instances(node_2_1, node_2_2, port=9009)

node_1_2.query("INSERT INTO replicated VALUES ('2017-05-08', 1)")
node_2_2.query("INSERT INTO replicated VALUES ('2017-05-08', 2)")

time.sleep(1)  # accrue replica delay

assert node_1_1.query("SELECT sum(x) FROM replicated").strip() == "0"
assert node_1_2.query("SELECT sum(x) FROM replicated").strip() == "1"
assert node_2_1.query("SELECT sum(x) FROM replicated").strip() == "0"
assert node_2_2.query("SELECT sum(x) FROM replicated").strip() == "2"

logs for iptables:

2024-08-04 22:32:01 [ 568 ] DEBUG : [network] 25eb7e1a6e13aa52d27b26ca0b5ca4db844787a386fc45ff5bddad4cf7d1f487: ['iptables', '--wait', '-I', 'DOCKER-USER', '1', '-p', 'tcp', '-s', '172.16.1.5', '-d', '172.16.1.7', '--dport', '9009', '-j', 'DROP'] (0):  (network.py:308, _exec_run)
2024-08-04 22:32:01 [ 568 ] DEBUG : [network] 25eb7e1a6e13aa52d27b26ca0b5ca4db844787a386fc45ff5bddad4cf7d1f487: ['iptables', '--wait', '-I', 'DOCKER-USER', '1', '-p', 'tcp', '-s', '172.16.1.7', '-d', '172.16.1.5', '--dport', '9009', '-j', 'DROP'] (0):  (network.py:308, _exec_run)
2024-08-04 22:32:01 [ 568 ] DEBUG : [network] 25eb7e1a6e13aa52d27b26ca0b5ca4db844787a386fc45ff5bddad4cf7d1f487: ['iptables', '--wait', '-I', 'DOCKER-USER', '1', '-p', 'tcp', '-s', '172.16.1.8', '-d', '172.16.1.6', '--dport', '9009', '-j', 'DROP'] (0):  (network.py:308, _exec_run)
2024-08-04 22:32:01 [ 568 ] DEBUG : [network] 25eb7e1a6e13aa52d27b26ca0b5ca4db844787a386fc45ff5bddad4cf7d1f487: ['iptables', '--wait', '-I', 'DOCKER-USER', '1', '-p', 'tcp', '-s', '172.16.1.6', '-d', '172.16.1.8', '--dport', '9009', '-j', 'DROP'] (0):  (network.py:308, _exec_run)

data insertion:

2024-08-04 22:32:02 [ 568 ] DEBUG : Executing query INSERT INTO replicated VALUES ('2017-05-08', 2) on node_2_2 (cluster.py:3517, query)

node_2_1 logs showing a part being fetched that should not have been fetched: the node_2_1 <-> node_2_2 connection was supposed to be broken, and the data had not even been inserted yet, so it apparently read data left over from a previous run:

2024.08.04 22:32:01.246382 [ 597 ] {} <Information> default.replicated (ac6c2fad-3433-43d7-bf45-acafb9c9b7a6): Code: 234. DB::Exception: No active replica has part 201705_0_0_0 or covering part (cannot execute queue-0000000000: GET_PART with virtual parts [201705_0_0_0]). (NO_REPLICA_HAS_PART)
2024.08.04 22:32:01.246620 [ 455 ] {} <Trace> default.replicated (ReplicatedMergeTreePartCheckThread): Part 201705_0_0_0 in zookeeper: false, locally: false
2024.08.04 22:32:01.246685 [ 455 ] {} <Information> default.replicated (ReplicatedMergeTreePartCheckThread): Checking if anyone has a part 201705_0_0_0 or covering part.
2024.08.04 22:32:01.247018 [ 455 ] {} <Information> default.replicated (ReplicatedMergeTreePartCheckThread): Found the missing part 201705_0_0_0 at 201705_0_0_0 on node_2_2
2024.08.04 22:32:01.443363 [ 613 ] {} <Debug> default.replicated (ac6c2fad-3433-43d7-bf45-acafb9c9b7a6): Fetching part 201705_0_0_0 from default:/clickhouse/tables/2/replicated/replicas/node_2_2
2024.08.04 22:32:01.445225 [ 613 ] {} <Test> default.replicated (ac6c2fad-3433-43d7-bf45-acafb9c9b7a6) (Fetcher): Disk for fetch is not provided, reserving space using storage balanced reservation
2024.08.04 22:32:01.445261 [ 613 ] {} <Test> default.replicated (ac6c2fad-3433-43d7-bf45-acafb9c9b7a6) (Fetcher): Disk for fetch is not provided, reserving space using TTL rules
2024.08.04 22:32:01.445309 [ 613 ] {} <Trace> default.replicated (ac6c2fad-3433-43d7-bf45-acafb9c9b7a6): Trying to reserve 1.00 MiB using storage policy from min volume index 0
2024.08.04 22:32:01.445379 [ 613 ] {} <Trace> DiskLocal: Reserved 1.00 MiB on local disk `default`, having unreserved 203.51 GiB.
2024.08.04 22:32:01.445410 [ 613 ] {} <Trace> default.replicated (ac6c2fad-3433-43d7-bf45-acafb9c9b7a6) (Fetcher): Disk for fetch is not provided, getting disk from reservation default with type 'local'
2024.08.04 22:32:01.445461 [ 613 ] {} <Debug> default.replicated (ac6c2fad-3433-43d7-bf45-acafb9c9b7a6) (Fetcher): Downloading part 201705_0_0_0 onto disk default.
2024.08.04 22:32:01.445554 [ 613 ] {} <Debug> default.replicated (ac6c2fad-3433-43d7-bf45-acafb9c9b7a6) (Fetcher): Downloading files 11
2024.08.04 22:32:01.448969 [ 613 ] {} <Debug> default.replicated (ac6c2fad-3433-43d7-bf45-acafb9c9b7a6) (Fetcher): Download of part 201705_0_0_0 onto disk default finished.
2024.08.04 22:32:01.449130 [ 613 ] {} <Trace> default.replicated (ac6c2fad-3433-43d7-bf45-acafb9c9b7a6): Renaming temporary part tmp-fetch_201705_0_0_0 to 201705_0_0_0 with tid (1, 1, 00000000-0000-0000-0000-000000000000).
2024.08.04 22:32:01.449237 [ 613 ] {} <Test> default.replicated (ac6c2fad-3433-43d7-bf45-acafb9c9b7a6): preparePartForCommit: inserting 201705_0_0_0 (state PreActive) into data_parts_indexes
2024.08.04 22:32:01.449268 [ 613 ] {} <Debug> default.replicated (ac6c2fad-3433-43d7-bf45-acafb9c9b7a6): Committing part 201705_0_0_0 to zookeeper
2024.08.04 22:32:01.449850 [ 665 ] {} <Trace> virtual void Coordination::ZooKeeperCreateRequest::writeImpl(WriteBuffer &) const: Creating part at path /clickhouse/tables/2/replicated/replicas/node_2_1/parts/201705_0_0_0
2024.08.04 22:32:01.451558 [ 613 ] {} <Debug> default.replicated (ac6c2fad-3433-43d7-bf45-acafb9c9b7a6): Part 201705_0_0_0 committed to zookeeper
2024.08.04 22:32:01.451664 [ 613 ] {} <Debug> default.replicated (ac6c2fad-3433-43d7-bf45-acafb9c9b7a6): Fetched part 201705_0_0_0 from default:/clickhouse/tables/2/replicated/replicas/node_2_2
2024.08.04 22:32:01.451737 [ 485 ] {} <Test> default.replicated (ac6c2fad-3433-43d7-bf45-acafb9c9b7a6) (MergerMutator): Nothing to merge in partition 201705: Did not find any parts to merge (with usual merge selectors)
2024.08.04 22:32:01.451749 [ 613 ] {} <Test> default.replicated (ReplicatedMergeTreeQueue): Removing successful entry queue-0000000000 from queue with type GET_PART with virtual parts [201705_0_0_0]
2024.08.04 22:32:01.451796 [ 613 ] {} <Test> default.replicated (ReplicatedMergeTreeQueue): Adding parts [201705_0_0_0] to current parts
2024.08.04 22:32:01.451800 [ 485 ] {} <Trace> default.replicated (ac6c2fad-3433-43d7-bf45-acafb9c9b7a6) (MergerMutator): Checked 1 partitions, found 0 partitions with parts that may be merged: [] (max_total_size_to_merge=109256116224, merge_with_ttl_allowed=true)
2024.08.04 22:32:01.451835 [ 613 ] {} <Test> default.replicated (ReplicatedMergeTreeQueue): Removing part 201705_0_0_0 from mutations (remove_part: false, remove_covered_parts: true)
2024.08.04 22:32:01.451873 [ 485 ] {} <Trace> default.replicated (ac6c2fad-3433-43d7-bf45-acafb9c9b7a6): Scheduling next merge selecting task after 34568ms, current attempt status: CannotSelect
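The race is visible directly in the quoted timestamps: parsing them confirms the fetch on node_2_1 completed before the INSERT on node_2_2 even started (timestamps copied from the logs above):

```python
from datetime import datetime

# "Fetched part 201705_0_0_0" on node_2_1 (server log above)
fetched = datetime.strptime("2024-08-04 22:32:01.451664", "%Y-%m-%d %H:%M:%S.%f")
# "Executing query INSERT INTO replicated ..." on node_2_2 (test log above)
inserted = datetime.strptime("2024-08-04 22:32:02", "%Y-%m-%d %H:%M:%S")

assert fetched < inserted  # the fetched part cannot be this run's data
print((inserted - fetched).total_seconds())  # → 0.548336
```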

@alexey-milovidov alexey-milovidov added this pull request to the merge queue Aug 15, 2024
Merged via the queue into master with commit 3cc2f01 Aug 15, 2024
@alexey-milovidov alexey-milovidov deleted the fix-test_delayed_replica_failover branch August 15, 2024 01:55
@robot-clickhouse-ci-2 robot-clickhouse-ci-2 added the pr-synced-to-cloud The PR is synced to the cloud repo label Aug 15, 2024

Labels

🍃 green ci 🌿 (Fixing flaky tests in CI), pr-ci, pr-synced-to-cloud (The PR is synced to the cloud repo)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky test_delayed_replica_failover

7 participants