Skip to content

Fix hung check in stress test#53090

Merged
tavplubix merged 3 commits intomasterfrom
fix-stress-test
Aug 10, 2023
Merged

Fix hung check in stress test#53090
tavplubix merged 3 commits intomasterfrom
fix-stress-test

Conversation

@alexey-milovidov
Copy link
Copy Markdown
Member

@robot-ch-test-poll robot-ch-test-poll added the pr-not-for-changelog This PR should not be mentioned in the changelog label Aug 6, 2023
@robot-ch-test-poll
Copy link
Copy Markdown
Contributor

robot-ch-test-poll commented Aug 6, 2023

This is an automated comment for commit 47b5d13 with description of existing statuses. It's updated for the latest CI running
The full report is available here
The overall status of the commit is 🟡 pending

Check nameDescriptionStatus
CI runningA meta-check that indicates the running CI. Normally, it's in success or pending state. The failed status indicates some problems with the PR🟡 pending
Mergeable CheckChecks if all other necessary checks are successful🟢 success

@antonio2368
Copy link
Copy Markdown
Member

Isn't the failure you linked to because of

<Warning> test_3.mut (cd77dfda-f90b-4736-afb1-73f1cdb21fe5): Part all_0_0_0 exists in ZooKeeper and covered by another part in ZooKeeper (all_0_0_0_1), but doesn't exist on any disk. It may cause false-positive 'part is lost forever' messages

after which there is a call chassert(false) causing the server to terminate.

I saw a lot of failures like this recently in stress tests.
@tavplubix if you are interested

@tavplubix
Copy link
Copy Markdown
Member

That's weird. The part was removed from ZooKeeper and from disk:

2023.08.06 06:36:50.388558 [ 1827 ] {} <Trace> test_3.mut (cd77dfda-f90b-4736-afb1-73f1cdb21fe5): Found 1 old parts to remove. Parts: [all_0_0_0]
2023.08.06 06:36:50.393222 [ 1827 ] {} <Debug> test_3.mut (cd77dfda-f90b-4736-afb1-73f1cdb21fe5): Removing 1 old parts from ZooKeeper
2023.08.06 06:36:50.393991 [ 1827 ] {} <Debug> test_3.mut (cd77dfda-f90b-4736-afb1-73f1cdb21fe5): Removed 1 old parts from ZooKeeper. Removing them from filesystem.
2023.08.06 06:36:50.394050 [ 1827 ] {} <Debug> test_3.mut (cd77dfda-f90b-4736-afb1-73f1cdb21fe5): Removing 1 parts from filesystem (serially): Parts: [all_0_0_0]
2023.08.06 06:36:50.394222 [ 1827 ] {} <Trace> test_3.mut (cd77dfda-f90b-4736-afb1-73f1cdb21fe5): Part all_0_0_0 local references is zero, will check blobs can be removed in zookeeper
2023.08.06 06:36:50.394465 [ 1827 ] {} <Trace> test_3.mut (cd77dfda-f90b-4736-afb1-73f1cdb21fe5): Remove zookeeper lock /clickhouse/zero_copy/zero_copy_s3/cd77dfda-f90b-4736-afb1-73f1cdb21fe5/all_0_0_0/test_tpv_wetxcwavwlbhvvfdiskfgsgjdodys/1 for part all_0_0_0
2023.08.06 06:36:50.394933 [ 1827 ] {} <Trace> test_3.mut (cd77dfda-f90b-4736-afb1-73f1cdb21fe5): No more children left for for /clickhouse/zero_copy/zero_copy_s3/cd77dfda-f90b-4736-afb1-73f1cdb21fe5/all_0_0_0/test_tpv_wetxcwavwlbhvvfdiskfgsgjdodys, will try to remove the whole node
2023.08.06 06:36:50.395211 [ 1827 ] {} <Trace> test_3.mut (cd77dfda-f90b-4736-afb1-73f1cdb21fe5): Removed last parent zookeeper lock /clickhouse/zero_copy/zero_copy_s3/cd77dfda-f90b-4736-afb1-73f1cdb21fe5/all_0_0_0/test_tpv_wetxcwavwlbhvvfdiskfgsgjdodys for part all_0_0_0 with id test_tpv_wetxcwavwlbhvvfdiskfgsgjdodys
2023.08.06 06:36:50.395615 [ 1827 ] {} <Trace> test_3.mut (cd77dfda-f90b-4736-afb1-73f1cdb21fe5): Removed last parent zookeeper lock /clickhouse/zero_copy/zero_copy_s3/cd77dfda-f90b-4736-afb1-73f1cdb21fe5/all_0_0_0 for part all_0_0_0 (part is finally unlocked)
2023.08.06 06:36:50.414628 [ 1827 ] {} <Debug> test_3.mut (cd77dfda-f90b-4736-afb1-73f1cdb21fe5): Removing 1 parts from memory: Parts: [all_0_0_0]
2023.08.06 06:36:50.414650 [ 1827 ] {} <Debug> test_3.mut (cd77dfda-f90b-4736-afb1-73f1cdb21fe5): Removed 1 old parts
2023.08.06 06:36:50.415462 [ 1827 ] {} <Trace> test_3.mut (ReplicatedMergeTreeCleanupThread): Scheduling next cleanup after 60000ms (points: 0.25641027, interval: 62599ms, ratio: 0.025641028, points per minute: 0.24576458)

But after restarting the server we found it in ZooKeeper:

2023.08.06 06:49:43.003676 [ 139598 ] {} <Trace> test_3.mut (cd77dfda-f90b-4736-afb1-73f1cdb21fe5): Loading Active part all_0_0_0_1 from disk s3
2023.08.06 06:49:43.021791 [ 139598 ] {} <Trace> test_3.mut (cd77dfda-f90b-4736-afb1-73f1cdb21fe5): Finished loading Active part all_0_0_0_1 on disk s3
2023.08.06 06:49:44.249648 [ 139279 ] {} <Warning> test_3.mut (cd77dfda-f90b-4736-afb1-73f1cdb21fe5): Part all_0_0_0 exists in ZooKeeper and covered by another part in ZooKeeper (all_0_0_0_1), but doesn't exist on any disk. It may cause false-positive 'part is lost forever' messages

Will try to find which log message is lying...

@tavplubix
Copy link
Copy Markdown
Member

🤔

99:2023.08.06 06:49:38.743004 [ 139209 ] {} <Warning> KeeperLogStore: Cannot completely read changelog on path changelog_1_100000.bin.zstd, error: ZSTD stream decoding failed: error 'Unknown frame descriptor'; ZSTD version: 1.5.5: While reading from: /var/lib/clickhouse/coordination/logs/changelog_1_100000.bin.zstd
101:2023.08.06 06:49:38.743036 [ 139209 ] {} <Warning> KeeperLogStore: Removing all changelogs
102:2023.08.06 06:49:38.743060 [ 139209 ] {} <Warning> KeeperLogStore: Moving broken logs to detached/20230806T064938
103:2023.08.06 06:49:38.743139 [ 139209 ] {} <Warning> KeeperLogStore: Removing changelog changelog_1_100000.bin.zstd
104:2023.08.06 06:49:38.743197 [ 139209 ] {} <Warning> KeeperLogStore: Removing changelog changelog_100001_200000.bin.zstd

@alexey-milovidov
Copy link
Copy Markdown
Member Author

@antonio2368 I've noticed an exception about too much time spent by sleepEachRow function, which is new after #34909, which was merged just a week ago. This PR only addressed that problem.

@tavplubix
Copy link
Copy Markdown
Member

Stress tests were OK on the previous commit, failed integration tests are known to be flaky

@tavplubix tavplubix merged commit d0ee88f into master Aug 10, 2023
@tavplubix tavplubix deleted the fix-stress-test branch August 10, 2023 14:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-not-for-changelog This PR should not be mentioned in the changelog

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants