Skip to content

Break some tests#27529

Merged
tavplubix merged 25 commits intomasterfrom
break_some_tests
Aug 26, 2021
Merged

Break some tests#27529
tavplubix merged 25 commits intomasterfrom
break_some_tests

Conversation

@tavplubix
Copy link
Copy Markdown
Member

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • Not for changelog (changelog entry is not required)

Detailed description / Documentation draft:
Check replication consistency after 993-like tests

@robot-clickhouse robot-clickhouse added the pr-not-for-changelog This PR should not be mentioned in the changelog label Aug 10, 2021
@tavplubix
Copy link
Copy Markdown
Member Author

@Mergifyio update

@mergify
Copy link
Copy Markdown
Contributor

mergify bot commented Aug 10, 2021

Command update: success

Branch has been successfully updated

@tavplubix
Copy link
Copy Markdown
Member Author

Wow, no related failures! But we should backport all fixes before merging this. And this PR will probably make tests more flaky

@tavplubix tavplubix marked this pull request as ready for review August 18, 2021 09:33
@tavplubix
Copy link
Copy Markdown
Member Author

@Mergifyio update

@mergify
Copy link
Copy Markdown
Contributor

mergify bot commented Aug 18, 2021

Command update: success

Branch has been successfully updated

@tavplubix
Copy link
Copy Markdown
Member Author

@Mergifyio update

@mergify
Copy link
Copy Markdown
Contributor

mergify bot commented Aug 18, 2021

Command update: success

Branch has been successfully updated

@tavplubix tavplubix requested a review from alesapin August 20, 2021 08:39
@tavplubix
Copy link
Copy Markdown
Member Author

Stress test (address) — Hung check failed - #27931

@tavplubix
Copy link
Copy Markdown
Member Author

@Mergifyio update

@mergify
Copy link
Copy Markdown
Contributor

mergify bot commented Aug 20, 2021

Command update: success

Branch has been successfully updated

@tavplubix
Copy link
Copy Markdown
Member Author

Functional stateless tests (release) - Code: 415. DB::Exception: All replicas are lost. (ALL_REPLICAS_LOST)

@alesapin alesapin self-assigned this Aug 24, 2021
@tavplubix
Copy link
Copy Markdown
Member Author

No related failures

@tavplubix
Copy link
Copy Markdown
Member Author

@Mergifyio update

@mergify
Copy link
Copy Markdown
Contributor

mergify bot commented Aug 25, 2021

Command update: success

Branch has been successfully updated

@tavplubix
Copy link
Copy Markdown
Member Author

Failures are irrelevant.
Stress test (memory) — Hung check failed - looks like deadlock in ZooKeeper::finalize
Let's run tests once again before merging.

@tavplubix
Copy link
Copy Markdown
Member Author

@Mergifyio update

@mergify
Copy link
Copy Markdown
Contributor

mergify bot commented Aug 25, 2021

Command update: success

Branch has been successfully updated

@tavplubix
Copy link
Copy Markdown
Member Author

Stress test (memory) — Hung check failed - looks like deadlock in ZooKeeper::finalize

At first glance, it really looks like deadlock in ZooKeeper client. There are many threads with stacktrace like this (they are waiting on push_request_mutex):

...
#5  0x0000000038a1af7b in std::__1::lock_guard<std::__1::mutex>::lock_guard (this=0x7f37d7f1ff40, __m=...) at ../contrib/libcxx/include/__mutex_base:91
#6  Coordination::ZooKeeper::pushRequest (this=this@entry=0x718002170800, info=...) at ../src/Common/ZooKeeper/ZooKeeperImpl.cpp:1024
...

One thread (LWP 11707) owns push_request_mutex and it is waiting on ConcurrentBoundedQueue::empty_count (queue size is 1 and queue is full; btw, why size is 1?):

...
#2  0x0000000038a33a9a in Poco::Semaphore::tryWait (this=0x718002170a78, milliseconds=0) at ../contrib/poco/Foundation/include/Poco/Semaphore.h:130
#3  ConcurrentBoundedQueue<Coordination::ZooKeeper::RequestInfo>::tryPush (this=0x7180021709c0, x=..., milliseconds=137) at ../src/Common/ConcurrentBoundedQueue.h:64
#4  0x0000000038a1b3df in Coordination::ZooKeeper::pushRequest (this=this@entry=0x718002170800, info=...) at ../src/Common/ZooKeeper/ZooKeeperImpl.cpp:1029
...

And another thread (LWP 649) tries to finalize connection, but it cannot lock push_request_mutex and therefore cannot set expired flag and cannot clear requests_queue to make LWP 11707 release the lock:

...
#4  std::__1::mutex::lock (this=0x718002170998) at ../contrib/libcxx/src/mutex.cpp:33
#5  0x0000000038a060e0 in std::__1::lock_guard<std::__1::mutex>::lock_guard (this=0x7f37abddfce0, __m=...) at ../contrib/libcxx/include/__mutex_base:91
#6  Coordination::ZooKeeper::finalize(bool, bool)::$_3::operator()() const (this=<optimized out>) at ../src/Common/ZooKeeper/ZooKeeperImpl.cpp:842
#7  Coordination::ZooKeeper::finalize (this=<optimized out>, error_send=<optimized out>, error_receive=false) at ../src/Common/ZooKeeper/ZooKeeperImpl.cpp:873
#8  0x0000000038a36b5e in Coordination::ZooKeeper::finalize (this=0x718002170998) at ../src/Common/ZooKeeper/ZooKeeperImpl.h:190
...

But there is operation_timeout for ConcurrentBoundedQueue::tryPush(...) that should be equal to 30000 or 10000 milliseconds, so such deadlock is not possible. However, gdb shows strange values of milliseconds argument:

#2 ... Poco::Semaphore::tryWait (this=0x711000004b58, milliseconds=0)
#3 ... ConcurrentBoundedQueue<...>::tryPush (this=0x7180021709c0, x=..., milliseconds=137)

It just cannot be 137. And event if it is somehow, it cannot be 0 in the next function. And even if it is somehow, tryWait(...) will not block with 0 timeout.
So I have two hypothesis:

  • We have some obscure memory-safety issue (and gdb shows arguments correctly). I hope it's not true, because it's impossible to debug such issues.
  • It's actually a livelock caused by starvation of the thread 649. Too many threads are trying to lock push_request_mutex and push some request to the queue, each one locks push_request_mutex, waits for 10 or 30 seconds in tryPush(...), fails and releases the lock, but the thread 649 never gets the lock due to unfair implementation of std::mutex (and gdb shows incorrect values for some reason). As a hotfix we can set and check the expired flag before locking the mutex.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-not-for-changelog This PR should not be mentioned in the changelog

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants