Skip to content

Fix possible deadlock with OvercommitTracker and send_logs_level#38990

Closed
azat wants to merge 1 commit intoClickHouse:masterfrom
azat:fix-logging-deadlock
Closed

Fix possible deadlock with OvercommitTracker and send_logs_level#38990
azat wants to merge 1 commit intoClickHouse:masterfrom
azat:fix-logging-deadlock

Conversation

@azat
Copy link
Copy Markdown
Member

@azat azat commented Jul 8, 2022

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in official stable or prestable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Fix possible deadlock with OvercommitTracker and send_logs_level

CI founds the following deadlock 1:

Details
Thread 1198 (Thread 0x7f268d800700 (LWP 53404)):
4  std::__1::mutex::lock (this=0x70f000b48a18) at ../contrib/libcxx/src/mutex.cpp:33
5  0x000000000c095566 in std::__1::unique_lock<>::unique_lock (this=0x7f268d7f5970, __m=...) at ../contrib/libcxx/include/__mutex_base:119
6  ConcurrentBoundedQueue<>::emplaceImpl<> () at ../src/Common/ConcurrentBoundedQueue.h:35
7  0x000000000c09245f in ConcurrentBoundedQueue<>::emplace<> (args=..., this=<optimized out>) at ../src/Common/ConcurrentBoundedQueue.h:112
8  DB::OwnSplitChannel::logSplit (this=this@entry=0x707000001f10, msg=...) at ../src/Loggers/OwnSplitChannel.cpp:110
9  0x000000000c091172 in DB::OwnSplitChannel::tryLogSplit (this=0x70f000b48a18, this@entry=0x707000001f10, msg=...) at ../src/Loggers/OwnSplitChannel.cpp:51
10 0x000000000c090e8e in DB::OwnSplitChannel::log (this=0x70f000b48a18, msg=...) at ../src/Loggers/OwnSplitChannel.cpp:43
11 0x000000000ba56bb8 in UserOvercommitTracker::pickQueryToExcludeImpl (this=0x713000160060) at ../src/Common/OvercommitTracker.cpp:181
12 0x000000000ba52766 in OvercommitTracker::pickQueryToExclude (this=0x713000160060) at ../src/Common/OvercommitTracker.h:97
13 OvercommitTracker::needToStopQuery (this=0x713000160060, tracker=0x714003cbb660, amount=0) at ../src/Common/OvercommitTracker.cpp:65

Thread 1014 (Thread 0x7f2698e59700 (LWP 39358)):
4  std::__1::mutex::lock (this=0x713000160068) at ../contrib/libcxx/src/mutex.cpp:33
5  0x000000000ba543a2 in std::__1::lock_guard<std::__1::mutex>::lock_guard (this=0x7f2698e4ea50, __m=...) at ../contrib/libcxx/include/__mutex_base:91
6  OvercommitTracker::tryContinueQueryExecutionAfterFree (this=0x713000160060, amount=1049984) at ../src/Common/OvercommitTracker.cpp:127
7  0x000000000b9c64a0 in MemoryTracker::free (this=0x71300015fff0, size=size@entry=1049984) at ../src/Common/MemoryTracker.cpp:327
8  0x000000000b9c6507 in MemoryTracker::free (this=0x714003cbb660, size=size@entry=1049984) at ../src/Common/MemoryTracker.cpp:330
9  0x000000000b9c6507 in MemoryTracker::free (this=0x7f2698e4f5b0, size=1049984) at ../src/Common/MemoryTracker.cpp:330
10 0x000000000ba13667 in CurrentMemoryTracker::free (size=<optimized out>) at ../src/Common/CurrentMemoryTracker.cpp:106
11 0x000000000ba0aa35 in Allocator<false, false>::free (this=0x703000f9e290, buf=0x720010f52000, size=4096) at ../src/Common/Allocator.h:106
12 0x0000000041a45876 in DB::PODArrayBase<1ul, 4096ul, Allocator<false, false>, 15ul, 16ul>::dealloc (this=0x703000f9e290) at ../src/Common/PODArray.h:145
13 DB::PODArrayBase<1ul, 4096ul, Allocator<false, false>, 15ul, 16ul>::~PODArrayBase (this=0x703000f9e290) at ../src/Common/PODArray.h:310
14 DB::ColumnVector<signed char>::~ColumnVector (this=0x703000f9e280) at ../src/Columns/ColumnVector.h:416
...
28 detail::moveOrCopyIfThrow<> (src=..., dst=...) at ../base/base/../base/MoveOrCopyIfThrow.h:31
29 ConcurrentBoundedQueue<>::popImpl (this=0x70f000b489e8, x=..., timeout_milliseconds=...) at ../src/Common/ConcurrentBoundedQueue.h:83
30 0x0000000043f8b107 in ConcurrentBoundedQueue<>::tryPop (this=0x70f000b489e8, x=..., milliseconds=0) at ../src/Common/ConcurrentBoundedQueue.h:143
31 DB::TCPHandler::sendLogs (this=this@entry=0x71a000901200) at ../src/Server/TCPHandler.cpp:1778
32 0x0000000043f89163 in DB::TCPHandler::processOrdinaryQueryWithProcessors (this=0x713000160068, this@entry=0x71a000901200) at ../src/Server/TCPHandler.cpp:719

Follow-up for: #38246 (cc @novikd )
Follow-up for: #37299
Fixes: #37794

CI founds the following deadlock [1]:

    Thread 1198 (Thread 0x7f268d800700 (LWP 53404)):
    4  std::__1::mutex::lock (this=0x70f000b48a18) at ../contrib/libcxx/src/mutex.cpp:33
    5  0x000000000c095566 in std::__1::unique_lock<>::unique_lock (this=0x7f268d7f5970, __m=...) at ../contrib/libcxx/include/__mutex_base:119
    6  ConcurrentBoundedQueue<>::emplaceImpl<> () at ../src/Common/ConcurrentBoundedQueue.h:35
    7  0x000000000c09245f in ConcurrentBoundedQueue<>::emplace<> (args=..., this=<optimized out>) at ../src/Common/ConcurrentBoundedQueue.h:112
    8  DB::OwnSplitChannel::logSplit (this=this@entry=0x707000001f10, msg=...) at ../src/Loggers/OwnSplitChannel.cpp:110
    9  0x000000000c091172 in DB::OwnSplitChannel::tryLogSplit (this=0x70f000b48a18, this@entry=0x707000001f10, msg=...) at ../src/Loggers/OwnSplitChannel.cpp:51
    10 0x000000000c090e8e in DB::OwnSplitChannel::log (this=0x70f000b48a18, msg=...) at ../src/Loggers/OwnSplitChannel.cpp:43
    11 0x000000000ba56bb8 in UserOvercommitTracker::pickQueryToExcludeImpl (this=0x713000160060) at ../src/Common/OvercommitTracker.cpp:181
    12 0x000000000ba52766 in OvercommitTracker::pickQueryToExclude (this=0x713000160060) at ../src/Common/OvercommitTracker.h:97
    13 OvercommitTracker::needToStopQuery (this=0x713000160060, tracker=0x714003cbb660, amount=0) at ../src/Common/OvercommitTracker.cpp:65

    Thread 1014 (Thread 0x7f2698e59700 (LWP 39358)):
    4  std::__1::mutex::lock (this=0x713000160068) at ../contrib/libcxx/src/mutex.cpp:33
    5  0x000000000ba543a2 in std::__1::lock_guard<std::__1::mutex>::lock_guard (this=0x7f2698e4ea50, __m=...) at ../contrib/libcxx/include/__mutex_base:91
    6  OvercommitTracker::tryContinueQueryExecutionAfterFree (this=0x713000160060, amount=1049984) at ../src/Common/OvercommitTracker.cpp:127
    7  0x000000000b9c64a0 in MemoryTracker::free (this=0x71300015fff0, size=size@entry=1049984) at ../src/Common/MemoryTracker.cpp:327
    8  0x000000000b9c6507 in MemoryTracker::free (this=0x714003cbb660, size=size@entry=1049984) at ../src/Common/MemoryTracker.cpp:330
    9  0x000000000b9c6507 in MemoryTracker::free (this=0x7f2698e4f5b0, size=1049984) at ../src/Common/MemoryTracker.cpp:330
    10 0x000000000ba13667 in CurrentMemoryTracker::free (size=<optimized out>) at ../src/Common/CurrentMemoryTracker.cpp:106
    11 0x000000000ba0aa35 in Allocator<false, false>::free (this=0x703000f9e290, buf=0x720010f52000, size=4096) at ../src/Common/Allocator.h:106
    12 0x0000000041a45876 in DB::PODArrayBase<1ul, 4096ul, Allocator<false, false>, 15ul, 16ul>::dealloc (this=0x703000f9e290) at ../src/Common/PODArray.h:145
    13 DB::PODArrayBase<1ul, 4096ul, Allocator<false, false>, 15ul, 16ul>::~PODArrayBase (this=0x703000f9e290) at ../src/Common/PODArray.h:310
    14 DB::ColumnVector<signed char>::~ColumnVector (this=0x703000f9e280) at ../src/Columns/ColumnVector.h:416
    ...
    28 detail::moveOrCopyIfThrow<> (src=..., dst=...) at ../base/base/../base/MoveOrCopyIfThrow.h:31
    29 ConcurrentBoundedQueue<>::popImpl (this=0x70f000b489e8, x=..., timeout_milliseconds=...) at ../src/Common/ConcurrentBoundedQueue.h:83
    30 0x0000000043f8b107 in ConcurrentBoundedQueue<>::tryPop (this=0x70f000b489e8, x=..., milliseconds=0) at ../src/Common/ConcurrentBoundedQueue.h:143
    31 DB::TCPHandler::sendLogs (this=this@entry=0x71a000901200) at ../src/Server/TCPHandler.cpp:1778
    32 0x0000000043f89163 in DB::TCPHandler::processOrdinaryQueryWithProcessors (this=0x713000160068, this@entry=0x71a000901200) at ../src/Server/TCPHandler.cpp:719

  [1]: https://s3.amazonaws.com/clickhouse-test-reports/38750/40db496ebb7279850be06fabe2eafcc30a2a04cd/stress_test__memory__actions_.html

Signed-off-by: Azat Khuzhin <[email protected]>
@azat azat force-pushed the fix-logging-deadlock branch from e1a5fb5 to 35eeb5a Compare July 8, 2022 06:47
@robot-clickhouse robot-clickhouse added the pr-bugfix Pull request with bugfix, not backported by default label Jul 8, 2022
Comment on lines +110 to 113
logs_queue->emplace(std::move(columns));
}

/// Also log to system.text_log table, if message is not too noisy
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It solves the issue with InternalTextLogsQueue::queue::queue_mutex, but this thread will write message into system.text_log and lock other mutexes (e.g. text_log_mutex and SystemLogBase::mutex), so seems like similar lock-order-inversions are still possible if text_log try to allocate memory and get into OvercommitTracker

@tavplubix tavplubix self-assigned this Jul 8, 2022
@tavplubix
Copy link
Copy Markdown
Member

@novikd, what do you think about moving logging out of the critical methods of OvercommitTracker? OvercommitTracker can pre-allocate some memory to save some debug info, call a critical method, fill debug info and log it after returning from the method.

@novikd
Copy link
Copy Markdown
Member

novikd commented Jul 8, 2022

After a conversation with @tavplubix and @KochetovNicolai I decided it's better just to remove all logging from OvercommitTracker.

@novikd novikd closed this Jul 8, 2022
@azat azat deleted the fix-logging-deadlock branch July 10, 2022 16:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-bugfix Pull request with bugfix, not backported by default

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Deadlock in ProcessList

4 participants