Accept connections by keeper regardless memory limits#28304
Accept connections by keeper regardless memory limits#28304azat wants to merge 2 commits intoClickHouse:masterfrom
Conversation
CI report one hung check failure [1].
The problem is that keeper server cannot allocate memory for allocating
connection:
```
2021.08.27 12:39:24.593932 [ 13725 ] {} <Error> ServerErrorHandler: Code: 241. DB::Exception: Memory limit (total) exceeded: would use 81.30 GiB (attempt to allocate chunk of 1048591 bytes), maximum: 62.94 GiB. (MEMORY_LIMIT_EXCEEDED), Stack trace (when copying this message, always include the lines below):
0. ./obj-x86_64-linux-gnu/../contrib/libcxx/include/exception:0: Poco::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) @ 0x18f0cf3b in /usr/bin/clickhouse
1. ./obj-x86_64-linux-gnu/../src/Common/Exception.cpp:59: DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, bool) @ 0x999f7dc in /usr/bin/clickhouse
2. ./obj-x86_64-linux-gnu/../contrib/libcxx/include/string:1444: DB::Exception::Exception<char const*, char const*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, long&, std::__1::basic_string<char, std::__1::char_t
raits<char>, std::__1::allocator<char> > >(int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, char const*&&, char const*&&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>
>&&, long&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >&&) @ 0x9986a6b in /usr/bin/clickhouse
3. ./obj-x86_64-linux-gnu/../src/Common/MemoryTracker.cpp:223: MemoryTracker::allocImpl(long, bool) @ 0x998595e in /usr/bin/clickhouse
4. ./obj-x86_64-linux-gnu/../src/Common/CurrentMemoryTracker.cpp:56: CurrentMemoryTracker::(anonymous namespace)::allocImpl(long, bool) @ 0x99c0fca in /usr/bin/clickhouse
5. ./obj-x86_64-linux-gnu/../src/Common/CurrentMemoryTracker.cpp:69: CurrentMemoryTracker::alloc(long) @ 0x99c0e5f in /usr/bin/clickhouse
6. ./obj-x86_64-linux-gnu/../src/Common/Allocator.h:96: DB::Memory<Allocator<false, false> >::alloc() @ 0x99c0d6d in /usr/bin/clickhouse
7. ./obj-x86_64-linux-gnu/../src/IO/BufferWithOwnMemory.h:0: DB::ReadBufferFromFileBase::ReadBufferFromFileBase(unsigned long, char*, unsigned long) @ 0x99d25bb in /usr/bin/clickhouse
8. ./obj-x86_64-linux-gnu/../src/IO/ReadBufferFromFileDescriptor.h:31: DB::SocketInterruptablePollWrapper::SocketInterruptablePollWrapper(Poco::Net::StreamSocket const&) @ 0x15a842b6 in /usr/bin/clickhouse
9. ./obj-x86_64-linux-gnu/../src/Server/KeeperTCPHandler.cpp:0: DB::KeeperTCPHandler::KeeperTCPHandler(DB::IServer&, Poco::Net::StreamSocket const&) @ 0x15a7fa83 in /usr/bin/clickhouse
10. ./obj-x86_64-linux-gnu/../src/Server/KeeperTCPHandlerFactory.h:43: DB::KeeperTCPHandlerFactory::createConnection(Poco::Net::StreamSocket const&) @ 0x9980a14 in /usr/bin/clickhouse
11. ./obj-x86_64-linux-gnu/../contrib/poco/Net/src/TCPServerDispatcher.cpp:0: Poco::Net::TCPServerDispatcher::run() @ 0x18e2b3ac in /usr/bin/clickhouse
```
TL;DR;
<details>
One of hunged queries:
```
2021.08.27 12:46:16.842126 [ 15965 ] {42aaf2e2-ea0d-4533-a2aa-4a787516d419} <Debug> DatabaseCatalog: Waiting for table 8cacf4ff-5867-4b00-8cac-f4ff58672b00 to be finally dropped
```
```
2021.08.27 12:48:23.689859 [ 535 ] {} <Information> DatabaseCatalog: Have 15 tables in drop queue (0 of them are in use), will try drop test_3.parallel_q1 (8cacf4ff-5867-4b00-8cac-f4ff58672b00)
2021.08.27 12:48:23.702307 [ 535 ] {} <Error> DatabaseCatalog: Cannot drop table test_3.parallel_q1 (8cacf4ff-5867-4b00-8cac-f4ff58672b00). Will retry later.: Code: 999. Coordination::Exception: All connection tries failed while connecting to ZooKeeper. nodes: [::1]:9181
```
Keeper threads:
```
6 ConcurrentBoundedQueue<DB::KeeperStorage::RequestForSession>::tryPop (this=0x7b4400004ec0, x=..., milliseconds=137, milliseconds@entry=10000) at ../src/Common/ConcurrentBoundedQueue.h:93
7 0x0000000015e66eb8 in DB::KeeperDispatcher::requestThread (this=<optimized out>, this@entry=0x7b5800000018) at ../src/Coordination/KeeperDispatcher.cpp:48
8 0x0000000015e6d151 in DB::KeeperDispatcher::initialize(Poco::Util::AbstractConfiguration const&, bool)::$_0::operator()() const (this=<optimized out>) at ../src/Coordination/KeeperDispatcher.cpp:252
```
```
6 ConcurrentBoundedQueue<DB::CreateSnapshotTask>::pop (this=this@entry=0x7b58000000e0, x=...) at ../src/Common/ConcurrentBoundedQueue.h:53
7 0x0000000015e68564 in DB::KeeperDispatcher::snapshotThread (this=<optimized out>, this@entry=0x7b5800000018) at ../src/Coordination/KeeperDispatcher.cpp:167
8 0x0000000015e6d691 in DB::KeeperDispatcher::initialize(Poco::Util::AbstractConfiguration const&, bool)::$_2::operator()() const (this=<optimized out>) at ../src/Coordination/KeeperDispatcher.cpp:254
```
```
5 0x0000000015e6bb8a in std::__1::this_thread::sleep_for<long long, std::__1::ratio<1l, 1000l> > (__d=...) at ../contrib/libcxx/include/thread:380
6 DB::KeeperDispatcher::sessionCleanerTask (this=this@entry=0x7b5800000018) at ../src/Coordination/KeeperDispatcher.cpp:394
7 0x0000000015e6d931 in DB::KeeperDispatcher::initialize(Poco::Util::AbstractConfiguration const&, bool)::$_3::operator()() const (this=<optimized out>) at ../src/Coordination/KeeperDispatcher.cpp:275
```
```
11 DB::ThreadSafeQueue<DB::KeeperStorage::ResponseForSession>::tryPop (this=this@entry=0x7b5800000058, response=..., timeout_ms=<optimized out>) at ../src/Coordination/ThreadSafeQueue.h:29
12 0x0000000015e67dc4 in DB::KeeperDispatcher::responseThread (this=<optimized out>, this@entry=0x7b5800000018) at ../src/Coordination/KeeperDispatcher.cpp:144
13 0x0000000015e6d3f1 in DB::KeeperDispatcher::initialize(Poco::Util::AbstractConfiguration const&, bool)::$_1::operator()() const (this=<optimized out>) at ../src/Coordination/KeeperDispatcher.cpp:253
```
</details>
[1]: https://clickhouse-test-reports.s3.yandex.net/28221/19458a0c61b9477e8cad28ac1c3f9bd4e01afa5e/stress_test_(thread).htmlfail1
f6a1526 to
242c511
Compare
Link to the build log for failed builds points to the html report itself. Any clue? |
|
I don't understand the reasoning - why should we do this? |
@alexey-milovidov right now the keeper will just stop accept new connections in case of the server will run out of memory, and what the client will see is just connection timeout, like keeper server hung somewhere (client may think that the server stops accepting connections, i.e. busy loop, lock or similar), while after it will start responding to clients (at least to simple requests like list of nodes) |
|
But the memory limit still makes sense. Why memory is not freed after the limit is reached? Why the system remains in the state where memory is not freed? |
|
In my opinion, server (and not only keeper) should accept connections and reply something sane (at least some error) instead of simply filling socket backlog (always, not only in case of memory pressure).
Initially it has been triggered in stress tests And there was some misaccounting: So when the clickhouse-server synced it's memory tracker information with RSS, it undercommit 17.47GiB, so after this lots of allocations will fail until it will free enough memory (will take a closer look later on this). |
|
Moving code to .cpp LGTM. Let's ask @alesapin |
It's better if server will refuse or close connection. |
|
Agree with @alexey-milovidov. In low-free-memory case |
|
Ok, closing then. |
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Accept connections by keeper regardless memory limits
Detailed description:
CI report one hung check failure 1.
The problem is that keeper server cannot allocate memory for allocating
connection:
TL;DR;
Details
One of hunged queries:
Keeper threads:
Cc: @alesapin
NOTE: Consider adding
pr-must-backportlabel, looks useful when keeper is used as a builtin