Skip to content

Accept connections by keeper regardless memory limits#28304

Closed
azat wants to merge 2 commits intoClickHouse:masterfrom
azat:keeper-ignore-memory-limit
Closed

Accept connections by keeper regardless memory limits#28304
azat wants to merge 2 commits intoClickHouse:masterfrom
azat:keeper-ignore-memory-limit

Conversation

@azat
Copy link
Copy Markdown
Member

@azat azat commented Aug 28, 2021

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Accept connections by keeper regardless memory limits

Detailed description:
CI report one hung check failure 1.

The problem is that keeper server cannot allocate memory for allocating
connection:

2021.08.27 12:39:24.593932 [ 13725 ] {} <Error> ServerErrorHandler: Code: 241. DB::Exception: Memory limit (total) exceeded: would use 81.30 GiB (attempt to allocate chunk of 1048591 bytes), maximum: 62.94 GiB. (MEMORY_LIMIT_EXCEEDED), Stack trace (when copying this message, always include the lines below):

0. ./obj-x86_64-linux-gnu/../contrib/libcxx/include/exception:0: Poco::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) @ 0x18f0cf3b in /usr/bin/clickhouse
1. ./obj-x86_64-linux-gnu/../src/Common/Exception.cpp:59: DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, bool) @ 0x999f7dc in /usr/bin/clickhouse
2. ./obj-x86_64-linux-gnu/../contrib/libcxx/include/string:1444: DB::Exception::Exception<char const*, char const*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, long&, std::__1::basic_string<char, std::__1::char_t
raits<char>, std::__1::allocator<char> > >(int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, char const*&&, char const*&&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>
 >&&, long&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >&&) @ 0x9986a6b in /usr/bin/clickhouse
3. ./obj-x86_64-linux-gnu/../src/Common/MemoryTracker.cpp:223: MemoryTracker::allocImpl(long, bool) @ 0x998595e in /usr/bin/clickhouse
4. ./obj-x86_64-linux-gnu/../src/Common/CurrentMemoryTracker.cpp:56: CurrentMemoryTracker::(anonymous namespace)::allocImpl(long, bool) @ 0x99c0fca in /usr/bin/clickhouse
5. ./obj-x86_64-linux-gnu/../src/Common/CurrentMemoryTracker.cpp:69: CurrentMemoryTracker::alloc(long) @ 0x99c0e5f in /usr/bin/clickhouse
6. ./obj-x86_64-linux-gnu/../src/Common/Allocator.h:96: DB::Memory<Allocator<false, false> >::alloc() @ 0x99c0d6d in /usr/bin/clickhouse
7. ./obj-x86_64-linux-gnu/../src/IO/BufferWithOwnMemory.h:0: DB::ReadBufferFromFileBase::ReadBufferFromFileBase(unsigned long, char*, unsigned long) @ 0x99d25bb in /usr/bin/clickhouse
8. ./obj-x86_64-linux-gnu/../src/IO/ReadBufferFromFileDescriptor.h:31: DB::SocketInterruptablePollWrapper::SocketInterruptablePollWrapper(Poco::Net::StreamSocket const&) @ 0x15a842b6 in /usr/bin/clickhouse
9. ./obj-x86_64-linux-gnu/../src/Server/KeeperTCPHandler.cpp:0: DB::KeeperTCPHandler::KeeperTCPHandler(DB::IServer&, Poco::Net::StreamSocket const&) @ 0x15a7fa83 in /usr/bin/clickhouse
10. ./obj-x86_64-linux-gnu/../src/Server/KeeperTCPHandlerFactory.h:43: DB::KeeperTCPHandlerFactory::createConnection(Poco::Net::StreamSocket const&) @ 0x9980a14 in /usr/bin/clickhouse
11. ./obj-x86_64-linux-gnu/../contrib/poco/Net/src/TCPServerDispatcher.cpp:0: Poco::Net::TCPServerDispatcher::run() @ 0x18e2b3ac in /usr/bin/clickhouse

TL;DR;

Details

One of hunged queries:

2021.08.27 12:46:16.842126 [ 15965 ] {42aaf2e2-ea0d-4533-a2aa-4a787516d419} <Debug> DatabaseCatalog: Waiting for table 8cacf4ff-5867-4b00-8cac-f4ff58672b00 to be finally dropped
2021.08.27 12:48:23.689859 [ 535 ] {} <Information> DatabaseCatalog: Have 15 tables in drop queue (0 of them are in use), will try drop test_3.parallel_q1 (8cacf4ff-5867-4b00-8cac-f4ff58672b00)
2021.08.27 12:48:23.702307 [ 535 ] {} <Error> DatabaseCatalog: Cannot drop table test_3.parallel_q1 (8cacf4ff-5867-4b00-8cac-f4ff58672b00). Will retry later.: Code: 999. Coordination::Exception: All connection tries failed while connecting to ZooKeeper. nodes: [::1]:9181

Keeper threads:

6  ConcurrentBoundedQueue<DB::KeeperStorage::RequestForSession>::tryPop (this=0x7b4400004ec0, x=..., milliseconds=137, milliseconds@entry=10000) at ../src/Common/ConcurrentBoundedQueue.h:93
7  0x0000000015e66eb8 in DB::KeeperDispatcher::requestThread (this=<optimized out>, this@entry=0x7b5800000018) at ../src/Coordination/KeeperDispatcher.cpp:48
8  0x0000000015e6d151 in DB::KeeperDispatcher::initialize(Poco::Util::AbstractConfiguration const&, bool)::$_0::operator()() const (this=<optimized out>) at ../src/Coordination/KeeperDispatcher.cpp:252
6  ConcurrentBoundedQueue<DB::CreateSnapshotTask>::pop (this=this@entry=0x7b58000000e0, x=...) at ../src/Common/ConcurrentBoundedQueue.h:53
7  0x0000000015e68564 in DB::KeeperDispatcher::snapshotThread (this=<optimized out>, this@entry=0x7b5800000018) at ../src/Coordination/KeeperDispatcher.cpp:167
8  0x0000000015e6d691 in DB::KeeperDispatcher::initialize(Poco::Util::AbstractConfiguration const&, bool)::$_2::operator()() const (this=<optimized out>) at ../src/Coordination/KeeperDispatcher.cpp:254
5  0x0000000015e6bb8a in std::__1::this_thread::sleep_for<long long, std::__1::ratio<1l, 1000l> > (__d=...) at ../contrib/libcxx/include/thread:380
6  DB::KeeperDispatcher::sessionCleanerTask (this=this@entry=0x7b5800000018) at ../src/Coordination/KeeperDispatcher.cpp:394
7  0x0000000015e6d931 in DB::KeeperDispatcher::initialize(Poco::Util::AbstractConfiguration const&, bool)::$_3::operator()() const (this=<optimized out>) at ../src/Coordination/KeeperDispatcher.cpp:275
11 DB::ThreadSafeQueue<DB::KeeperStorage::ResponseForSession>::tryPop (this=this@entry=0x7b5800000058, response=..., timeout_ms=<optimized out>) at ../src/Coordination/ThreadSafeQueue.h:29
12 0x0000000015e67dc4 in DB::KeeperDispatcher::responseThread (this=<optimized out>, this@entry=0x7b5800000018) at ../src/Coordination/KeeperDispatcher.cpp:144
13 0x0000000015e6d3f1 in DB::KeeperDispatcher::initialize(Poco::Util::AbstractConfiguration const&, bool)::$_1::operator()() const (this=<optimized out>) at ../src/Coordination/KeeperDispatcher.cpp:253

Cc: @alesapin

NOTE: Consider adding pr-must-backport label, looks useful when keeper is used as a builtin

@robot-clickhouse robot-clickhouse added pr-bugfix Pull request with bugfix, not backported by default pr-improvement Pull request with some product improvements and removed pr-bugfix Pull request with bugfix, not backported by default labels Aug 28, 2021
azat added 2 commits August 29, 2021 10:15
CI report one hung check failure [1].

The problem is that keeper server cannot allocate memory for allocating
connection:

```
2021.08.27 12:39:24.593932 [ 13725 ] {} <Error> ServerErrorHandler: Code: 241. DB::Exception: Memory limit (total) exceeded: would use 81.30 GiB (attempt to allocate chunk of 1048591 bytes), maximum: 62.94 GiB. (MEMORY_LIMIT_EXCEEDED), Stack trace (when copying this message, always include the lines below):

0. ./obj-x86_64-linux-gnu/../contrib/libcxx/include/exception:0: Poco::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) @ 0x18f0cf3b in /usr/bin/clickhouse
1. ./obj-x86_64-linux-gnu/../src/Common/Exception.cpp:59: DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, bool) @ 0x999f7dc in /usr/bin/clickhouse
2. ./obj-x86_64-linux-gnu/../contrib/libcxx/include/string:1444: DB::Exception::Exception<char const*, char const*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, long&, std::__1::basic_string<char, std::__1::char_t
raits<char>, std::__1::allocator<char> > >(int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, char const*&&, char const*&&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>
 >&&, long&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >&&) @ 0x9986a6b in /usr/bin/clickhouse
3. ./obj-x86_64-linux-gnu/../src/Common/MemoryTracker.cpp:223: MemoryTracker::allocImpl(long, bool) @ 0x998595e in /usr/bin/clickhouse
4. ./obj-x86_64-linux-gnu/../src/Common/CurrentMemoryTracker.cpp:56: CurrentMemoryTracker::(anonymous namespace)::allocImpl(long, bool) @ 0x99c0fca in /usr/bin/clickhouse
5. ./obj-x86_64-linux-gnu/../src/Common/CurrentMemoryTracker.cpp:69: CurrentMemoryTracker::alloc(long) @ 0x99c0e5f in /usr/bin/clickhouse
6. ./obj-x86_64-linux-gnu/../src/Common/Allocator.h:96: DB::Memory<Allocator<false, false> >::alloc() @ 0x99c0d6d in /usr/bin/clickhouse
7. ./obj-x86_64-linux-gnu/../src/IO/BufferWithOwnMemory.h:0: DB::ReadBufferFromFileBase::ReadBufferFromFileBase(unsigned long, char*, unsigned long) @ 0x99d25bb in /usr/bin/clickhouse
8. ./obj-x86_64-linux-gnu/../src/IO/ReadBufferFromFileDescriptor.h:31: DB::SocketInterruptablePollWrapper::SocketInterruptablePollWrapper(Poco::Net::StreamSocket const&) @ 0x15a842b6 in /usr/bin/clickhouse
9. ./obj-x86_64-linux-gnu/../src/Server/KeeperTCPHandler.cpp:0: DB::KeeperTCPHandler::KeeperTCPHandler(DB::IServer&, Poco::Net::StreamSocket const&) @ 0x15a7fa83 in /usr/bin/clickhouse
10. ./obj-x86_64-linux-gnu/../src/Server/KeeperTCPHandlerFactory.h:43: DB::KeeperTCPHandlerFactory::createConnection(Poco::Net::StreamSocket const&) @ 0x9980a14 in /usr/bin/clickhouse
11. ./obj-x86_64-linux-gnu/../contrib/poco/Net/src/TCPServerDispatcher.cpp:0: Poco::Net::TCPServerDispatcher::run() @ 0x18e2b3ac in /usr/bin/clickhouse
```

TL;DR;

<details>

One of hunged queries:

```
2021.08.27 12:46:16.842126 [ 15965 ] {42aaf2e2-ea0d-4533-a2aa-4a787516d419} <Debug> DatabaseCatalog: Waiting for table 8cacf4ff-5867-4b00-8cac-f4ff58672b00 to be finally dropped
```

```
2021.08.27 12:48:23.689859 [ 535 ] {} <Information> DatabaseCatalog: Have 15 tables in drop queue (0 of them are in use), will try drop test_3.parallel_q1 (8cacf4ff-5867-4b00-8cac-f4ff58672b00)
2021.08.27 12:48:23.702307 [ 535 ] {} <Error> DatabaseCatalog: Cannot drop table test_3.parallel_q1 (8cacf4ff-5867-4b00-8cac-f4ff58672b00). Will retry later.: Code: 999. Coordination::Exception: All connection tries failed while connecting to ZooKeeper. nodes: [::1]:9181
```

Keeper threads:

```
6  ConcurrentBoundedQueue<DB::KeeperStorage::RequestForSession>::tryPop (this=0x7b4400004ec0, x=..., milliseconds=137, milliseconds@entry=10000) at ../src/Common/ConcurrentBoundedQueue.h:93
7  0x0000000015e66eb8 in DB::KeeperDispatcher::requestThread (this=<optimized out>, this@entry=0x7b5800000018) at ../src/Coordination/KeeperDispatcher.cpp:48
8  0x0000000015e6d151 in DB::KeeperDispatcher::initialize(Poco::Util::AbstractConfiguration const&, bool)::$_0::operator()() const (this=<optimized out>) at ../src/Coordination/KeeperDispatcher.cpp:252
```

```
6  ConcurrentBoundedQueue<DB::CreateSnapshotTask>::pop (this=this@entry=0x7b58000000e0, x=...) at ../src/Common/ConcurrentBoundedQueue.h:53
7  0x0000000015e68564 in DB::KeeperDispatcher::snapshotThread (this=<optimized out>, this@entry=0x7b5800000018) at ../src/Coordination/KeeperDispatcher.cpp:167
8  0x0000000015e6d691 in DB::KeeperDispatcher::initialize(Poco::Util::AbstractConfiguration const&, bool)::$_2::operator()() const (this=<optimized out>) at ../src/Coordination/KeeperDispatcher.cpp:254
```

```
5  0x0000000015e6bb8a in std::__1::this_thread::sleep_for<long long, std::__1::ratio<1l, 1000l> > (__d=...) at ../contrib/libcxx/include/thread:380
6  DB::KeeperDispatcher::sessionCleanerTask (this=this@entry=0x7b5800000018) at ../src/Coordination/KeeperDispatcher.cpp:394
7  0x0000000015e6d931 in DB::KeeperDispatcher::initialize(Poco::Util::AbstractConfiguration const&, bool)::$_3::operator()() const (this=<optimized out>) at ../src/Coordination/KeeperDispatcher.cpp:275
```

```
11 DB::ThreadSafeQueue<DB::KeeperStorage::ResponseForSession>::tryPop (this=this@entry=0x7b5800000058, response=..., timeout_ms=<optimized out>) at ../src/Coordination/ThreadSafeQueue.h:29
12 0x0000000015e67dc4 in DB::KeeperDispatcher::responseThread (this=<optimized out>, this@entry=0x7b5800000018) at ../src/Coordination/KeeperDispatcher.cpp:144
13 0x0000000015e6d3f1 in DB::KeeperDispatcher::initialize(Poco::Util::AbstractConfiguration const&, bool)::$_1::operator()() const (this=<optimized out>) at ../src/Coordination/KeeperDispatcher.cpp:253
```

</details>

  [1]: https://clickhouse-test-reports.s3.yandex.net/28221/19458a0c61b9477e8cad28ac1c3f9bd4e01afa5e/stress_test_(thread).htmlfail1
@azat azat force-pushed the keeper-ignore-memory-limit branch from f6a1526 to 242c511 Compare August 29, 2021 07:15
@azat
Copy link
Copy Markdown
Member Author

azat commented Aug 29, 2021

ClickHouse build check — 13/15 builds are OK

Link to the build log for failed builds points to the html report itself. Any clue?

@alexey-milovidov
Copy link
Copy Markdown
Member

I don't understand the reasoning - why should we do this?

@alexey-milovidov alexey-milovidov added the st-discussion When the implementation aspects are not clear or when the PR is on hold due to questions. label Aug 29, 2021
@azat
Copy link
Copy Markdown
Member Author

azat commented Aug 29, 2021

I don't understand the reasoning - why should we do this?

@alexey-milovidov right now the keeper will just stop accept new connections in case of the server will run out of memory, and what the client will see is just connection timeout, like keeper server hung somewhere (client may think that the server stops accepting connections, i.e. busy loop, lock or similar), while after it will start responding to clients (at least to simple requests like list of nodes)

@alexey-milovidov
Copy link
Copy Markdown
Member

But the memory limit still makes sense.

Why memory is not freed after the limit is reached? Why the system remains in the state where memory is not freed?

@azat
Copy link
Copy Markdown
Member Author

azat commented Aug 29, 2021

In my opinion, server (and not only keeper) should accept connections and reply something sane (at least some error) instead of simply filling socket backlog (always, not only in case of memory pressure).

Why memory is not freed after the limit is reached? Why the system remains in the state where memory is not freed?

Initially it has been triggered in stress tests

And there was some misaccounting:

2021.08.27 12:39:22.001066 [ 646 ] {} <Trace> AsynchronousMetrics: MemoryTracking: was 63.81 GiB, peak 66.38 GiB, will set to 81.28 GiB (RSS), difference: 17.47 GiB
2021.08.27 12:39:22.001506 [ 646 ] {} <Debug> MemoryTracker: Current memory usage (total): 81.28 GiB.

So when the clickhouse-server synced it's memory tracker information with RSS, it undercommit 17.47GiB, so after this lots of allocations will fail until it will free enough memory (will take a closer look later on this).

@alexey-milovidov
Copy link
Copy Markdown
Member

Moving code to .cpp LGTM.
The patch adding MemoryTracker::BlockerInThread is questionable - it makes code more complicated and less aproachable by new contributors (they will wonder about this complicated logic and be frightened).

Let's ask @alesapin

@alexey-milovidov
Copy link
Copy Markdown
Member

alexey-milovidov commented Aug 29, 2021

filling socket backlog

It's better if server will refuse or close connection.

@alesapin
Copy link
Copy Markdown
Member

Agree with @alexey-milovidov. In low-free-memory case clickhouse-keeper will be unstable. So I'd prefer not to allow new connections at all.

@azat
Copy link
Copy Markdown
Member Author

azat commented Aug 30, 2021

Ok, closing then.

@azat azat closed this Aug 30, 2021
@azat azat deleted the keeper-ignore-memory-limit branch November 19, 2022 11:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-improvement Pull request with some product improvements st-discussion When the implementation aspects are not clear or when the PR is on hold due to questions.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants