Skip to content

Disable send_logs_level for INSERT into Distributed to avoid possible hung#35075

Merged
alexey-milovidov merged 1 commit intoClickHouse:masterfrom
azat:dist-insert-hung-fix
Jun 19, 2022
Merged

Disable send_logs_level for INSERT into Distributed to avoid possible hung#35075
alexey-milovidov merged 1 commit intoClickHouse:masterfrom
azat:dist-insert-hung-fix

Conversation

@azat
Copy link
Copy Markdown
Member

@azat azat commented Mar 5, 2022

Changelog category (leave one):

  • Bug Fix (user-visible misbehaviour in official stable or prestable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Disable send_logs_level for INSERT into Distributed to avoid possible hung

In case of INSERT into Distributed table with send_logs_level!=none it
is possible to receive tons of Log packets, and w/o consuming it
properly the socket buffer will be full, and eventually the query will
hung.

This happens because receiver will not read data until it will send logs
packets, but sender does not reads those Log packets and so receiver
hung, and hence the sender will hung too, because receiver do not
consume Data packets anymore.

In the initial version of this patch I tried to properly consume Log
packets, but it is not possible to ensure that before writing Data
blocks all Log packets had been consumed, that said that with current
protocol implementation it is not possible to fix Log packets consuming
properly, to avoid deadlock, so send_logs_level had been simply
disabled.

But note, that this does not differs to the user, in what ClickHouse did
before, since before it simply does not consume those packets, so the
client does not saw those messages anyway.

Details

The receiver:

Poco::Net::SocketImpl::poll(Poco::Timespan const&, int)
Poco::Net::SocketImpl::sendBytes(void const*, int, int)
Poco::Net::StreamSocketImpl::sendBytes(void const*, int, int)
DB::WriteBufferFromPocoSocket::nextImpl()
DB::TCPHandler::sendLogData(DB::Block const&)
DB::TCPHandler::sendLogs()
DB::TCPHandler::readDataNext()
DB::TCPHandler::processInsertQuery()

State      Recv-Q  Send-Q          Local Address:Port         Peer Address:Port Process
ESTAB      4331792 211637           127.0.0.1:9000            127.0.0.1:24446 users:(("clickhouse-serv",pid=46874,fd=3850))

The sender:

Poco::Net::SocketImpl::poll(Poco::Timespan const&, int)
Poco::Net::SocketImpl::sendBytes(void const*, int, int)
Poco::Net::StreamSocketImpl::sendBytes(void const*, int, int)
DB::WriteBufferFromPocoSocket::nextImpl()
DB::WriteBuffer::write(char const*, unsigned long)
DB::CompressedWriteBuffer::nextImpl()
DB::WriteBuffer::write(char const*, unsigned long)
DB::SerializationString::serializeBinaryBulk(DB::IColumn const&, DB::WriteBuffer&, unsigned long, unsigned long) const
DB::NativeWriter::write(DB::Block const&)
DB::Connection::sendData(DB::Block const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, bool)
DB::RemoteInserter::write(DB::Block)
DB::RemoteSink::consume(DB::Chunk)
DB::SinkToStorage::onConsume(DB::Chunk)

State      Recv-Q  Send-Q         Local Address:Port         Peer Address:Port Process
ESTAB      67883   3008240           127.0.0.1:24446           127.0.0.1:9000  users:(("clickhouse-serv",pid=41610,fd=25))

@robot-clickhouse robot-clickhouse added the pr-bugfix Pull request with bugfix, not backported by default label Mar 5, 2022
Copy link
Copy Markdown
Member

@alexey-milovidov alexey-milovidov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@alexey-milovidov alexey-milovidov self-assigned this Mar 20, 2022
@alexey-milovidov
Copy link
Copy Markdown
Member

@Mergifyio update

@mergify
Copy link
Copy Markdown
Contributor

mergify bot commented Mar 20, 2022

update

✅ Branch has been successfully updated

@alexey-milovidov
Copy link
Copy Markdown
Member

There are failed checks.
@Mergifyio update

@azat azat force-pushed the dist-insert-hung-fix branch from f3e0550 to 779e7c8 Compare March 26, 2022 20:52
@alexey-milovidov
Copy link
Copy Markdown
Member

alexey-milovidov commented Apr 1, 2022

@Mergifyio update

@alexey-milovidov
Copy link
Copy Markdown
Member

@azat Checks are failed.

@Mergifyio update

@azat
Copy link
Copy Markdown
Member Author

azat commented Apr 3, 2022

@azat azat marked this pull request as draft April 3, 2022 17:18
@azat azat force-pushed the dist-insert-hung-fix branch from 779e7c8 to 98e1e45 Compare April 3, 2022 20:08
@azat azat marked this pull request as ready for review April 3, 2022 20:10
@alexey-milovidov
Copy link
Copy Markdown
Member

@azat azat marked this pull request as draft April 4, 2022 19:58
@azat azat force-pushed the dist-insert-hung-fix branch from 98e1e45 to 81c3be8 Compare April 25, 2022 13:36
@azat
Copy link
Copy Markdown
Member Author

azat commented Apr 25, 2022

@azat azat marked this pull request as ready for review April 25, 2022 13:36
@azat
Copy link
Copy Markdown
Member Author

azat commented Apr 25, 2022

PullRequestCI / TestsBugfixCheck (pull_request) Failing after 12m — TestsBugfixCheck

Now the test w/o this patch does fails:

2022-04-25T14:48:23.2483089Z 2022-04-25 09:48:23 02232_dist_insert_send_logs_level_hung:                                 [ FAIL ] 600.03 sec. - having stderror: 
2022-04-25T14:48:23.2483588Z 2022-04-25 09:48:23 Timeout dropping database test_bnckem after test

Stateless tests (address, actions) [1/2] — Some queries hung, fail: 1, passed: 2010, skipped: 9

However, the fix is not complete:

"elapsed": 1884.992765112,
"query": "insert into function remote('127.2', currentDatabase(), in_02232) select * from numbers(1e6)",
"thread_ids": [
    "21153",
    "39724",
    "39653"
],

"query": "INSERT INTO test_h88210.in_02232 (key) VALUES",
"thread_ids": [
    "3267",
    "3267"
],

Thread 326 (Thread 0x7f4dd2bf7700 (LWP 3267)):
5  0x0000000038d8c548 in Poco::Net::StreamSocketImpl::sendBytes (this=0x6040000d8350, buffer=0x7f4d263ec800, length=419, flags=0) at ../contrib/poco/Net/src/StreamSocketImpl.cpp:63
6  0x000000002b9449df in DB::WriteBufferFromPocoSocket::nextImpl (this=<optimized out>) at ../src/IO/WriteBufferFromPocoSocket.cpp:58
7  0x000000002e3a41b8 in DB::WriteBuffer::next (this=0x7f4dd2beb660) at ../src/IO/WriteBuffer.h:50
8  DB::TCPHandler::sendLogData (this=this@entry=0x61a0014d1880, block=...) at ../src/Server/TCPHandler.cpp:1688
9  0x000000002e38e6b0 in DB::TCPHandler::sendLogs (this=0x7f4dd2beb660) at ../src/Server/TCPHandler.cpp:1759
10 0x000000002e38f7f4 in DB::TCPHandler::readDataNext (this=<optimized out>) at ../src/Server/TCPHandler.cpp:583
11 0x000000002e38bdd2 in DB::TCPHandler::processInsertQuery()::$_0::operator()<DB::PushingPipelineExecutor>(DB::PushingPipelineExecutor&) const (executor=..., this=<optimized out>) at ../src/Server/TCPHandler.cpp:640

Thread 1187 (Thread 0x7f4b23a9e700 (LWP 39653)):
5  0x0000000038d8c548 in Poco::Net::StreamSocketImpl::sendBytes (this=0x6040015e2fd0, buffer=0x7f4d24ef7800, length=54, flags=0) at ../contrib/poco/Net/src/StreamSocketImpl.cpp:63
6  0x000000002b9449df in DB::WriteBufferFromPocoSocket::nextImpl (this=<optimized out>) at ../src/IO/WriteBufferFromPocoSocket.cpp:58
7  0x000000002e13a3c3 in DB::WriteBuffer::next (this=0x7f4b23a93380) at ../src/IO/WriteBuffer.h:50
8  DB::Connection::sendData (this=<optimized out>, block=..., name=..., scalar=<optimized out>) at ../src/Client/Connection.cpp:584
9  0x000000002d77c2da in DB::RemoteInserter::write (this=0x61500595c0f8, block=<error reading variable: Cannot access memory at address 0x1>) at ../src/QueryPipeline/RemoteInserter.cpp:110

@azat azat marked this pull request as draft April 25, 2022 15:49
@alexey-milovidov
Copy link
Copy Markdown
Member

@azat let's continue.

@azat azat changed the title Fix possible Distributed INSERT hung (in case of send_logs_level != none) Disable send_logs_level for INSERT into Distributed to avoid possible hung May 25, 2022
@azat azat force-pushed the dist-insert-hung-fix branch 2 times, most recently from 5f7eb94 to 4a064e5 Compare May 25, 2022 17:06
@azat
Copy link
Copy Markdown
Member Author

azat commented May 25, 2022

So, I've thought about this, and there is no way to consume Log packets properly in case of INSERT into Distributed, so I just disable send_logs_level for those queries. Also note, that this does not differs to the user, in what ClickHouse did before, since before it simply does not consume those packets, so the client does not saw those messages anyway.

@azat azat marked this pull request as ready for review May 25, 2022 17:07
@azat
Copy link
Copy Markdown
Member Author

azat commented May 26, 2022

Stateless tests flaky check (address, actions) — Tests are not finished, fail: 0, passed: 106

Tests are not finished likely becase of timeout (30min):

2022-05-25 16:43:47 Using queries from '/usr/share/clickhouse-test/queries' directory
...
2022-05-25 17:09:37 02232_dist_insert_send_logs_level_hung:                                 [ OK ] 251.83 sec.

And another issue due to 10 seconds was not enough for the sever to stop:

Code: 76. DB::Exception: Cannot lock file /var/lib/clickhouse/status. Another server instance in same directory is already running. (CANNOT_OPEN_FILE)

@azat azat requested a review from alexey-milovidov May 31, 2022 12:09
@azat azat force-pushed the dist-insert-hung-fix branch 2 times, most recently from b2751f9 to c00b66d Compare June 7, 2022 14:21
@azat azat force-pushed the dist-insert-hung-fix branch from c00b66d to 100410c Compare June 7, 2022 18:33
@azat
Copy link
Copy Markdown
Member Author

azat commented Jun 8, 2022

Integration tests (asan, actions) [1/3] — fail: 3, passed: 676, flaky: 0

Stateless tests (thread, actions) [1/3] — fail: 1, passed: 1390, skipped: 11

  • 01183_custom_separated_format_http

Stateless tests (thread, actions) [2/3] — fail: 1, passed: 1382, skipped: 18

  • 01746_long_zstd_http_compression_json_format - flaky

Stateless tests flaky check (address, actions) — Tests are not finished, fail: 0, passed: 105

Timeout - OK

@azat
Copy link
Copy Markdown
Member Author

azat commented Jun 13, 2022

PullRequestCI / TestsBugfixCheck (pull_request) Failing after 55m — TestsBugfixCheck

2022-06-07 22:53:30 02232_dist_insert_send_logs_level_hung:                                 [ FAIL ] 620.02 sec. - having stderror: 
...
2022-06-07 22:54:30 Found hung queries in processlist:
2022-06-07 22:54:30         "elapsed": 678.058238597,
2022-06-07 22:54:30         "is_cancelled": 0,
2022-06-07 22:54:30         "is_all_data_sent": 0,
2022-06-07 22:54:30         "query": "insert into function remote('127.2', currentDatabase(), in_02232) select * from numbers(1e6)",
...

OK, but let's add a no-bc tag, to avoid hung queries in stress tests for backward compatibility check

… hung

In case of INSERT into Distributed table with send_logs_level!=none it
is possible to receive tons of Log packets, and w/o consuming it
properly the socket buffer will be full, and eventually the query will
hung.

This happens because receiver will not read data until it will send logs
packets, but sender does not reads those Log packets and so receiver
hung, and hence the sender will hung too, because receiver do not
consume Data packets anymore.

In the initial version of this patch I tried to properly consume Log
packets, but it is not possible to ensure that before writing Data
blocks all Log packets had been consumed, that said that with current
protocol implementation it is not possible to fix Log packets consuming
properly, to avoid deadlock, so send_logs_level had been simply
disabled.

But note, that this does not differs to the user, in what ClickHouse did
before, since before it simply does not consume those packets, so the
client does not saw those messages anyway.

<details>

The receiver:

    Poco::Net::SocketImpl::poll(Poco::Timespan const&, int)
    Poco::Net::SocketImpl::sendBytes(void const*, int, int)
    Poco::Net::StreamSocketImpl::sendBytes(void const*, int, int)
    DB::WriteBufferFromPocoSocket::nextImpl()
    DB::TCPHandler::sendLogData(DB::Block const&)
    DB::TCPHandler::sendLogs()
    DB::TCPHandler::readDataNext()
    DB::TCPHandler::processInsertQuery()

    State      Recv-Q  Send-Q          Local Address:Port         Peer Address:Port Process
    ESTAB      4331792 211637           127.0.0.1:9000            127.0.0.1:24446 users:(("clickhouse-serv",pid=46874,fd=3850))

The sender:

    Poco::Net::SocketImpl::poll(Poco::Timespan const&, int)
    Poco::Net::SocketImpl::sendBytes(void const*, int, int)
    Poco::Net::StreamSocketImpl::sendBytes(void const*, int, int)
    DB::WriteBufferFromPocoSocket::nextImpl()
    DB::WriteBuffer::write(char const*, unsigned long)
    DB::CompressedWriteBuffer::nextImpl()
    DB::WriteBuffer::write(char const*, unsigned long)
    DB::SerializationString::serializeBinaryBulk(DB::IColumn const&, DB::WriteBuffer&, unsigned long, unsigned long) const
    DB::NativeWriter::write(DB::Block const&)
    DB::Connection::sendData(DB::Block const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, bool)
    DB::RemoteInserter::write(DB::Block)
    DB::RemoteSink::consume(DB::Chunk)
    DB::SinkToStorage::onConsume(DB::Chunk)

    State      Recv-Q  Send-Q         Local Address:Port         Peer Address:Port Process
    ESTAB      67883   3008240           127.0.0.1:24446           127.0.0.1:9000  users:(("clickhouse-serv",pid=41610,fd=25))

</details>

v2: rebase to use clickhouse_client_timeout and add clickhouse_test_wait_queries
v3: use KILL QUERY
v4: adjust the test
v5: disable send_logs_level for INSERT into Distributed
v6: add no-backward-compatibility-check tag
Signed-off-by: Azat Khuzhin <[email protected]>
@azat azat force-pushed the dist-insert-hung-fix branch from 100410c to 7210be1 Compare June 13, 2022 10:45
@azat
Copy link
Copy Markdown
Member Author

azat commented Jun 13, 2022

PullRequestCI / TestsBugfixCheck (pull_request) Failing after 57m — TestsBugfixCheck

2022-06-13T11:04:55.8467948Z 2022-06-13 08:04:55 02232_dist_insert_send_logs_level_hung:                                 [ FAIL ] 620.02 sec. - having stderror: 
...
_csv.Error: field larger than field limit (131072)
...
FileNotFoundError: [Errno 2] No such file or directory: '/home/ubuntu/actions-runner/_work/_temp/tests_bugfix_check/stateless/post_commit_status.tsv'

@alexey-milovidov alexey-milovidov merged commit 354a10d into ClickHouse:master Jun 19, 2022
@azat azat deleted the dist-insert-hung-fix branch June 20, 2022 04:38
azat added a commit to azat/ClickHouse that referenced this pull request Jun 22, 2022
Right now RemoteInserter does not read ProfileEvents for INSERT, it
handles them only after sending the query or on finish.

But ClickHouse#37391 sends them for each INSERT block, but sometimes they can be
no ProfileEvents packet, since it sends only non-empty blocks.

And this adds too much complexity, and anyway ProfileEvents are useless
for the server, so let's send them only if the query is initial (i.e.
send by user).

Note, that it is okay to change the logic of sending ProfileEvents w/o
changing DBMS_TCP_PROTOCOL_VERSION, because there were no public
releases with the original patch included yet.

Fixes: ClickHouse#37391
Refs: ClickHouse#35075
Signed-off-by: Azat Khuzhin <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-bugfix Pull request with bugfix, not backported by default

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants