Disable send_logs_level for INSERT into Distributed to avoid possible hung by azat · Pull Request #35075 · ClickHouse/ClickHouse

azat · 2022-03-05T17:09:05Z

Changelog category (leave one):

Bug Fix (user-visible misbehaviour in official stable or prestable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Disable send_logs_level for INSERT into Distributed to avoid possible hung

In case of INSERT into Distributed table with send_logs_level!=none it
is possible to receive tons of Log packets, and w/o consuming it
properly the socket buffer will be full, and eventually the query will
hung.

This happens because receiver will not read data until it will send logs
packets, but sender does not reads those Log packets and so receiver
hung, and hence the sender will hung too, because receiver do not
consume Data packets anymore.

In the initial version of this patch I tried to properly consume Log
packets, but it is not possible to ensure that before writing Data
blocks all Log packets had been consumed, that said that with current
protocol implementation it is not possible to fix Log packets consuming
properly, to avoid deadlock, so send_logs_level had been simply
disabled.

But note, that this does not differs to the user, in what ClickHouse did
before, since before it simply does not consume those packets, so the
client does not saw those messages anyway.

Details

The receiver:

Poco::Net::SocketImpl::poll(Poco::Timespan const&, int)
Poco::Net::SocketImpl::sendBytes(void const*, int, int)
Poco::Net::StreamSocketImpl::sendBytes(void const*, int, int)
DB::WriteBufferFromPocoSocket::nextImpl()
DB::TCPHandler::sendLogData(DB::Block const&)
DB::TCPHandler::sendLogs()
DB::TCPHandler::readDataNext()
DB::TCPHandler::processInsertQuery()

State      Recv-Q  Send-Q          Local Address:Port         Peer Address:Port Process
ESTAB      4331792 211637           127.0.0.1:9000            127.0.0.1:24446 users:(("clickhouse-serv",pid=46874,fd=3850))

The sender:

Poco::Net::SocketImpl::poll(Poco::Timespan const&, int)
Poco::Net::SocketImpl::sendBytes(void const*, int, int)
Poco::Net::StreamSocketImpl::sendBytes(void const*, int, int)
DB::WriteBufferFromPocoSocket::nextImpl()
DB::WriteBuffer::write(char const*, unsigned long)
DB::CompressedWriteBuffer::nextImpl()
DB::WriteBuffer::write(char const*, unsigned long)
DB::SerializationString::serializeBinaryBulk(DB::IColumn const&, DB::WriteBuffer&, unsigned long, unsigned long) const
DB::NativeWriter::write(DB::Block const&)
DB::Connection::sendData(DB::Block const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, bool)
DB::RemoteInserter::write(DB::Block)
DB::RemoteSink::consume(DB::Chunk)
DB::SinkToStorage::onConsume(DB::Chunk)

State      Recv-Q  Send-Q         Local Address:Port         Peer Address:Port Process
ESTAB      67883   3008240           127.0.0.1:24446           127.0.0.1:9000  users:(("clickhouse-serv",pid=41610,fd=25))

alexey-milovidov

LGTM

alexey-milovidov · 2022-03-20T19:28:57Z

@Mergifyio update

mergify · 2022-03-20T19:29:56Z

update

✅ Branch has been successfully updated

alexey-milovidov · 2022-03-25T01:32:46Z

There are failed checks.
@Mergifyio update

alexey-milovidov · 2022-04-01T01:35:10Z

@Mergifyio update

alexey-milovidov · 2022-04-03T02:42:25Z

@azat Checks are failed.

@Mergifyio update

azat · 2022-04-03T17:18:21Z

Stateless tests (debug, actions) [2/3] — fail: 1, passed: 1307, skipped: 18

01091_num_threads - Make some tests more stable #35599

Stateless tests (thread, actions) [2/3] — Some queries hung, fail: 2, passed: 1306, skipped: 18
Stateless tests (ubsan, actions) — fail: 1, passed: 3918, skipped: 19

02149_read_in_order_fixed_prefix - Make some tests more stable #35599
01514_parallel_formatting - looks like a real issue

Stateless tests (thread, actions) [3/3] — fail: 2, passed: 1280, skipped: 7

01605_adaptive_granularity_block_borders
00113_shard_group_array

Stateless tests flaky check (address, actions) — Timeout, fail: 1, passed: 0

02232_dist_insert_send_logs_level_hung - After proper waiting of the query it takes too much...

alexey-milovidov · 2022-04-04T04:01:28Z

@azat bugfix check is saying that the test passed on prev build: https://s3.amazonaws.com/clickhouse-test-reports/35075/98e1e45f860e96edd3bf002c47fa827bd8a61c69/stateless_tests_bugfix_validate_check.html

azat · 2022-04-25T13:36:50Z

@azat bugfix check is saying that the test passed on prev build: https://s3.amazonaws.com/clickhouse-test-reports/35075/98e1e45f860e96edd3bf002c47fa827bd8a61c69/stateless_tests_bugfix_validate_check.html

Yep, the problem was in the test, fixed.

azat · 2022-04-25T15:48:56Z

PullRequestCI / TestsBugfixCheck (pull_request) Failing after 12m — TestsBugfixCheck

Now the test w/o this patch does fails:

2022-04-25T14:48:23.2483089Z 2022-04-25 09:48:23 02232_dist_insert_send_logs_level_hung:                                 [ FAIL ] 600.03 sec. - having stderror: 
2022-04-25T14:48:23.2483588Z 2022-04-25 09:48:23 Timeout dropping database test_bnckem after test

Stateless tests (address, actions) [1/2] — Some queries hung, fail: 1, passed: 2010, skipped: 9

However, the fix is not complete:

"elapsed": 1884.992765112,
"query": "insert into function remote('127.2', currentDatabase(), in_02232) select * from numbers(1e6)",
"thread_ids": [
    "21153",
    "39724",
    "39653"
],

"query": "INSERT INTO test_h88210.in_02232 (key) VALUES",
"thread_ids": [
    "3267",
    "3267"
],

Thread 326 (Thread 0x7f4dd2bf7700 (LWP 3267)):
5  0x0000000038d8c548 in Poco::Net::StreamSocketImpl::sendBytes (this=0x6040000d8350, buffer=0x7f4d263ec800, length=419, flags=0) at ../contrib/poco/Net/src/StreamSocketImpl.cpp:63
6  0x000000002b9449df in DB::WriteBufferFromPocoSocket::nextImpl (this=<optimized out>) at ../src/IO/WriteBufferFromPocoSocket.cpp:58
7  0x000000002e3a41b8 in DB::WriteBuffer::next (this=0x7f4dd2beb660) at ../src/IO/WriteBuffer.h:50
8  DB::TCPHandler::sendLogData (this=this@entry=0x61a0014d1880, block=...) at ../src/Server/TCPHandler.cpp:1688
9  0x000000002e38e6b0 in DB::TCPHandler::sendLogs (this=0x7f4dd2beb660) at ../src/Server/TCPHandler.cpp:1759
10 0x000000002e38f7f4 in DB::TCPHandler::readDataNext (this=<optimized out>) at ../src/Server/TCPHandler.cpp:583
11 0x000000002e38bdd2 in DB::TCPHandler::processInsertQuery()::$_0::operator()<DB::PushingPipelineExecutor>(DB::PushingPipelineExecutor&) const (executor=..., this=<optimized out>) at ../src/Server/TCPHandler.cpp:640

Thread 1187 (Thread 0x7f4b23a9e700 (LWP 39653)):
5  0x0000000038d8c548 in Poco::Net::StreamSocketImpl::sendBytes (this=0x6040015e2fd0, buffer=0x7f4d24ef7800, length=54, flags=0) at ../contrib/poco/Net/src/StreamSocketImpl.cpp:63
6  0x000000002b9449df in DB::WriteBufferFromPocoSocket::nextImpl (this=<optimized out>) at ../src/IO/WriteBufferFromPocoSocket.cpp:58
7  0x000000002e13a3c3 in DB::WriteBuffer::next (this=0x7f4b23a93380) at ../src/IO/WriteBuffer.h:50
8  DB::Connection::sendData (this=<optimized out>, block=..., name=..., scalar=<optimized out>) at ../src/Client/Connection.cpp:584
9  0x000000002d77c2da in DB::RemoteInserter::write (this=0x61500595c0f8, block=<error reading variable: Cannot access memory at address 0x1>) at ../src/QueryPipeline/RemoteInserter.cpp:110

alexey-milovidov · 2022-05-07T22:16:57Z

@azat let's continue.

azat · 2022-05-25T17:07:49Z

So, I've thought about this, and there is no way to consume Log packets properly in case of INSERT into Distributed, so I just disable send_logs_level for those queries. Also note, that this does not differs to the user, in what ClickHouse did before, since before it simply does not consume those packets, so the client does not saw those messages anyway.

azat · 2022-05-26T09:46:51Z

Stateless tests flaky check (address, actions) — Tests are not finished, fail: 0, passed: 106

Tests are not finished likely becase of timeout (30min):

2022-05-25 16:43:47 Using queries from '/usr/share/clickhouse-test/queries' directory
...
2022-05-25 17:09:37 02232_dist_insert_send_logs_level_hung:                                 [ OK ] 251.83 sec.

And another issue due to 10 seconds was not enough for the sever to stop:

Code: 76. DB::Exception: Cannot lock file /var/lib/clickhouse/status. Another server instance in same directory is already running. (CANNOT_OPEN_FILE)

Proper wait of the clickhouse-server in tests #37560

azat · 2022-06-08T04:11:17Z

Integration tests (asan, actions) [1/3] — fail: 3, passed: 676, flaky: 0

test_grpc_protocol/test.py::test_progress - Try fix test_grpc_protocol/test.py::test_progress #37908
test_keeper_zookeeper_converter/test.py::test_simple_crud_requests[False] - Fix keeper converter test #37890
test_keeper_zookeeper_converter/test.py::test_multi_and_failed_requests[False] - Fix keeper converter test #37890

Stateless tests (thread, actions) [1/3] — fail: 1, passed: 1390, skipped: 11

01183_custom_separated_format_http

Stateless tests (thread, actions) [2/3] — fail: 1, passed: 1382, skipped: 18

01746_long_zstd_http_compression_json_format - flaky

Stateless tests flaky check (address, actions) — Tests are not finished, fail: 0, passed: 105

Timeout - OK

azat · 2022-06-13T10:42:09Z

PullRequestCI / TestsBugfixCheck (pull_request) Failing after 55m — TestsBugfixCheck

2022-06-07 22:53:30 02232_dist_insert_send_logs_level_hung:                                 [ FAIL ] 620.02 sec. - having stderror: 
...
2022-06-07 22:54:30 Found hung queries in processlist:
2022-06-07 22:54:30         "elapsed": 678.058238597,
2022-06-07 22:54:30         "is_cancelled": 0,
2022-06-07 22:54:30         "is_all_data_sent": 0,
2022-06-07 22:54:30         "query": "insert into function remote('127.2', currentDatabase(), in_02232) select * from numbers(1e6)",
...

OK, but let's add a no-bc tag, to avoid hung queries in stress tests for backward compatibility check

… hung In case of INSERT into Distributed table with send_logs_level!=none it is possible to receive tons of Log packets, and w/o consuming it properly the socket buffer will be full, and eventually the query will hung. This happens because receiver will not read data until it will send logs packets, but sender does not reads those Log packets and so receiver hung, and hence the sender will hung too, because receiver do not consume Data packets anymore. In the initial version of this patch I tried to properly consume Log packets, but it is not possible to ensure that before writing Data blocks all Log packets had been consumed, that said that with current protocol implementation it is not possible to fix Log packets consuming properly, to avoid deadlock, so send_logs_level had been simply disabled. But note, that this does not differs to the user, in what ClickHouse did before, since before it simply does not consume those packets, so the client does not saw those messages anyway. <details> The receiver: Poco::Net::SocketImpl::poll(Poco::Timespan const&, int) Poco::Net::SocketImpl::sendBytes(void const*, int, int) Poco::Net::StreamSocketImpl::sendBytes(void const*, int, int) DB::WriteBufferFromPocoSocket::nextImpl() DB::TCPHandler::sendLogData(DB::Block const&) DB::TCPHandler::sendLogs() DB::TCPHandler::readDataNext() DB::TCPHandler::processInsertQuery() State Recv-Q Send-Q Local Address:Port Peer Address:Port Process ESTAB 4331792 211637 127.0.0.1:9000 127.0.0.1:24446 users:(("clickhouse-serv",pid=46874,fd=3850)) The sender: Poco::Net::SocketImpl::poll(Poco::Timespan const&, int) Poco::Net::SocketImpl::sendBytes(void const*, int, int) Poco::Net::StreamSocketImpl::sendBytes(void const*, int, int) DB::WriteBufferFromPocoSocket::nextImpl() DB::WriteBuffer::write(char const*, unsigned long) DB::CompressedWriteBuffer::nextImpl() DB::WriteBuffer::write(char const*, unsigned long) DB::SerializationString::serializeBinaryBulk(DB::IColumn const&, DB::WriteBuffer&, unsigned long, unsigned long) const DB::NativeWriter::write(DB::Block const&) DB::Connection::sendData(DB::Block const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, bool) DB::RemoteInserter::write(DB::Block) DB::RemoteSink::consume(DB::Chunk) DB::SinkToStorage::onConsume(DB::Chunk) State Recv-Q Send-Q Local Address:Port Peer Address:Port Process ESTAB 67883 3008240 127.0.0.1:24446 127.0.0.1:9000 users:(("clickhouse-serv",pid=41610,fd=25)) </details> v2: rebase to use clickhouse_client_timeout and add clickhouse_test_wait_queries v3: use KILL QUERY v4: adjust the test v5: disable send_logs_level for INSERT into Distributed v6: add no-backward-compatibility-check tag Signed-off-by: Azat Khuzhin <[email protected]>

azat · 2022-06-13T15:39:15Z

PullRequestCI / TestsBugfixCheck (pull_request) Failing after 57m — TestsBugfixCheck

2022-06-13T11:04:55.8467948Z 2022-06-13 08:04:55 02232_dist_insert_send_logs_level_hung:                                 [ FAIL ] 620.02 sec. - having stderror: 
...
_csv.Error: field larger than field limit (131072)
...
FileNotFoundError: [Errno 2] No such file or directory: '/home/ubuntu/actions-runner/_work/_temp/tests_bugfix_check/stateless/post_commit_status.tsv'

tests: avoid "_csv.Error: field larger than field limit (131072)" error #38030

Right now RemoteInserter does not read ProfileEvents for INSERT, it handles them only after sending the query or on finish. But ClickHouse#37391 sends them for each INSERT block, but sometimes they can be no ProfileEvents packet, since it sends only non-empty blocks. And this adds too much complexity, and anyway ProfileEvents are useless for the server, so let's send them only if the query is initial (i.e. send by user). Note, that it is okay to change the logic of sending ProfileEvents w/o changing DBMS_TCP_PROTOCOL_VERSION, because there were no public releases with the original patch included yet. Fixes: ClickHouse#37391 Refs: ClickHouse#35075 Signed-off-by: Azat Khuzhin <[email protected]>

robot-clickhouse added the pr-bugfix Pull request with bugfix, not backported by default label Mar 5, 2022

alexey-milovidov approved these changes Mar 20, 2022

View reviewed changes

alexey-milovidov self-assigned this Mar 20, 2022

azat force-pushed the dist-insert-hung-fix branch from f3e0550 to 779e7c8 Compare March 26, 2022 20:52

azat marked this pull request as draft April 3, 2022 17:18

azat force-pushed the dist-insert-hung-fix branch from 779e7c8 to 98e1e45 Compare April 3, 2022 20:08

azat marked this pull request as ready for review April 3, 2022 20:10

azat marked this pull request as draft April 4, 2022 19:58

azat force-pushed the dist-insert-hung-fix branch from 98e1e45 to 81c3be8 Compare April 25, 2022 13:36

azat marked this pull request as ready for review April 25, 2022 13:36

azat marked this pull request as draft April 25, 2022 15:49

azat changed the title ~~Fix possible Distributed INSERT hung (in case of send_logs_level != none)~~ Disable send_logs_level for INSERT into Distributed to avoid possible hung May 25, 2022

azat force-pushed the dist-insert-hung-fix branch 2 times, most recently from 5f7eb94 to 4a064e5 Compare May 25, 2022 17:06

azat marked this pull request as ready for review May 25, 2022 17:07

azat requested a review from alexey-milovidov May 31, 2022 12:09

azat force-pushed the dist-insert-hung-fix branch 2 times, most recently from b2751f9 to c00b66d Compare June 7, 2022 14:21

azat force-pushed the dist-insert-hung-fix branch from c00b66d to 100410c Compare June 7, 2022 18:33

azat force-pushed the dist-insert-hung-fix branch from 100410c to 7210be1 Compare June 13, 2022 10:45

alexey-milovidov merged commit 354a10d into ClickHouse:master Jun 19, 2022

azat deleted the dist-insert-hung-fix branch June 20, 2022 04:38

tavplubix mentioned this pull request Jun 20, 2022

Hung check failed on insertion into Distributed #38219

Closed

azat mentioned this pull request Jun 22, 2022

Fix INSERT into Distributed hung due to ProfileEvents #38307

Merged

Conversation

azat commented Mar 5, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alexey-milovidov left a comment

Choose a reason for hiding this comment

Uh oh!

alexey-milovidov commented Mar 20, 2022

Uh oh!

mergify bot commented Mar 20, 2022

✅ Branch has been successfully updated

Uh oh!

alexey-milovidov commented Mar 25, 2022

Uh oh!

alexey-milovidov commented Apr 1, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alexey-milovidov commented Apr 3, 2022

Uh oh!

azat commented Apr 3, 2022

Uh oh!

alexey-milovidov commented Apr 4, 2022

Uh oh!

azat commented Apr 25, 2022

Uh oh!

azat commented Apr 25, 2022

Uh oh!

alexey-milovidov commented May 7, 2022

Uh oh!

azat commented May 25, 2022

Uh oh!

azat commented May 26, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

azat commented Jun 8, 2022

Uh oh!

azat commented Jun 13, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

azat commented Jun 13, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

azat commented Mar 5, 2022 •

edited

Loading

alexey-milovidov commented Apr 1, 2022 •

edited

Loading

azat commented May 26, 2022 •

edited

Loading

azat commented Jun 13, 2022 •

edited

Loading