-
Notifications
You must be signed in to change notification settings - Fork 8.3k
Weirdness with timeouts in StorageDistributed #36541
Copy link
Copy link
Closed
Labels
testingSpecial issue with list of bugs found by CISpecial issue with list of bugs found by CI
Description
The test simple breaks connectivity between two nodes and checks, that connection timeout in StorageDistributed works:
ClickHouse/tests/integration/test_distributed_respect_user_timeouts/test.py
Lines 174 to 187 in 3246261
| with PartitionManager() as pm: | |
| # Break the connection. | |
| pm.partition_instances(*list(NODES.values())) | |
| # Now it shouldn't: | |
| _check_timeout_and_exception(node, first_user, query_base, query) | |
| # Other user should have different timeout and exception | |
| _check_timeout_and_exception( | |
| node, | |
| "default" if first_user != "default" else "ready_to_wait", | |
| query_base, | |
| query, | |
| ) |
But seem slike something went wrong and PartitionManager did not break connectivity completely, so node2 has successfully connected to node1 and sent some data (the following log is from node2):
2022.04.20 18:56:39.566138 [ 224 ] {224b5ee7-ad99-4575-98bb-f7dab9bbfb87} <Trace> Connection (node1:9000): Connecting. Database: (not specified). User: default
2022.04.20 18:56:42.569127 [ 224 ] {224b5ee7-ad99-4575-98bb-f7dab9bbfb87} <Warning> HedgedConnectionsFactory: Connection failed at try №1, reason: Code: 209. DB::NetException: Timeout: connect timed out: 172.16.5.3:9000 (node1:9000, receive timeout 0 ms, send timeout 0 ms). (SOCKET_TIMEOUT) (version 22.4.1.2246)
2022.04.20 18:56:42.569275 [ 224 ] {224b5ee7-ad99-4575-98bb-f7dab9bbfb87} <Trace> Connection (node1:9000): Connecting. Database: (not specified). User: default
2022.04.20 18:56:42.571173 [ 224 ] {224b5ee7-ad99-4575-98bb-f7dab9bbfb87} <Trace> Connection (node1:9000): Connected to ClickHouse server version 22.4.1.
2022.04.20 18:56:42.572354 [ 224 ] {224b5ee7-ad99-4575-98bb-f7dab9bbfb87} <Debug> Connection (node1:9000): Sent data for 2 scalars, total 2 rows in 0.000386693 sec., 5157 rows/sec., 68.00 B (171.22 KiB/sec.), compressed 0.4594594594594595 times to 148.00 B (372.59 KiB/sec.)
Then it hung for ~17 minutes (with no log messages) and finally failed:
2022.04.20 19:13:09.901008 [ 225 ] {224b5ee7-ad99-4575-98bb-f7dab9bbfb87} <Trace> StorageDistributed (distributed_table): () Cancelling query
2022.04.20 19:13:09.919001 [ 10 ] {224b5ee7-ad99-4575-98bb-f7dab9bbfb87} <Error> executeQuery: Code: 209. DB::NetException: Timeout exceeded while reading from socket (172.16.5.3:9000, 300000 ms): while receiving packet from node1:9000: While executing Remote. (SOCKET_TIMEOUT) (version 22.4.1.2246) (from 172.16.5.1:35710) (in query: SELECT node FROM distributed_table ORDER BY node), Stack trace (when copying this message, always include the lines below):
0. ./build_docker/../contrib/libcxx/include/exception:133: Poco::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) @ 0x391486e9 in /usr/bin/clickhouse
1. ./build_docker/../src/Common/Exception.cpp:58: DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, bool) @ 0xd5890e8 in /usr/bin/clickhouse
2. ./build_docker/../src/Common/NetException.h:12: DB::ReadBufferFromPocoSocket::nextImpl() @ 0x2b925dc9 in /usr/bin/clickhouse
3. ./build_docker/../src/IO/ReadBuffer.h:86: void DB::readVarUIntImpl<false>(unsigned long&, DB::ReadBuffer&) @ 0xd65a87e in /usr/bin/clickhouse
4. ./build_docker/../src/IO/VarInt.h:0: DB::Connection::receivePacket() @ 0x2e10b03b in /usr/bin/clickhouse
5. ./build_docker/../src/Client/PacketReceiver.h:0: DB::PacketReceiver::Routine::operator()(boost::context::fiber&&) @ 0x2e157165 in /usr/bin/clickhouse
6. ./build_docker/../contrib/libcxx/include/__utility/swap.h:36: boost::context::detail::fiber_capture_record<boost::context::fiber, FiberStack&, DB::PacketReceiver::Routine>::run() @ 0x2e156321 in /usr/bin/clickhouse
7. ./build_docker/../contrib/boost/boost/context/fiber_ucontext.hpp:74: void boost::context::detail::fiber_entry_func<boost::context::detail::fiber_capture_record<boost::context::fiber, FiberStack&, DB::PacketReceiver::Routine> >(void*) @ 0x2e150eab in /usr/bin/clickhouse
The weird things are:
- How did
node2connected tonode1whenPartitionManagerwas active? Ok, it adds only one rule that drops only packets fromnode1tonode2, but I thought TCP requires some acknowledgment to establish connection. Maybe something wrong withPartitionManager/iptables rules in our integration tests environment. - Why did it take ~17 minutes to get
Timeout exceeded while reading from socketwhen timeout is 300000ms (5 minutes)? Also 17 is not divisible by 5. - Why did
IConnections::dumpAddressesreturn empty string? (StorageDistributed (distributed_table): () Cancelling query)
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
testingSpecial issue with list of bugs found by CISpecial issue with list of bugs found by CI