
Retry NoMessageException in DataPartsExchange #4837

Closed
nvartolomei wants to merge 1 commit into ClickHouse:master from nvartolomei:nv/no-message-received

Conversation

@nvartolomei (Contributor) commented Mar 28, 2019

If a connection from the connection pool is idle for too long, the server
will close it, leaving the socket in the CLOSE-WAIT state. Since there is
no connection manager that polls connections to check whether they are
still alive, there is a chance that we'll get a connection in this state
out of the pool. Writing to a socket in CLOSE-WAIT can succeed, and
Poco's keep-alive timeout check doesn't seem to work correctly.

Solution: retry on this error until we succeed; Poco will create new
connections for us.
Alternative solution: peek one byte from the connection before returning
it from the pool. That may look cleaner, but there is still a small
chance of the connection being closed between the check and the moment we
try to read from it.

strace example:

```
218921 sendto(546, "POST /?endpoint=DataPartsExchang"..., 207, 0, NULL, 0 <unfinished ...>
218921 <... sendto resumed> )           = 207
218921 recvfrom(546,  <unfinished ...>
218921 <... recvfrom resumed> "", 4096, 0, NULL, NULL) = 0
218921 close(546 <unfinished ...>
218921 <... close resumed> )            = 0
218921 write(2, "2019.03.28 18:01:54.082307 [ 27 "..., 315 <unfinished ...>
```

and log message:

```
2019.03.28 18:01:54.082307 [ 27 ] {} <Error> table_name (StorageReplicatedMergeTree): DB::StorageReplicatedMergeTree::queueTask()::<lambda(DB::StorageReplicatedMergeTree::LogEntryPtr&)>: Poco::Exception. Code: 1000, e.code() = 0, e.displayText() = No message received, e.what() = No message received
```

Closes #4047.

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

@nvartolomei (Contributor, Author)

Didn't have time to test; feel free to update the PR and merge it as you see fit.

@nvartolomei (Contributor, Author)

Another alternative solution would be to make the underlying class smarter: retry this error for idempotent requests, as most HTTP clients do, and use GET for the data exchange.

@alesapin (Member)

@nvartolomei Do you have any ideas how to reproduce this problem in a test?

@alesapin (Member) commented Mar 29, 2019

Also, I don't understand why this (implemented) solution will help:

> retry this error until we succeed. Poco will create new connections for us.

We will not create new Poco sessions if a session for this scheme://URI:port is already in the pool: https://github.com/yandex/ClickHouse/blob/master/dbms/src/IO/HTTPCommon.cpp#L149-L153.
Maybe you observed a case when the pool was not completely filled?

> If a connection from the connection pool is idle for too long, the server will close it, leaving the socket in the CLOSE-WAIT state.

Isn't it the main function of HTTP keep-alive to avoid such situations? Is Poco's implementation completely wrong?

@nvartolomei (Contributor, Author)

> We will not create new Poco sessions if a session for this scheme://URI:port is already in the pool

We don't have to. receiveResponse closes the connection on exception.

If we then get the same HTTP session from the pool, sendRequest creates a new connection because the underlying socket is closed or the keep-alive timeout has expired.

> Maybe you observed a case when the pool was not completely filled?

The pool is filled.

> Isn't it the main function of HTTP keep-alive to avoid such situations? Is Poco's implementation completely wrong?

From my understanding, keep-alive on the server side keeps the connection open for the keep-alive timeout, but closes it after that. Poco's implementation seems fine.

> Do you have any ideas how to reproduce this problem in a test?

Not really.

@nvartolomei (Contributor, Author)

One thing that is missing: the timeouts are not respected in this `for (;;)` loop. If for some reason the server goes rogue, we won't get a clear error in the logs and in the replication_queue table.

@alesapin (Member) commented Mar 29, 2019

> From my understanding, keep-alive on the server side keeps the connection open for the keep-alive timeout, but closes it after that. Poco's implementation seems fine.

Our default keep_alive_timeout is 3 seconds (https://github.com/yandex/ClickHouse/blob/master/dbms/programs/server/config.xml#L88) for the Interserver handler (https://github.com/yandex/ClickHouse/blob/master/dbms/programs/server/InterserverIOHTTPHandler.cpp#L65). If we reuse a session from the pool that has been idle for more than 3 seconds, we should receive the "No message received" error.

I performed an experiment:

1. Added debug prints to the pool, just to be sure that we reuse the session:

```
--- a/dbms/src/Common/PoolBase.h
+++ b/dbms/src/Common/PoolBase.h
@@ -113,11 +113,15 @@ public:
         {
             for (auto & item : items)
                 if (!item->in_use)
+                {
+                    LOG_INFO(&Logger::get("POOL"), "reusing old session");
                     return Entry(*item);
+                }

             if (items.size() < max_items)
             {
                 ObjectPtr object = allocObject();
+                LOG_INFO(&Logger::get("POOL"), "allocating new session");
                 items.emplace_back(std::make_shared<PooledObject>(object, *this));
                 return Entry(*items.back());
             }
```

2. Set up a replicated table on two clickhouse-servers.
3. Filled the pool with inserts.
4. Slept for 5 minutes.
5. Performed a single insert on one of the nodes. Log:

```
2019.03.29 12:45:58.051495 [ 18 ] {} <Trace> default.test_table (Data): Renaming temporary part tmp_merge_20170616_0_99_16 to 20170616_0_99_16.
2019.03.29 12:45:58.051597 [ 18 ] {} <Trace> default.test_table (MergerMutator): Merged 5 parts: from 20170616_0_95_15 to 20170616_99_99_0
2019.03.29 12:50:43.702870 [ 38 ] {} <Debug> default.test_table (ReplicatedMergeTreeQueue): Pulling 1 entries to queue: log-0000000117 - log-0000000117
2019.03.29 12:50:43.704649 [ 38 ] {} <Debug> default.test_table (ReplicatedMergeTreeQueue): Pulled 1 entries to queue.
2019.03.29 12:50:43.706060 [ 10 ] {} <Debug> default.test_table (StorageReplicatedMergeTree): Fetching part 20170617_0_0_0 from /clickhouse/tables/test2/replicated/replicas/node3
2019.03.29 12:50:43.706506 [ 10 ] {} <Information> POOL: reusing old session
2019.03.29 12:50:43.706556 [ 10 ] {} <Trace> ReadWriteBufferFromHTTP: Sending request to http://node3:9009/?endpoint=DataPartsExchange%3A%2Fclickhouse%2Ftables%2Ftest2%2Freplicated%2Freplicas%2Fnode3&part=20170617_0_0_0&compress=false
2019.03.29 12:50:43.721914 [ 10 ] {} <Trace> default.test_table (Data): Renaming temporary part tmp_fetch_20170617_0_0_0 to 20170617_0_0_0.
2019.03.29 12:50:43.724646 [ 10 ] {} <Debug> default.test_table (StorageReplicatedMergeTree): Fetched part 20170617_0_0_0 from /clickhouse/tables/test2/replicated/replicas/node3
2019.03.29 12:50:53.804169 [ 46 ] {} <Trace> TCPHandlerFactory: TCP Request. Address: 172.24.0.1:36346
2019.03.29 12:50:53.804698 [ 46 ] {} <Debug> TCPHandler: Connected ClickHouse client version 19.5.0, revision: 54417, user: default.
```

@alesapin (Member) commented Mar 29, 2019

Don't merge please, I'm investigating the issue.

@alesapin (Member)

Better fix: #4856.

@nvartolomei (Contributor, Author)

@alesapin I'd argue that this is still needed (in one form or another) and that #4856 is just a hack that makes it work for that specific use case. This exception would still be raised if one ClickHouse server, for one reason or another, has different keep-alive settings, or, as I mentioned in #4856, because of clock drift.
