Retry NoMessageException in DataPartsExchange#4837
Retry NoMessageException in DataPartsExchange#4837nvartolomei wants to merge 1 commit intoClickHouse:masterfrom nvartolomei:nv/no-message-received
Conversation
|
Didn't have time to test, feel free to update PR and merge it as it fits. |
|
Another alternative solution would be to make underlying class smarter and retry this error for idempotent requests as most http clients do and use GET for data exchange. |
If a connection from connection pool is idle for too long the server
will try to close it, leaving the socket in CLOSE-WAIT state. As there
is no connection manager that will poll connections to check if they are
still alive there is a chance that we'll get a connection in this state
out of the pool. Writing to a socket in CLOSE-WAIT can succeed, and
Poco's test for keep alive timeout doesn't seem to work correctly.
Solution: retry this error until we succeed. Poco will create new
connections for us.
Alternative solution 1: Try to peek one byte from connection before
returning it from the pool, may look cleaner, but there is still a very
low chance of connection getting closed by the time we try to read from
it.
Alternative solution 2: would be to make underlying class smarter
and retry this error for idempotent requests as most http clients do and
use GET for data exchange.
strace example:
```
218921 sendto(546, "POST /?endpoint=DataPartsExchang"..., 207, 0, NULL, 0 <unfinished ...>
218921 <... sendto resumed> ) = 207
218921 recvfrom(546, <unfinished ...>
218921 <... recvfrom resumed> "", 4096, 0, NULL, NULL) = 0
218921 close(546 <unfinished ...>
218921 <... close resumed> ) = 0
218921 write(2, "2019.03.28 18:01:54.082307 [ 27 "..., 315 <unfinished ...>
```
and log message:
```
2019.03.28 18:01:54.082307 [ 27 ] {} <Error> table_name (StorageReplicatedMergeTree): DB::StorageReplicatedMergeTree::queueTask()::<lambda(DB::StorageReplicatedMergeTree::LogEntryPtr&)>: Poco::Exception. Code: 1000, e.code() = 0, e.displayText() = No message received, e.what() = No message received
```
Closes #4047.
|
@nvartolomei Do you have any ideas how to reproduce this problem in test? |
|
Also I don't understand why this (implemented) solution will help: We will not create new Poco sessions if session for this Isn't it the main function of |
We don't have to. receiveResponse closes the connection on exception. If we get the same http session from the pool. sendRequest creates a new connection if underlying socket is closed or keep alive is timeout is expired.
The pool is filled.
From my understanding keep-alive on server side will keep the connection open for keep-alive timeout, but close it after that. Poco implementation seems fine.
Not really. |
|
One thing that is missing: not respecting the timeouts in this |
|
Our default I performed experiment:
|
|
Don't merge please, I'm investigating the issue. |
|
Better fix #4856. |
|
@alesapin I'd argue that this is still needed (in one form or another) and #4856 is just a hack to make it work for that specific use case. This exception would still be generated if one ClickHouse server for one reason or another has different keep alive settings, or as I mentioned in #4856 because of clock drifts. |
If a connection from connection pool is idle for too long the server
will try to close it, leaving the socket in CLOSE-WAIT state. As there
is no connection manager that will poll connections to check if they are
still alive there is a chance that we'll get a connection in this state
out of the pool. Writing to a socket in CLOSE-WAIT can succeed, and
Poco's test for keep alive timeout doesn't seem to work correctly.
Solution: retry this error until we succeed. Poco will create new
connections for us.
Alternative solution: Try to peek one byte from connection before
returning it from the pool, may look cleaner, but there is still a very
low chance of connection getting closed by the time we try to read from
it.
strace example:
and log message:
Closes #4047.
I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en