Fix distributed requests cancellation with async_socket_for_remote=1 #21643
Merged
KochetovNicolai merged 1 commit into ClickHouse:master on Mar 16, 2021
Before this patch, for distributed queries that require cancellation
(a simple select from multiple shards with limit, e.g. `select * from
remote('127.{2,3}', system.numbers) limit 100`), it was very easy to
trigger a situation where the remote shard is in the middle of sending
a Data block while the initiator has already sent Cancel and expects a
new packet. Instead of a new packet, the initiator receives the rest of
the Data block that was being sent before cancellation, which leads to
various errors, like:
- Unknown packet X from server Y
- Unexpected packet from server Y
- and a lot more...
Fix this by waiting for the pending packet to be received in full
before cancellation.
It is not very easy to write a test, since localhost is too fast.
Also note that these errors cannot occur with hedged requests
(use_hedged_requests=1), since that code path handles fibers correctly.
However, hedged requests were disabled by default for 21.3 in
ClickHouse#21534, while async_socket_for_remote is enabled by default.
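The desynchronisation described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the framing (1-byte type plus 4-byte length), the packet type numbers, and the `Connection` class with its methods are all hypothetical and are not ClickHouse's wire format or API. It shows why parsing a "new" packet from the middle of a Data block fails, and why finishing the pending packet first keeps the stream aligned.

```python
import io
import struct

# Hypothetical packet types for this toy protocol (not ClickHouse's wire format).
DATA, END_OF_STREAM = 1, 5
KNOWN_TYPES = {DATA, END_OF_STREAM}
HEADER = struct.Struct("<BI")  # 1-byte type, 4-byte payload length


def encode(packet_type: int, payload: bytes = b"") -> bytes:
    """Serialise one packet as header + payload."""
    return HEADER.pack(packet_type, len(payload)) + payload


class Connection:
    """Initiator side of a connection whose read may be interrupted mid-packet."""

    def __init__(self, wire: bytes):
        self.stream = io.BytesIO(wire)
        self.pending = 0  # payload bytes of the current packet not yet read

    def start_packet(self, read_now: int) -> int:
        """Read a header and only `read_now` payload bytes (a partial async read)."""
        packet_type, length = HEADER.unpack(self.stream.read(HEADER.size))
        if packet_type not in KNOWN_TYPES:
            raise ValueError(f"Unknown packet {packet_type} from server")
        consumed = min(read_now, length)
        self.stream.read(consumed)
        self.pending = length - consumed
        return packet_type

    def drain_pending(self) -> None:
        """Finish the in-flight packet so the stream is back at a packet boundary."""
        self.stream.read(self.pending)
        self.pending = 0

    def cancel(self, wait_for_pending: bool) -> int:
        """Cancel the query and return the type of the next packet read."""
        if wait_for_pending:
            self.drain_pending()
        else:
            self.pending = 0  # bug: the in-flight packet is forgotten
        return self.start_packet(read_now=0)


if __name__ == "__main__":
    wire = encode(DATA, b"\xff" * 16) + encode(END_OF_STREAM)
    conn = Connection(wire)
    conn.start_packet(read_now=4)  # cancellation arrives mid Data block
    print(conn.cancel(wait_for_pending=True))
```

With `wait_for_pending=False` the same sequence raises `ValueError`, because payload bytes of the unfinished Data block are interpreted as a packet header.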
Force-pushed from 5885309 to 65f90f2
Member
Author
@vzakaznikov can you take a look please?
Contributor
Looking into it.
Contributor
Try to run CI/CD again. The issue with the TestFlows check will be fixed with #21673.
Member
Author
I guess there is no need for this; CI will catch it later (if any).
Great, thanks!
KochetovNicolai approved these changes on Mar 16, 2021
This was referenced Mar 16, 2021
robot-clickhouse pushed a commit that referenced this pull request on Mar 16, 2021: …sync_socket_for_remote=1
robot-clickhouse pushed a commit that referenced this pull request on Mar 16, 2021: …sync_socket_for_remote=1
robot-clickhouse pushed a commit that referenced this pull request on Mar 16, 2021: …sync_socket_for_remote=1
KochetovNicolai added a commit that referenced this pull request on Mar 18, 2021: Backport #21643 to 21.1: Fix distributed requests cancellation with async_socket_for_remote=1
KochetovNicolai added a commit that referenced this pull request on Mar 18, 2021: Backport #21643 to 21.2: Fix distributed requests cancellation with async_socket_for_remote=1
KochetovNicolai added a commit that referenced this pull request on Mar 18, 2021: Backport #21643 to 21.3: Fix distributed requests cancellation with async_socket_for_remote=1
azat added a commit to azat/ClickHouse that referenced this pull request on Mar 26, 2021:
…r_remote=1
In ClickHouse#21643, async_socket_for_remote=1 was fixed to avoid leaving the connection in an unsynchronised state. However, one should not wait for the current packet in case of a timeout, because doing so would exceed the timeout. In any case, if the timeout is exceeded the connection is shut down (disconnected), so it will not be left in an unsynchronised state.
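The decision described in this follow-up commit can be sketched as a small branch on the reason for stopping. This is a hedged illustration only: `StopReason`, `FakeConnection`, and `stop_reading` are hypothetical names invented here and do not correspond to ClickHouse classes. The point is that draining the in-flight packet is appropriate for cancellation, while on timeout the connection is dropped instead, so no further waiting happens.

```python
from dataclasses import dataclass
from enum import Enum, auto


class StopReason(Enum):
    CANCELLED = auto()
    TIMEOUT = auto()


@dataclass
class FakeConnection:
    """Hypothetical stand-in for a shard connection; records what was done to it."""

    pending_bytes: int = 0
    connected: bool = True
    drained: int = 0

    def drain_pending(self) -> None:
        # In a real client this would block until the in-flight packet arrives.
        self.drained += self.pending_bytes
        self.pending_bytes = 0

    def disconnect(self) -> None:
        # Anything still in flight is discarded together with the socket.
        self.connected = False
        self.pending_bytes = 0


def stop_reading(conn: FakeConnection, reason: StopReason) -> None:
    """Leave the connection either synchronised (cancel) or closed (timeout)."""
    if reason is StopReason.TIMEOUT:
        # Waiting for the rest of the packet would exceed the timeout.
        conn.disconnect()
    else:
        # Finish the pending packet so the stream stays at a packet boundary.
        conn.drain_pending()
```

Either way the connection is never reused in an unsynchronised state: it is either aligned on a packet boundary or closed.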
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Fix distributed requests cancellation (for example simple select from multiple shards with limit, i.e.
`select * from remote('127.{2,3}', system.numbers) limit 100`) with `async_socket_for_remote=1`
Detailed description / Documentation draft:
Before this patch, for distributed queries that require cancellation
(a simple select from multiple shards with limit, e.g.
`select * from remote('127.{2,3}', system.numbers) limit 100`), it was very easy to
trigger a situation where the remote shard is in the middle of sending
a Data block while the initiator has already sent Cancel and expects a
new packet. Instead of a new packet, the initiator receives the rest of
the Data block that was being sent before cancellation, which leads to
various errors, like:
- Unknown packet X from server Y
- Unexpected packet from server Y
- and a lot more...
Fix this by waiting for the pending packet to be received in full
before cancellation.
It is not very easy to write a test, since localhost is too fast.
Also note that these errors cannot occur with hedged requests
(use_hedged_requests=1), since that code path handles fibers correctly.
However, hedged requests were disabled by default for 21.3 in #21534,
while async_socket_for_remote is enabled by default.
Cc: @KochetovNicolai (this is just a bug-fix patch for backporting; I will look at a long-term fix afterwards, which may require some interface changes)
Fixes: #21588