
Fix distributed requests cancellation with async_socket_for_remote=1 #21643

Merged
KochetovNicolai merged 1 commit into ClickHouse:master from azat:async_socket_for_remote-cancel-fix
Mar 16, 2021

@azat
Member

@azat azat commented Mar 11, 2021

Changelog category (leave one):

  • Bug Fix

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Fix distributed requests cancellation (for example, a simple select from multiple shards with a limit, e.g. select * from remote('127.{2,3}', system.numbers) limit 100) with async_socket_for_remote=1

Detailed description / Documentation draft:
Before this patch, distributed queries that require cancellation
(for example, a simple select from multiple shards with a limit, e.g. select * from remote('127.{2,3}', system.numbers) limit 100) could very easily
hit the situation where a remote shard is in the middle of sending a Data
block while the initiator has already sent Cancel and expects a new
packet. Instead of a new packet, the initiator receives the remainder of
the Data block that was being sent before the cancellation, which
leads to various errors, like:

  • Unknown packet X from server Y
  • Unexpected packet from server Y
  • and a lot more...

Fix this by correctly waiting for the pending packet before
cancellation.
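
The desynchronisation described above can be reproduced in a small standalone simulation. This is only an illustrative sketch, not ClickHouse code: the framing (1-byte type plus 4-byte length), the packet codes, and the `Connection` class with its methods are all assumptions made up for the example. It shows that abandoning a half-read Data block makes the next "header" come from payload bytes, while finishing the pending packet first keeps the stream in sync.

```python
import struct

# Hypothetical packet codes for this sketch; not ClickHouse's real protocol values.
DATA, END_OF_STREAM = 1, 5
KNOWN_TYPES = {DATA, END_OF_STREAM}
HEADER = struct.Struct(">BI")  # assumed framing: 1-byte type, 4-byte payload length


def encode(packet_type, payload=b""):
    """Frame one packet as header + payload."""
    return HEADER.pack(packet_type, len(payload)) + payload


class Connection:
    """Initiator-side view of the bytes a shard has already written."""

    def __init__(self, wire):
        self.wire = wire
        self.pos = 0
        self.pending = 0  # payload bytes of the current packet not yet consumed

    def start_packet(self):
        """Read a packet header at the current position."""
        packet_type, length = HEADER.unpack_from(self.wire, self.pos)
        if packet_type not in KNOWN_TYPES:
            raise ValueError(f"Unknown packet {packet_type} from server")
        self.pos += HEADER.size
        self.pending = length
        return packet_type

    def read_some(self, n):
        """Consume up to n bytes of the current packet's payload."""
        n = min(n, self.pending)
        self.pos += n
        self.pending -= n

    def cancel(self, wait_for_pending):
        """Cancel, then expect a fresh packet from the shard."""
        if wait_for_pending:
            # The fix: finish the in-flight packet before moving on.
            self.read_some(self.pending)
        # Without the wait, the unread payload is simply forgotten.
        self.pending = 0
        return self.start_packet()


if __name__ == "__main__":
    wire = encode(DATA, b"\xff" * 16) + encode(END_OF_STREAM)

    buggy = Connection(wire)
    buggy.start_packet()
    buggy.read_some(4)  # cancellation arrives mid-block
    try:
        buggy.cancel(wait_for_pending=False)
    except ValueError as error:
        print(error)

    fixed = Connection(wire)
    fixed.start_packet()
    fixed.read_some(4)
    print(fixed.cancel(wait_for_pending=True) == END_OF_STREAM)
```

In the buggy path the initiator interprets the `0xff` payload bytes as a packet type, which is the same class of failure as the "Unknown packet X from server" error listed above.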

It is not very easy to write a test, since localhost is too fast.

Also note that it is not possible to get these errors with hedged
requests (use_hedged_requests=1), since they handle fibers correctly.

But hedged requests were disabled by default for 21.3 in #21534, while
async_socket_for_remote is enabled by default.

Cc: @KochetovNicolai (this is just a bug fix patch for backporting; I will take a look at a long-term fix afterwards, which may require some interface changes)

Fixes: #21588

@robot-clickhouse robot-clickhouse added the pr-bugfix Pull request with bugfix, not backported by default label Mar 11, 2021
@azat azat force-pushed the async_socket_for_remote-cancel-fix branch from 5885309 to 65f90f2 Compare March 11, 2021 18:55
@azat
Member Author

azat commented Mar 11, 2021

AST fuzzer (TSan) — Received signal 11

#21646

@azat
Member Author

azat commented Mar 12, 2021

Testflows check — Failed to process results

@vzakaznikov can you take a look please?

1 module (1 ok)
1340 suites (1338 ok, 2 xfail)
4914 scenarios (4346 ok, 568 xfail)
6981 examples (6835 ok, 80 xfail, 66 xerror)
295857 steps (294273 ok, 1443 failed, 66 errored, 75 xfail)

Total time 1h 14m

Executed on Mar 11,2021 23:17
TestFlows.com Open-Source Software Testing Framework v1.6.201216.1172002
error: 'test_type'

@vzakaznikov
Contributor

Looking into it.

@vzakaznikov
Contributor

Try to run CI/CD again. The issue with TestFlows check will be fixed with #21673.

@azat
Member Author

azat commented Mar 12, 2021

Try to run CI/CD again.

I guess there is no need for this; CI will catch it later (if there is anything to catch).

The issue with TestFlows check will be fixed with #21673.

Great, thanks!

@azat
Member Author

azat commented Mar 12, 2021

Integration tests (thread) — fail: 1, passed: 1161, error: 0

#21676

@KochetovNicolai KochetovNicolai merged commit 0ffea30 into ClickHouse:master Mar 16, 2021
robot-clickhouse pushed a commit that referenced this pull request Mar 16, 2021
robot-clickhouse pushed a commit that referenced this pull request Mar 16, 2021
robot-clickhouse pushed a commit that referenced this pull request Mar 16, 2021
@azat azat deleted the async_socket_for_remote-cancel-fix branch March 16, 2021 18:26
KochetovNicolai added a commit that referenced this pull request Mar 18, 2021
Backport #21643 to 21.1: Fix distributed requests cancellation with async_socket_for_remote=1
KochetovNicolai added a commit that referenced this pull request Mar 18, 2021
Backport #21643 to 21.2: Fix distributed requests cancellation with async_socket_for_remote=1
KochetovNicolai added a commit that referenced this pull request Mar 18, 2021
Backport #21643 to 21.3: Fix distributed requests cancellation with async_socket_for_remote=1
azat added a commit to azat/ClickHouse that referenced this pull request Mar 26, 2021
…r_remote=1

In ClickHouse#21643 async_socket_for_remote=1 was fixed to avoid leaving the
connection in an unsynchronised state.

But one should not try to wait for the current packet in case of a timeout,
because this will exceed the timeout.

Anyway, if the timeout is exceeded, the connection will be shut down
(disconnected), so it will not be left in an unsynchronised state.
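
The timeout rule in this follow-up commit message can be sketched as a small decision function. This is a hypothetical illustration only: `ShardConnection`, `cancel_query`, and the injectable `now` clock are names invented for the example and do not correspond to ClickHouse's actual interfaces.

```python
import time


class ShardConnection:
    """Hypothetical stand-in for a connection to one remote shard."""

    def __init__(self):
        self.connected = True
        self.log = []  # records the actions taken, in order

    def wait_for_pending_packet(self):
        self.log.append("drain")

    def send_cancel(self):
        self.log.append("cancel")

    def disconnect(self):
        self.connected = False
        self.log.append("disconnect")


def cancel_query(conn, deadline, now=time.monotonic):
    """Cancel a query on `conn` without waiting past `deadline`."""
    if now() >= deadline:
        # Timeout already exceeded: waiting for the in-flight packet would
        # only exceed it further, and dropping the connection discards any
        # half-read packet anyway.
        conn.disconnect()
        return
    # Otherwise finish the in-flight packet so the stream stays in sync,
    # and only then ask the shard to stop.
    conn.wait_for_pending_packet()
    conn.send_cancel()
```

Passing the clock as a parameter is a design choice of the sketch so that both branches can be exercised deterministically, for example `cancel_query(conn, deadline=10.0, now=lambda: 11.0)` for the timed-out case.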
@azat azat mentioned this pull request Apr 19, 2021
Labels

pr-bugfix Pull request with bugfix, not backported by default

Development

Successfully merging this pull request may close these issues.

Unknown packet XX from server

4 participants