Fix connection timed out... #4130
Draft
#4080
I did more digging with the reproducer and identified the failure mode that is causing connection timeouts and failed tasks on our 50-worker cluster.
- 100_000 `who_has` requests are issued within a short period of time.
- These requests go through the `ConnectionPool` of `Worker.scheduler`, which has the default connection limit of 512. With 50 workers, this results in ~25,000 concurrent connection attempts (50 × 512 = 25,600) and saturates the scheduler listener socket.
Logs would typically output:
- Tornado connection attempt - message: `Connect start: <SCHEDULER>` - 170,000 events - Worker side
- Tornado connection success - message: `Connect done` - 16,000 events - Worker side
- Tornado connection accept - message: `On connection` - 13,000 events - Scheduler side
- Dask handshake success - message: `handshake:` - 6,000 events - Worker and Scheduler side
Workers would typically fail tasks or crash due to connection timeouts.
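
To make the ~25,000 figure concrete, here is a minimal sketch of how a per-worker connection limit translates into concurrent connection attempts against the scheduler. It does not use distributed's `ConnectionPool`; an `asyncio.Semaphore` stands in for the pool limit, and `peak_concurrency`, `cluster_peak`, and the per-worker request count of 2,000 are hypothetical names and values chosen for illustration only.

```python
import asyncio


async def peak_concurrency(n_requests: int, limit: int) -> int:
    """Peak number of simultaneous 'connection attempts' when n_requests
    are issued at once through a pool capped at `limit` (a stand-in for
    the connection limit of a worker's pool, not the real ConnectionPool)."""
    sem = asyncio.Semaphore(limit)
    active = 0
    peak = 0

    async def request() -> None:
        nonlocal active, peak
        async with sem:
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0)  # placeholder for connect + who_has round trip
            active -= 1

    await asyncio.gather(*(request() for _ in range(n_requests)))
    return peak


def cluster_peak(n_workers: int, requests_per_worker: int, limit: int) -> int:
    """Upper bound on concurrent attempts seen by the scheduler, assuming
    every worker hits its own limit at the same moment."""
    per_worker = asyncio.run(peak_concurrency(requests_per_worker, limit))
    return n_workers * per_worker


if __name__ == "__main__":
    # 50 workers, default limit of 512; 2_000 requests per worker is an assumption.
    print(cluster_peak(50, 2_000, 512))  # -> 25600
```

Under these assumptions each worker is capped at 512 in-flight attempts, so the limit protects the individual worker but not the scheduler, which sees the sum over all workers.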
I doubt that this is a good solution and do not intend to do further work on this PR. Nevertheless, I believe that this 'fix' helps to shed light on the root cause of the issue.