Stop keep alives when worker reconnecting to the scheduler #3493

jacobtomlinson · 2020-02-18T14:01:21Z

Related to #3488.

I don't think this fixes the problem, but I think it should address the CommClosedError exceptions which are seen in the log.

When a worker registers with the scheduler, it starts sending keep-alive messages via a periodic callback. However, that callback never seems to stop if the connection is broken. If the connection hangs for a long period of time, the worker still attempts to send the keep-alive messages.

This PR moves the definitions of the BatchedSend and keep-alive callback to the __init__ and stops the callback during the reconnect. This is already being done for the heartbeat.
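To illustrate the pattern described above, here is a minimal sketch of a periodic keep-alive that is stopped while reconnecting and restarted once registration succeeds. It uses plain asyncio rather than the worker's real callback machinery; the `KeepAlive` class, its methods, and the `demo` coroutine are hypothetical names made up for this example and are not part of distributed.

```python
import asyncio


class KeepAlive:
    """Hypothetical stand-in for a periodic keep-alive callback."""

    def __init__(self, send, interval):
        self._send = send          # callable invoked on every tick
        self._interval = interval  # seconds between ticks
        self._task = None

    def start(self):
        # Idempotent: starting twice does not create a second loop.
        if self._task is None:
            self._task = asyncio.ensure_future(self._run())

    def stop(self):
        # Cancel the background task so no further messages are sent.
        if self._task is not None:
            self._task.cancel()
            self._task = None

    async def _run(self):
        while True:
            await asyncio.sleep(self._interval)
            self._send()


async def demo():
    sent = []
    keep_alive = KeepAlive(lambda: sent.append("keep-alive"), interval=0.01)

    keep_alive.start()           # worker registered with the scheduler
    await asyncio.sleep(0.05)
    keep_alive.stop()            # connection lost: stop before reconnecting
    count_at_stop = len(sent)

    await asyncio.sleep(0.05)    # simulated reconnect delay, nothing is sent
    count_after_wait = len(sent)

    keep_alive.start()           # registration succeeded again
    await asyncio.sleep(0.05)
    keep_alive.stop()
    return count_at_stop, count_after_wait, len(sent)
```

Running `asyncio.run(demo())` returns three counters; the second equals the first because nothing is sent between `stop()` and the next `start()`.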

jacobtomlinson · 2020-02-18T15:35:41Z

I'm not entirely sure what I've broken here. Assistance would be appreciated.

mrocklin · 2020-02-18T16:28:44Z

I'll try to take a look later today

mrocklin · 2020-02-19T00:15:07Z

If you run tests with -s you'll see errors in the logs that look like the following

>       self.batched_stream = BatchedSend(interval="2ms", loop=self.loop)
E       AttributeError: 'Worker' object has no attribute 'loop'

I've pushed a patch moving the construction a bit lower, but on my machine there are still some failures. It might make sense to put this construction back in `start` if there isn't a big reason to move it out.
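As a rough illustration of the ordering problem behind that AttributeError, the sketch below uses hypothetical stand-in classes (`Sender`, `BrokenWorker`, `FixedWorker`), not the real `Worker` or `BatchedSend`. It only shows that an attribute read in `__init__` must already have been assigned.

```python
class Sender:
    """Hypothetical stand-in for an object that needs an event loop."""

    def __init__(self, loop):
        self.loop = loop


class BrokenWorker:
    def __init__(self, loop):
        # self.loop is read before it is assigned, so this raises
        # AttributeError: 'BrokenWorker' object has no attribute 'loop'
        self.sender = Sender(loop=self.loop)
        self.loop = loop


class FixedWorker:
    def __init__(self, loop):
        self.loop = loop
        # Construct the sender only after the loop attribute exists.
        self.sender = Sender(loop=self.loop)


def construction_fails(cls, loop):
    """Return True if building cls(loop) raises AttributeError."""
    try:
        cls(loop)
    except AttributeError:
        return True
    return False
```

Here `construction_fails(BrokenWorker, object())` is true and `construction_fails(FixedWorker, object())` is false.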

jacobtomlinson · 2020-02-19T10:54:21Z

Thanks for pushing to this @mrocklin. The CI seems happy here.

jacobtomlinson added 2 commits February 18, 2020 13:53

Stop keep alives when worker reconnecting to the scheduler

00c9ab8

Move definitions to init

465dadd

jacobtomlinson mentioned this pull request Feb 18, 2020

distributed.comm.core.CommClosedError loop failing to Terminate worker pods #3488

Open

Move BatchedSend construction to after loop is defined

65c64f4

mrocklin merged commit 83f8feb into dask:master Feb 19, 2020

jacobtomlinson deleted the stop-kee-alives branch February 20, 2020 09:12

gjoseph92 mentioned this pull request May 5, 2022

Properly support restarting BatchedSend #5481

Closed

gjoseph92 mentioned this pull request May 20, 2022

Add validation to BatchedSend and convert to asyncio #6389

Open
