Skip to content

Conversation

@jacobtomlinson
Copy link
Member

Related to #3488.

I don't think this fixes the problem, but I think it should address the CommClosedError exceptions which are seen in the log.

When a worker registers with the scheduler it starts sending keepalive messages via a periodic callback. However that never seems to stop if the connection is broken. If the connection hangs for a long period of time the worker still attempts to send the keepalive messages.

This PR moves the definitions of the BatchedSend and keep-alive callback to the __init__ and stops the callback during the reconnect. This is already being done for the heartbeat.

@jacobtomlinson
Copy link
Member Author

I'm not entirely sure what I've broken here. Assistance would be appreciated.

@mrocklin
Copy link
Member

I'll try to take a look later today

@mrocklin
Copy link
Member

If you run tests with -s you'll see errors in the logs that look like the following

>       self.batched_stream = BatchedSend(interval="2ms", loop=self.loop)
E       AttributeError: 'Worker' object has no attribute 'loop'

I've pushed a patch moving the construction a bit lower, but on my machine there are still some failures. It might make sense to put this construction back in `start`` if there isn't a big reason to move it out.

@jacobtomlinson
Copy link
Member Author

Thanks for pushing to this @mrocklin. The CI seems happy here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants