Remove worker reconnect #6361

gjoseph92 · 2022-05-18T01:08:10Z

When a worker disconnects from the scheduler, close it immediately instead of trying to reconnect.

Also prohibit workers from joining if they have data in memory, as an alternative to #6341.
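
As a rough illustration of the second point, the toy model below shows a scheduler refusing registration from a worker that reports keys in memory. `ToyScheduler`, `add_worker`, and the returned status dictionaries are hypothetical names invented for this sketch. They are not the actual `distributed` implementation, which may differ in how it detects and reports this case.

```python
from __future__ import annotations


class ToyScheduler:
    """Hypothetical stand-in for a scheduler; not the distributed API."""

    def __init__(self) -> None:
        # Maps worker address -> keys the scheduler believes it holds.
        self.workers: dict[str, set[str]] = {}

    def add_worker(self, address: str, keys_in_memory: set[str]) -> dict[str, str]:
        # Assumption: a worker that already holds data is treated as a
        # reconnecting worker, so it is rejected instead of merged back in.
        if keys_in_memory:
            return {
                "status": "error",
                "message": f"Worker {address} connected with "
                f"{len(keys_in_memory)} key(s) in memory",
            }
        self.workers[address] = set()
        return {"status": "OK"}


scheduler = ToyScheduler()
fresh = scheduler.add_worker("tcp://10.0.0.1:1234", set())
stale = scheduler.add_worker("tcp://10.0.0.2:1234", {"x"})
```

In this sketch only the fresh worker ends up in `scheduler.workers`; the stale one receives an error status and would be expected to shut down.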

Just seeing what CI does for now, to figure out which tests to change/remove.

Tests added / passed
Passes pre-commit run --all-files

github-actions · 2022-05-18T02:09:07Z

Unit Test Results

15 files ±0, 15 suites ±0, 7h 7m 11s ⏱️ -5m 36s
2 795 tests -6: 2 714 ✔️ -7, 79 💤 ±0, 2 ❌ +1
20 725 runs -45: 19 803 ✔️ -45, 920 💤 -1, 2 ❌ +1

For more details on these failures, see this check.

Results for commit 305a23f. ± Comparison against base commit ff94776.

♻️ This comment has been updated with latest results.

gjoseph92 · 2022-05-18T21:17:32Z

Everything is green except one flaky test_stress_scatter_death #6305, and a single distributed/tests/test_client.py::test_restart_timeout_is_logged on Windows is leaking a thread:

E               AssertionError: (<Thread(AsyncProcess Dask Worker process (from Nanny) watch process join, started daemon 5744)>, ['  File "C:\\Minico...rocessing\\popen_spawn_win32.py", line 108, in wait
E                 \tres = _winapi.WaitForSingleObject(int(self._handle), msecs)
E                 '])
E               assert False

That feels flaky to me, but I guess it's possible this is related?

gjoseph92 · 2022-05-18T21:48:08Z

distributed/scheduler.py

+                # Remove suspicious workers from the scheduler and shut them down.
                await asyncio.gather(
                    *(
                        self.remove_worker(


This is maybe heavy-handed. retire_workers would probably be less disruptive, but slower and a little more complex. I don't love losing all the other keys on a worker just because it's missing one.

mrocklin

Incomplete review, but I'm stuck in meetings for the next bit

mrocklin · 2022-05-19T15:09:03Z

distributed/scheduler.py

+                # Remove suspicious workers from the scheduler and shut them down.
                await asyncio.gather(
                    *(
                        self.remove_worker(


I don't mind being heavy-handed, especially if things are in a wonky state.
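
The trade-off discussed in this thread can be sketched with a toy model. The `ToyScheduler` class and its `remove_suspicious` method are hypothetical and exist only for illustration; only the `asyncio.gather(*(... remove_worker ...))` shape is taken from the diff above. The sketch shows that removing a worker outright drops every key the scheduler believed it held, not just the one that turned out to be missing.

```python
import asyncio


class ToyScheduler:
    """Hypothetical model: maps each worker address to the keys it holds."""

    def __init__(self, workers):
        self.workers = {addr: set(keys) for addr, keys in workers.items()}

    async def remove_worker(self, address):
        # Removing a worker drops every key it held, not only the missing one.
        await asyncio.sleep(0)  # stand-in for network I/O
        return self.workers.pop(address, set())

    async def remove_suspicious(self, addresses):
        # Run all removals concurrently, mirroring the gather(...) in the diff.
        lost = await asyncio.gather(*(self.remove_worker(a) for a in addresses))
        return set().union(*lost)


async def main():
    # The scheduler believes "a" holds x and y; suppose x turned out to be
    # missing when a client tried to gather it.
    s = ToyScheduler({"a": {"x", "y"}, "b": {"z"}})
    lost = await s.remove_suspicious(["a"])
    return s.workers, lost


remaining, lost_keys = asyncio.run(main())
```

Here `y` is lost along with `x` even though only `x` was missing, which is the cost of the heavy-handed approach. A `retire_workers`-style path would presumably move `y` elsewhere first, at the price of being slower.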

mrocklin · 2022-05-19T17:56:51Z

In general I'm ok with what's here

gjoseph92

One thing we discussed offline was whether the Nanny should restart the worker in cases when the connection to the scheduler is broken unexpectedly. The answer is probably yes, but as this PR stands, Worker.close will always tell the Nanny it's closing gracefully, preventing restart:

distributed/distributed/worker.py

Lines 1475 to 1477 in ff94776

    
        if nanny and self.nanny:
            with self.rpc(self.nanny) as r:
                await r.close_gracefully()

As a follow-up PR, we should add an option to Worker.close to control this behavior.

mrocklin · 2022-05-19T19:16:29Z

> One thing we discussed offline was whether the Nanny should restart the worker in cases when the connection to the scheduler is broken unexpectedly. The answer is probably yes, but as this PR stands, Worker.close will always tell the Nanny it's closing gracefully, preventing restart:
>
> As a follow-up PR, we should add an option to Worker.close to control this behavior

I may not be understanding the statement above, but we already have the nanny= option to Worker.close, right? That's the code snippet that you've highlighted. I think the question is if there is a specific call of Worker.close when we want to pass nanny=False. Agreed?

gjoseph92 · 2022-05-19T19:36:45Z

> we already have the nanny= option to Worker.close

Great point, I just missed that. Yes, we can set nanny=False in these cases here. I'll do that.

gjoseph92 · 2022-05-19T21:16:03Z

@mrocklin actually, I think maybe we should do this in a follow-up PR. It's pretty straightforward, but involves a little bit of thinking and a few more tests (the nanny is going to try to report the worker closure to the scheduler at the same time the worker's connection is breaking from it shutting down).

Easy enough, but I'd rather focus on the core change here and not increase the scope.
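
For readers following the `nanny=` discussion, here is a minimal toy model of the behavior being described. `ToyNanny`, `ToyWorker`, `on_worker_exit`, and the `restarts` counter are hypothetical names made up for this sketch; the real `Worker.close` talks to the Nanny over RPC and the real Nanny supervises a separate process. The only part taken from the thread is the `if nanny and self.nanny: ... close_gracefully()` shape.

```python
import asyncio


class ToyNanny:
    """Hypothetical nanny that restarts its worker unless told not to."""

    def __init__(self):
        self.restart_on_exit = True
        self.restarts = 0

    async def close_gracefully(self):
        # The worker announces an intentional shutdown: do not restart it.
        self.restart_on_exit = False

    async def on_worker_exit(self):
        # Called when the worker goes away; restart only if not graceful.
        if self.restart_on_exit:
            self.restarts += 1


class ToyWorker:
    def __init__(self, nanny=None):
        self.nanny = nanny

    async def close(self, nanny=True):
        # Same shape as the quoted snippet: only notify the nanny when asked.
        if nanny and self.nanny:
            await self.nanny.close_gracefully()
        if self.nanny:
            await self.nanny.on_worker_exit()


async def main():
    graceful, broken = ToyNanny(), ToyNanny()
    await ToyWorker(graceful).close()           # default: nanny=True
    await ToyWorker(broken).close(nanny=False)  # e.g. scheduler comm broke
    return graceful.restarts, broken.restarts


graceful_restarts, broken_restarts = asyncio.run(main())
```

In this model, closing with the default leaves the nanny idle, while `nanny=False` lets it restart the worker. The race mentioned above (the nanny reporting the closure while the worker's connection is breaking) is not modeled here.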

`reconnect=True` (previous default) is now the only option. This is not a necessary change to make. It just simplifies things to not have it. See discussion in dask#6361 (comment).

gjoseph92 · 2022-05-20T02:02:17Z

Failures (both seem flaky):

flaky test_nanny_worker_port_range #6045 https://github.com/dask/distributed/runs/6516866438?check_suite_focus=true#step:11:1923
distributed/cli/tests/test_dask_scheduler.py::test_preload_config https://github.com/dask/distributed/runs/6516866324?check_suite_focus=true#step:11:1747

gjoseph92 · 2022-05-20T02:03:06Z

@fjetter @mrocklin I believe this is ready for final review. Follow-ups identified in #6384.

mrocklin · 2022-05-20T12:26:59Z

Thanks @gjoseph92. This is in.

gjoseph92 force-pushed the remove-reconnect branch from 3ce0c96 to c708ef0 Compare May 18, 2022 17:32

gjoseph92 added 14 commits May 18, 2022 14:03

Remove reconnect on worker side

e3aa214

Prohibit reconnect on scheduler

6bfebdc

Don't pass reconnect from Nanny

ef3ae02

The Nanny can still restart the worker process.

Close bad workers in gather instead of reconnect

6dad82d

Test worker close on reconnect

225bbad

fix distributed/tests/test_client.py::test_reconnect

3c9cd15

update distributed/tests/test_scheduler.py::test_worker_arrives_with_processing_data

bc908bd

remove test_worker_breaks_and_returns

115d61b

remove test_no_workers_to_memory, `test_no_worker_to_memory_restrictions`

0442aa2

fix test_close_on_disconnect

eddf8dc

fix test_heartbeat_comm_closed

0b7402a

remove test_worker_reconnects_mid_compute and test_worker_reconnects_mid_compute_multiple_states_on_scheduler

223b942

remove test_no_reconnect and test_reconnect

953cc8c

remove test_worker_reconnect_task_memory, `test_worker_reconnect_task_memory_with_resources`

7c769cd

gjoseph92 force-pushed the remove-reconnect branch from 821b99e to 7c769cd Compare May 18, 2022 20:05

fixup! Test worker close on reconnect

ac00495

gjoseph92 commented May 18, 2022

mrocklin reviewed May 19, 2022

fjetter mentioned this pull request May 19, 2022

update_who_has can remove workers #6342

Merged

gjoseph92 commented May 19, 2022

gjoseph92 added 3 commits May 19, 2022 13:27

add test_shutdown_on_scheduler_comm_closed

eaa0f74

test_new_worker_with_data_rejected

807c57a

Also make worker log whatever error the scheduler gives it upon connection

test_reconnect could still be flaky

99b52b2

Deprecation warning for --reconnect/--no-reconnect

ba78fa2

gjoseph92 mentioned this pull request May 19, 2022

Remove reconnect argument from Nanny #6383

Draft

2 tasks

gjoseph92 added a commit to gjoseph92/distributed that referenced this pull request May 19, 2022

don't pass to worker

1996f81

Unnecessary after dask#6361 is merged

fix test_reconnect_argument_deprecated

305a23f

was leaking file descriptors, and needed an event loop

This was referenced May 20, 2022

Worker reconnect removal follow-ups #6384

Open

Consider connection-failure worker closures as safe? #6386

Open

Restart worker via Nanny on connection failure #6387

Open

Worker addresses are treated as unique identifiers, but may not be #6392

Open

gjoseph92 marked this pull request as ready for review May 20, 2022 02:02

mrocklin merged commit f669f06 into dask:main May 20, 2022

fjetter mentioned this pull request May 24, 2022

Scheduler worker reconnect drops messages #6341

Closed

2 tasks

gjoseph92 deleted the remove-reconnect branch May 24, 2022 20:00

This was referenced May 25, 2022

Make BatchedSend restartable #6329

Closed

Flaky test test_worker_reconnects_mid_compute* #5621

Closed

This was referenced Jun 7, 2022

Deadlock: all keys in memory, but Futures not done #6285

Closed

Fix Scheduler.restart logic #6504

Merged

Flaky test_quiet_client_close #6540

Open

Flaky test_AllProgress #6550

Open

fjetter mentioned this pull request Jun 28, 2022

Scheduler.reschedule() does not work #6340

Closed

gjoseph92 mentioned this pull request Oct 27, 2022

Investigate and remove unusual scheduler transitions to memory #7210

Closed

fjetter mentioned this pull request Mar 3, 2023

Resilience #1072

Open
