Skip to content

Conversation

@Carreau
Copy link
Contributor

@Carreau Carreau commented Jun 3, 2020

Looking at the code it was hard to find all the states a worker, nanny
scheduler could be
in, and in particular it is easy to forget 'closing_gracefully' for
example when skimming over a self.status.startswith('clos'), and it
also hard to fine that in some places the status can be "None", "init",
"starting" and "stopped"

So I just took a pass at seeing in how many places status was
assigned/compared in worker, nanny and scheduler and replace that by an
Enum I'm assuming the status each can see is not shared so we end up

I thus tried to do separate Status Enums for Scheduler/Nanny/Worker.

Note that even if Enums have values for their variants, the variants do
not
compare equal to their values. That is to say Status.running =! 'running', but Status.running.value == 'running',

So I implemented a custom __eq__ method that can compare with
strings/None for backward compatibility. It works both when the LHS or
RHS is str/None as the default impl of str/None forward to our
implementation. This also make sure the value you compare with is also
in the list of possible state.

The main other code changes were updating some of the comparisons to use
the enums instead of the strings directly, and replace the "startswith"
with the in operator, which is slightly more verbose.

I believe despite being a bit more verbose, it is a bit more explicit and
may help navigate the codebase and if you want to go this route you
probably want to remove the custom comparison but that would be an API
change that could break other project.

@Carreau Carreau marked this pull request as draft June 3, 2020 23:12
@Carreau
Copy link
Contributor Author

Carreau commented Jun 3, 2020

Mostly an experiment/suggestion, I'm perfectly ok if you are not open to such a change.

This was mostly inspired by trying to figure out if it is possible to "drain" the workers, and wait until they are in a idle state, to try to fix the issue when retiring worker cannot replicate all the keys and fails.

I ended up trying to list all the "status" and where they were used and this helped me a bit.

I think I would also probably be more explicit with "WorkerStatus", "SchedulerStatus" and "NannyStatus", unless some of these states need to be unions of the current ones.

The other routes would have been properties, though I like the explicitness of Enums.

Looking at the code it was hard to find all the states a worker, nanny
scheduler could be
in, and in particular it is easy to forget 'closing_gracefully' for
example when skimming over a `self.status.startswith('clos')`, and it
also hard to fine that in some places the status can be "None", "init",
"starting" and "stopped"

So I just took a pass at seeing in how many places status was
assigned/compared in worker, nanny and scheduler and replace that by an
Enum I'm assuming the status each can see is not shared  so we end up

I thus tried to do separate Status Enums for Scheduler/Nanny/Worker.

Note that even if Enums have values for their variants, the variants _do
not_ compare equal to their values. That is to say `Status.running =!
'running'`, but `Status.running.value == 'running'`,

So I implemented a custom `__eq__` method that can compare with
strings/None for backward compatibility. It works both when the LHS or
RHS is str/None as the default impl of str/None forward to our
implementation. This also make sure the value you compare with is also
in the list of possible state.

The main other code changes were updating some of the comparisons to use
the enums instead of the strings directly, and replace the "startswith"
with the `in` operator, which is slightly more verbose.

I believe despite being a bit more verbose, it is a bit more explicit and
may help navigate the codebase and if you want to go this route you
probably want to remove the custom comparison but that would be an API
change that could break other project.
@Carreau
Copy link
Contributor Author

Carreau commented Jun 4, 2020

Actually it seem from the more extensive testsuite than the one I ran the actually values get compared between Worker and Scheduler. So I'll have to revise this.

@mrocklin
Copy link
Member

mrocklin commented Jun 4, 2020

In principle an approach like this seems sensible to me. It probably makes sense to make things a bit more formal today. This might be a good opportunity for us to think about our statuses more broadly and see if there are opportunities to unify them (it would be nice to have only one Status type).

cc'ing @jcrist , who I think will be made happy by this PR

@jcrist
Copy link
Member

jcrist commented Jun 4, 2020

This might be a good opportunity for us to think about our statuses more broadly and see if there are opportunities to unify them (it would be nice to have only one Status type).

If possible, I agree this would be a good thing to do. The scheduler, worker, and nanny all should have similar states they transition through during execution. The Cluster objects also have a status field, but those may have slightly different states.

@mrocklin
Copy link
Member

mrocklin commented Jun 4, 2020

If possible, I agree this would be a good thing to do. The scheduler, worker, and nanny all should have similar states they transition through during execution. The Cluster objects also have a status field, but those may have slightly different states.

My hope is that we can have one set of states that is consistently used across all components. That would also solve this:

Actually it seem from the more extensive testsuite than the one I ran the actually values get compared between Worker and Scheduler. So I'll have to revise this.

@Carreau
Copy link
Contributor Author

Carreau commented Jun 4, 2020

Thanks, I wasn't sure whether the states were supposed to be shared or at least have a common subset. We Could have a core Enum, and deved enums is my guess (though they will need to not inherit from Enum), and/or restrict the stated on Scheduler/Nany/Worker by making status a property. Let me update that with a single unified status type first and then let's see if we can shave some states or refactor.

Status get compared with each other and we likely want to consolidate
all the status.
@mrocklin
Copy link
Member

mrocklin commented Jun 4, 2020

@lesteve @jacobtomlinson @jcrist thoughts on applying this same change to downstream dask-foo cluster managers?

@jcrist
Copy link
Member

jcrist commented Jun 4, 2020

Dask-Gateway has a status field that's mostly for debouncing close operations - would be fine to use the enum here instead. Dask-Yarn currently has no status field, state is tracked via a couple internal event flags.

@mrocklin
Copy link
Member

mrocklin commented Jun 4, 2020 via email

@Carreau
Copy link
Contributor Author

Carreau commented Jun 4, 2020

A safe first step would be to make all of those status fields properties which convert from strings to the corresponding enum when set and emit a warning, that would be completely backward compatible, and non invasive.

@Carreau
Copy link
Contributor Author

Carreau commented Jun 4, 2020

Which is also relatively easy as they are all subclass of Server.

@Carreau
Copy link
Contributor Author

Carreau commented Jun 5, 2020

Can one of you kick the failing travis build ?

@lesteve
Copy link
Member

lesteve commented Jun 5, 2020

@lesteve @jacobtomlinson @jcrist thoughts on applying this same change to downstream dask-foo cluster managers?

Fine with me, we rely on SpecCluster heavily so that would not affect Dask-Jobqueue clusters too much. A quick git grep shows .status is used twice in all of Dask-Jobqueue (and one of these two occurences happens in a test).

@jacobtomlinson
Copy link
Member

@lesteve @jacobtomlinson @jcrist thoughts on applying this same change to downstream dask-foo cluster managers?

Yeah I support making these changes downstream. And also pulling dask-kubernetes inline.

starting = "starting"
stopped = "stopped"
stopping = "stopping"
undefined = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like closing/closed and stopping/stopped might be redundant. I would be in favor of unifying around one pair of values, presumably closing/closed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a sense for how often users check these values and if there is a way that we should make this change smoothly. I'm inclined to make a small breaking change here, just because I don't think that many people depend on these values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are many ways of handling this, in the properties you can emit a warning if you get a given value assigned tell users to use the Enum, and same in the __eq__ method, if you see comparison with the deprecated values, you warn.

Those warnings can be turned into error in tests.

You can also help refactor with having two variants of the enum map to the same value.

_instances = weakref.WeakSet()
process = None
status = None
status = Status.undefined
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if Status.undefined and Status.init should be unified.

@Carreau
Copy link
Contributor Author

Carreau commented Jun 5, 2020

Do you want to do all the unification as part of this PR, or do you want to get something smaller in master and then bikesheb on which one is renamed how later ?

My guess is that the renaming/dropping is backward incompatible and easier to discuss on its own/revert if necessary.

@mrocklin
Copy link
Member

mrocklin commented Jun 5, 2020

Do you want to do all the unification as part of this PR, or do you want to get something smaller in master and then bikesheb on which one is renamed how later ?

My guess is that the renaming/dropping is backward incompatible and easier to discuss on its own/revert if necessary.

Good point. I'm happy to continue on this PR without these changes.

@Carreau Carreau marked this pull request as ready for review June 5, 2020 18:58
@mrocklin
Copy link
Member

mrocklin commented Jun 5, 2020

I think that there is a release in progress today. Let's let that pass through and then we can merge this and let it sit in master for a while to see if any issues arise.

@Carreau
Copy link
Contributor Author

Carreau commented Jun 5, 2020

No problem, and no hurry.

@mrocklin
Copy link
Member

mrocklin commented Jun 8, 2020

OK, merging this in 24 hours if there are no further comments.

@Carreau
Copy link
Contributor Author

Carreau commented Jun 8, 2020

Thanks, sounds great, once done I can change in more places and start adding warnings.

@jacobtomlinson
Copy link
Member

Once this is in we should open issues on downstream projects to track implementing this there too. This blog post I've submitted should be a good overview of the existing cluster managers.

@mrocklin mrocklin merged commit 1408ab5 into dask:master Jun 9, 2020
@mrocklin
Copy link
Member

mrocklin commented Jun 9, 2020

OK, this is in. I'm curious to see how much activity this generates :)

Carreau added a commit to Carreau/distributed that referenced this pull request Jun 9, 2020
Turn this into an error in the test suite.

We want to make sure of various things:

  1) original behavior of comparing/assign with string still works
  but emit warnings.
  1b) assigning strings convert to proper enum variant.
  2) assign/comparison with invalid strings should fail
  2) warnings are errors in test suite to make sure we don't re-introduce
  strings.

This is the continuation of dask#3853
Carreau added a commit to Carreau/distributed that referenced this pull request Jun 11, 2020
Turn this into an error in the test suite.

We want to make sure of various things:

  1) original behavior of comparing/assign with string still works
  but emit warnings.
  1b) assigning strings convert to proper enum variant.
  2) assign/comparison with invalid strings should fail
  2) warnings are errors in test suite to make sure we don't re-introduce
  strings.

This is the continuation of dask#3853

Maybe Cluster in cluster.py should also get status as a property
Carreau added a commit to Carreau/distributed that referenced this pull request Jun 11, 2020
Turn this into an error in the test suite.

We want to make sure of various things:

  1) original behavior of comparing/assign with string still works
  but emit warnings.
  1b) assigning strings convert to proper enum variant.
  2) assign/comparison with invalid strings should fail
  2) warnings are errors in test suite to make sure we don't re-introduce
  strings.

This is the continuation of dask#3853

Maybe Cluster in cluster.py should also get status as a property
Carreau added a commit to Carreau/distributed that referenced this pull request Jul 24, 2020
Turn this into an error in the test suite.

We want to make sure of various things:

  1) original behavior of comparing/assign with string still works
  but emit warnings.
  1b) assigning strings convert to proper enum variant.
  2) assign/comparison with invalid strings should fail
  2) warnings are errors in test suite to make sure we don't re-introduce
  strings.

This is the continuation of dask#3853

Maybe Cluster in cluster.py should also get status as a property
mrocklin pushed a commit that referenced this pull request Aug 4, 2020
Turn this into an error in the test suite.

We want to make sure of various things:

  1) original behavior of comparing/assign with string still works
  but emit warnings.
  1b) assigning strings convert to proper enum variant.
  2) assign/comparison with invalid strings should fail
  2) warnings are errors in test suite to make sure we don't re-introduce
  strings.

This is the continuation of #3853

Maybe Cluster in cluster.py should also get status as a property
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants