Skip to content

Conversation

@fjetter
Copy link
Member

@fjetter fjetter commented Oct 7, 2021

We're removing long running tasks from a workers occupancy. However, occasionally we're recalculating the worker occupancy to account for new information since the occupancy gets inaccurate over time. However, this reevaluation does not take into account long-running tasks since the scheduler does not further break the state processing down. Similar for the substate executing, we're tracking this on a per-worker base and we should do the same for the long-running substate

This builds on top of #5392 since I pulled together the occupancy calculation over there.

Fixes of this PR in last commit 9c20b32

Copy link
Collaborator

@gjoseph92 gjoseph92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, though I'm a little curious why the task needs to stay in processing on the WorkerState


@ccall
@exceptval(check=False)
def _reevaluate_occupancy_worker(self, ws: WorkerState):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What was the reason for moving this to be a method instead of a standalone function? I agree methods are more readable, just not sure about the Cython consequences cc @jakirkham

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just readability

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both will work. However there was extra overhead from the method call vs. the function call, which is why it was moved to a function. Given this gets called a lot, that overhead mattered in our benchmarks.

@fjetter fjetter force-pushed the long_running_occupancy branch from ab517ea to a151a4b Compare December 10, 2021 14:38
Comment on lines +316 to +317
self.scheduler._reevaluate_occupancy_worker(thief)
self.scheduler._reevaluate_occupancy_worker(victim)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it looks like _reevaluate_occupancy_worker was removed (even as a method), should we be doing something else here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, I missed this one. I think this is why I put it in as a mathod to avoid cyclic imports

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW did like the idea you had of getting rid of this function if that is an option. Though admittedly there might need to be more thought on this test

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intuitively, it feels like this should not even be here since this interacts pretty deeply with the scheduler and the stealing thing should be an extension and not interact on this deep level. However, the only way to not do this here would be to solely rely on the eventual update via a callback. No idea how disruptive this would be

@fjetter fjetter force-pushed the long_running_occupancy branch from 0eda3c9 to a151a4b Compare December 15, 2021 17:46
@fjetter
Copy link
Member Author

fjetter commented Dec 17, 2021

@fjetter fjetter merged commit 3494c2b into dask:main Dec 17, 2021
@fjetter fjetter deleted the long_running_occupancy branch December 17, 2021 13:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

No parallelism with long-running, seceded tasks; high occupancy prevents work assignment?

3 participants