
Conversation

@fjetter
Member

@fjetter fjetter commented Jan 15, 2021

This is a more minimalistic version of #4360 with fixes I am pretty confident about. I can still see local issues, though, and this is incomplete.

@mrocklin
Member

I notice that this is still in WIP status. Are the changes here an improvement over what is in master? If so, I might press the "Ready for review" button.

@gforsyth
Contributor

I'll note that the failures in test_failed_workers.py::test_worker_who_has_clears_after_failed_connection are connected to work stealing -- if stealing is disabled, the assertion errors go away.
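
A minimal sketch of how one might disable stealing to check this, assuming a local cluster; the distributed.scheduler.work-stealing config key is the usual knob, though the test itself may toggle stealing differently:

```python
import dask
from distributed import Client

# Disable work stealing before the scheduler starts, so the failing
# assertions can be checked with stealing out of the picture.
with dask.config.set({"distributed.scheduler.work-stealing": False}):
    client = Client(n_workers=2)
    # ... run the reproducer here ...
    client.close()
```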

Also, check that we don't release keys with dependents
@fjetter
Member Author

fjetter commented Jan 18, 2021

I'm currently testing the latest commit on a few clusters in parallel. So far, no funny logs, no deadlocks, no problems at all.

@fjetter
Member Author

fjetter commented Jan 18, 2021

The only thing I could observe so far is that the waiting_for_data_count is off, but I checked the workers and both data and tasks are empty, so I already suspected a bug in this counter. I believe this is not critical, though, since the counter is not used internally, merely for instrumentation. (The following graph shows states as exposed by Prometheus; these are not identical to the internal states.)

[image: graph of worker task states as exposed by Prometheus]
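
A minimal sketch of the kind of check described above, assuming the Worker attributes of the time (waiting_for_data_count, data, tasks) and a placeholder scheduler address:

```python
from distributed import Client

client = Client("tcp://scheduler-address:8786")  # placeholder address

def inspect(dask_worker):
    # Compare the instrumentation counter against the actual worker state.
    return {
        "waiting_for_data_count": dask_worker.waiting_for_data_count,
        "data": len(dask_worker.data),
        "tasks": len(dask_worker.tasks),
    }

print(client.run(inspect))  # one entry per worker
```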

@mrocklin
Member

mrocklin commented Jan 18, 2021 via email

@fjetter fjetter force-pushed the deadloc_from_scratch branch from 975caac to 1fc1b46 Compare January 18, 2021 17:01
@fjetter
Member Author

fjetter commented Jan 18, 2021

I needed to push again; there was a typo on my side causing a test to fail.

@fjetter fjetter marked this pull request as ready for review January 18, 2021 17:02
@fjetter fjetter changed the title Set runspec in add_task Set runspec on all new tasks to avoid deadlocks Jan 18, 2021
@fjetter
Member Author

fjetter commented Jan 18, 2021

Unfortunately, I needed to push another small change, but an expected one: we didn't clear who_has on the task state during release_key. That's not a big deal, I believe, and my tests so far show this still works nicely.
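
A hedged sketch of the kind of cleanup meant here, assuming the Worker internals of the time (tasks, has_what, and a TaskState.who_has set); this is not the actual diff:

```python
def release_key(self, key):
    # Sketch only: drop the who_has bookkeeping when a key is released,
    # so stale replica information cannot linger on the worker.
    ts = self.tasks.get(key)
    if ts is None:
        return
    for worker in ts.who_has:
        self.has_what[worker].discard(key)
    ts.who_has.clear()
    # ... rest of the release logic ...
```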

I'm dropping out now, but if no big CI issue pops up again, I feel confident merging this. I'm not sure it fixes all issues, but from what I've seen it should be a significant improvement over master.

@mrocklin
Member

@fjetter are there other issues that you know exist? If so, I'm happy to take a look at them today.

@fjetter
Member Author

fjetter commented Jan 18, 2021

On the other PR I originally wanted to fix an issue with erroneous tasks, which led me to discover this runspec thing. I added a test case with an object with broken deserialization. That should hit the worker-internal suspicious counter (but it doesn't/didn't), so something else might reveal itself there. Otherwise I don't have anything on my mind right now.
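
For illustration, a minimal standalone sketch of an object with broken deserialization; the actual test case in the PR may differ:

```python
import pickle

class BrokenDeserialization:
    def __getstate__(self):
        return {}

    def __setstate__(self, state):
        # Pickling succeeds, but any attempt to unpickle fails -- which is
        # what a worker receiving this object as a dependency would hit.
        raise Exception("broken deserialization")

blob = pickle.dumps(BrokenDeserialization())
try:
    pickle.loads(blob)
except Exception as e:
    print(f"deserialization failed as intended: {e}")
```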

@mrocklin
Member

OK, once tests pass here I'm going to merge. If things are OK for a bit, I suggest that we release. cc @jrbourbeau

@mrocklin mrocklin merged commit 9442d9b into dask:master Jan 18, 2021
@mrocklin
Member

I briefly tried the exercise in #4430 and things didn't complete. I need to double-check that work, though.

@fjetter
Member Author

fjetter commented Jan 19, 2021

I had already had a few successful runs of the example in #4430, and those worked out great. That was using an earlier commit of this PR; I'll double-check with master now.

Really happy to see this merged. Thanks for your help @mrocklin !

@fjetter fjetter deleted the deadloc_from_scratch branch January 19, 2021 07:09
@mrocklin
Member

Ah, indeed, things with that workflow are good for me as well now. Good for a release whenever makes sense, @jrbourbeau. I suspect that you're aiming for this Friday.

@jrbourbeau
Member

xref dask/community#121
