
Conversation

@fjetter (Member) commented Dec 4, 2020

TL;DR: If we encounter an application error during get_data (e.g. in a custom deserializer), we somehow end up reusing comm objects and swallowing handler calls, which can drive the cluster into a corrupt state (sometimes self-healing).

This is a funny one, again. It is not a fix yet, but an analysis of the problem with a semi-complete test (I'm not sure yet what the exact outcome of the test should be, but it reproduces the problem in a simple way).

A tiny bit of context: we had some more or less deterministic deserialization issues connected to custom (de-)serialization code. We fixed that, so I'm not sure how critical this problem actually is, but from my understanding it could put the cluster into a corrupt state, at least for a while. I've seen retries eventually recover it, so I guess this is just weird rather than critical.
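
For concreteness, here is a minimal sketch of the kind of custom serialization code that can trigger this path, using the dask_serialize/dask_deserialize dispatch. `BrokenPayload` and the registered functions are illustrative only and are not the actual test in this PR; the key point is just that deserializing a transferred result raises on the receiving worker:

```python
from distributed.protocol import dask_serialize, dask_deserialize


class BrokenPayload:
    """Illustrative object whose deserialization deterministically fails."""

    def __init__(self, data: bytes):
        self.data = data


@dask_serialize.register(BrokenPayload)
def _serialize_broken(obj):
    # Serialization itself works fine ...
    return {}, [obj.data]


@dask_deserialize.register(BrokenPayload)
def _deserialize_broken(header, frames):
    # ... but deserialization on the receiving worker raises, which is the
    # kind of application error during get_data this PR is about.
    raise ValueError("Failed to deserialize")
```

Any workload that moves such objects between workers (e.g. a shuffle) then hits the "Failed to deserialize" path on the fetching side.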

On top of the "Failed to deserialize" messages (which are obviously expected), we got things like:

  File "/mnt/mesos/sandbox/venv/lib/python3.6/site-packages/distributed/worker.py", line 1283, in get_data
    assert response == "OK", response

AssertionError: {'op': 'get_data', 'keys': ("('shuffle-split-7371cf941c2c38490c6b2db8b38e4e98', 2, 1, (6, 2))", "('shuffle-split-7371cf941c2c38490c6b2db8b38e4e98', 2, 1, (6, 1))"), 'who': 'tls://10.3.136.133:31114', 'max_connections': None, 'reply': True}
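
This assertion sits on the serving side of the get_data handshake: after writing the requested data, the worker reads back an acknowledgement and expects the literal "OK". What it got instead is a brand-new get_data request, which is consistent with the requesting side dropping out of the handshake after the deserialization error and handing the comm back to its pool, which then reuses it for the next request. A rough, simplified sketch of the two sides (illustrative only, not the actual implementation):

```python
# Simplified sketch of the get_data handshake; names and signatures are
# approximations, not the actual distributed code.

# Serving worker (Worker.get_data): send the data, then expect an "OK" ack.
async def serve_get_data(comm, data):
    await comm.write({"status": "OK", "data": data})
    response = await comm.read()
    # Fails with a fresh get_data request dict (as in the traceback above)
    # when the comm was reused for a new request before this handshake
    # completed.
    assert response == "OK", response


# Requesting worker: if deserializing the response raises, the final
# comm.write("OK") never happens, yet the comm is handed back to the
# connection pool and reused for the next request.
async def fetch_data(pool, address, keys, who):
    comm = await pool.connect(address)
    try:
        await comm.write(
            {"op": "get_data", "keys": keys, "who": who,
             "max_connections": None, "reply": True}
        )
        response = await comm.read()  # raises on the deserialization error
        await comm.write("OK")        # skipped when read() raises
        return response
    finally:
        pool.reuse(address, comm)     # comm goes back to the pool regardless
```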

If you follow the XXX comments from 1 to 4, they lead through the code in the order in which the events happen to produce this issue.

