
Conversation

@fjetter (Member) commented Dec 4, 2020

TL;DR: If we encounter an application error during get_data (e.g. in a custom deserializer), we somehow end up reusing comm objects and swallowing handler calls, which can drive the cluster into a corrupt state (sometimes self-healing).

This is a funny one, again. It is not a fix yet, but an analysis of the problem with a semi-complete test (I'm not sure yet what the exact outcome of the test should be, but it reproduces the problem in a simple way).

A tiny bit of context: we had some more or less deterministic deserialization issues connected to custom (de-)serialization code. We fixed that, so I'm not sure how critical this problem actually is, but from my understanding it could put the cluster into a corrupt state, at least for a while. I've seen retries eventually recover it, so I guess this is just weird rather than critical.
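
For concreteness, here is a minimal sketch of the kind of custom serialization code that can trigger this path, using the dask_serialize/dask_deserialize dispatch. `BrokenPayload` and the registered functions are illustrative only and are not the actual test in this PR; the key point is just that deserializing a transferred result raises on the receiving worker:

```python
from distributed.protocol import dask_serialize, dask_deserialize


class BrokenPayload:
    """Illustrative object whose deserialization deterministically fails."""

    def __init__(self, data: bytes):
        self.data = data


@dask_serialize.register(BrokenPayload)
def _serialize_broken(obj):
    # Serialization itself works fine ...
    return {}, [obj.data]


@dask_deserialize.register(BrokenPayload)
def _deserialize_broken(header, frames):
    # ... but deserialization on the receiving worker raises, which is the
    # kind of application error during get_data this PR is about.
    raise ValueError("Failed to deserialize")
```

Any workload that moves such objects between workers (e.g. a shuffle) then hits the "Failed to deserialize" path on the fetching side.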

On top of the "Failed to deserialize" messages (which are obviously expected), we got things like:

  File "/mnt/mesos/sandbox/venv/lib/python3.6/site-packages/distributed/worker.py", line 1283, in get_data
    assert response == "OK", response

AssertionError: {'op': 'get_data', 'keys': ("('shuffle-split-7371cf941c2c38490c6b2db8b38e4e98', 2, 1, (6, 2))", "('shuffle-split-7371cf941c2c38490c6b2db8b38e4e98', 2, 1, (6, 1))"), 'who': 'tls://10.3.136.133:31114', 'max_connections': None, 'reply': True}
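
This assertion sits on the serving side of the get_data handshake: after writing the requested data, the worker reads back an acknowledgement and expects the literal "OK". What it got instead is a brand-new get_data request, which is consistent with the requesting side dropping out of the handshake after the deserialization error and handing the comm back to its pool, which then reuses it for the next request. A rough, simplified sketch of the two sides (illustrative only, not the actual implementation):

```python
# Simplified sketch of the get_data handshake; names and signatures are
# approximations, not the actual distributed code.

# Serving worker (Worker.get_data): send the data, then expect an "OK" ack.
async def serve_get_data(comm, data):
    await comm.write({"status": "OK", "data": data})
    response = await comm.read()
    # Fails with a fresh get_data request dict (as in the traceback above)
    # when the comm was reused for a new request before this handshake
    # completed.
    assert response == "OK", response


# Requesting worker: if deserializing the response raises, the final
# comm.write("OK") never happens, yet the comm is handed back to the
# connection pool and reused for the next request.
async def fetch_data(pool, address, keys, who):
    comm = await pool.connect(address)
    try:
        await comm.write(
            {"op": "get_data", "keys": keys, "who": who,
             "max_connections": None, "reply": True}
        )
        response = await comm.read()  # raises on the deserialization error
        await comm.write("OK")        # skipped when read() raises
        return response
    finally:
        pool.reuse(address, comm)     # comm goes back to the pool regardless
```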

If you follow the XXX comments from 1 to 4, they lead through the code in the order in which the events happen to produce this issue.

