KeyError: ('error', 'waiting') #4800

@mdering

What happened:
Sometimes the worker logs show the following KeyError: ('error', 'waiting'):

ERROR:tornado.application:Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x0000000003C90C88>>, <Task finished coro=<Worker.handle_scheduler() done, defined at \miniconda3\lib\site-packages\distributed\worker.py:997> exception=KeyError(('error', 'waiting'))>)
Traceback (most recent call last):
  File "miniconda3\lib\site-packages\tornado\ioloop.py", line 741, in _run_callback
    ret = callback()
  File "miniconda3\lib\site-packages\tornado\ioloop.py", line 765, in _discard_future_result
    future.result()
  File "\miniconda3\lib\site-packages\distributed\worker.py", line 1000, in handle_scheduler
    comm, every_cycle=[self.ensure_communicating, self.ensure_computing]
  File "miniconda3\lib\site-packages\distributed\core.py", line 573, in handle_stream
    handler(**merge(extra, msg))
  File "\miniconda3\lib\site-packages\distributed\worker.py", line 1502, in add_task
    self.transition(ts, "waiting", runspec=runspec)
  File "\miniconda3\lib\site-packages\distributed\worker.py", line 1602, in transition
    func = self._transitions[start, finish]
KeyError: ('error', 'waiting')

This appears right before the worker seems to lose its connection; eventually the TTL lapses and the worker dies. At the same time, the scheduler logs show the following:

ERROR - 2021-05-08 02:18:25 - distributed.utils.log_errors.l673 - 'TASK_KEY'
Traceback (most recent call last):
  File "\miniconda3\lib\site-packages\distributed\utils.py", line 668, in log_errors
    yield
  File "\miniconda3\lib\site-packages\distributed\scheduler.py", line 3986, in add_worker
    typename=types[key],
KeyError: 'TASK_KEY'
ERROR - 2021-05-08 02:18:25 - distributed.core.handle_comm.l507 - Exception while handling op register-worker
Traceback (most recent call last):
  File "\miniconda3\lib\site-packages\distributed\core.py", line 501, in handle_comm
    result = await result
  File "\miniconda3\lib\site-packages\distributed\scheduler.py", line 3986, in add_worker
    typename=types[key],
KeyError: 'TASK_KEY'
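
For illustration only, here is a minimal sketch of how a lookup like the `typename=types[key]` line in the scheduler traceback can fail. It assumes `types` behaves like a plain dict keyed by task key, and that the worker reported a key for which no type was recorded. The `nbytes` dict and both helper functions are hypothetical names, not part of distributed.

```python
# Hypothetical stand-ins: these helpers are NOT distributed's API.
def typename_strict(types, key):
    # Mirrors the shape of `types[key]` from the traceback: raises if key is absent.
    return types[key]


def typename_tolerant(types, key, default=None):
    # A tolerant variant: returns a default instead of raising.
    return types.get(key, default)


# Assumed scenario: the worker reports a key but no matching type entry.
nbytes = {"TASK_KEY": 128}
types = {}

for key in nbytes:
    try:
        typename_strict(types, key)
    except KeyError as exc:
        print("strict lookup failed for", exc.args[0])
    print("tolerant lookup returned", typename_tolerant(types, key))
```

Whether a missing type name would be acceptable to the scheduler further down is not something this sketch can answer; it only shows the difference between the two lookups.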

I'm not totally sure how to handle this, so I've put in some fixes where we handle errors more generally. However, when I look at the worker source code, I see this line:

if ts.state == "erred":

This handles the case where the state is set to erred, but not error. Could this be an oversight, or is there something else I'm missing? I'm not clear on whether this will properly transition the task; if it did, perhaps the scheduler types lookup above would not fail. Alternatively, could the scheduler types lookup need to handle these key errors?
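
To make the worker-side failure concrete, here is a minimal sketch of a transition table keyed by (start, finish) pairs, modelled on the `self._transitions[start, finish]` line in the traceback. The `MiniWorker` class, its three registered transitions, and the dict-based task are assumptions for illustration; they are not the real worker implementation.

```python
# Hypothetical, simplified stand-in for a worker's transition table.
class MiniWorker:
    def __init__(self):
        # Only these (start, finish) pairs are registered; ("error", "waiting") is not.
        self._transitions = {
            ("waiting", "ready"): self._noop,
            ("ready", "executing"): self._noop,
            ("executing", "error"): self._noop,
        }

    def _noop(self, task, **kwargs):
        return None

    def transition(self, task, finish, **kwargs):
        start = task["state"]
        # Same lookup shape as in the traceback: raises KeyError for unknown pairs.
        func = self._transitions[start, finish]
        func(task, **kwargs)
        task["state"] = finish


def try_transition(worker, task, finish):
    """Return None on success, or the missing (start, finish) pair on failure."""
    try:
        worker.transition(task, finish)
        return None
    except KeyError as exc:
        return exc.args[0]


worker = MiniWorker()
task = {"key": "TASK_KEY", "state": "error"}
missing = try_transition(worker, task, "waiting")
print(missing)
```

In this sketch a task already in the error state is asked to go to waiting, and because no handler is registered for that pair the lookup raises and the task state is left unchanged.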

Anything else we need to know?:
I'm sorry, I once again do not have a minimal working example. I hope this is enough to go on.

Environment:

  • Dask version: 2021.04.1
  • Python version: 3.7
  • Operating System: Windows
  • Install method (conda, pip, source): conda
