BaseException in task leads to task never completing #5958

@gjoseph92

What happened:

If a task raises a BaseException (KeyboardInterrupt, SystemExit, etc.), the task will appear to be processing forever.
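One plausible explanation, which the traceback is consistent with but does not by itself confirm, is that the code running the task only handles `Exception`. `KeyboardInterrupt` and `SystemExit` derive from `BaseException` directly, so such a handler would never see them and no result or error would be reported for the task. The sketch below illustrates that mechanism with plain Python; `run_task` is a hypothetical stand-in and not distributed's actual code.

```python
def run_task(fn):
    """Hypothetical stand-in for a worker's task runner (not distributed's code)."""
    try:
        return ("finished", fn())
    except Exception as exc:  # does NOT match KeyboardInterrupt or SystemExit
        return ("erred", exc)


def raises_value_error():
    raise ValueError("ordinary failure")


def raises_base_exception():
    raise BaseException("this could be a KeyboardInterrupt!")


# KeyboardInterrupt and SystemExit derive from BaseException, not Exception.
assert not issubclass(KeyboardInterrupt, Exception)
assert not issubclass(SystemExit, Exception)

# An ordinary exception is handled, so the task gets an "erred" outcome.
status, _ = run_task(raises_value_error)

# A bare BaseException bypasses the handler, so no outcome is ever recorded.
try:
    run_task(raises_base_exception)
    escaped = False
except BaseException:
    escaped = True
```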

What you expected to happen:

The task should definitely not deadlock, but I'm not sure what should actually happen. It could go one of two ways:

  1. The entire worker should shut down gracefully.
  2. We should catch them just like any other exceptions and error the task.

More discussion in comments.
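As a rough illustration of option 2, the sketch below catches `BaseException` around the user function and turns it into an error result instead of letting it escape. The function name `apply_function_catching_all` and the shape of the returned dictionary are assumptions made for this example; they are not distributed's API. Option 1 would instead let the exception propagate and trigger a graceful worker shutdown, which is not shown here.

```python
def apply_function_catching_all(function, args=(), kwargs=None):
    """Hypothetical wrapper: report any BaseException as a task error.

    The returned dictionary layout is an assumption for illustration only.
    """
    kwargs = kwargs or {}
    try:
        result = function(*args, **kwargs)
    except BaseException as exc:  # also covers KeyboardInterrupt and SystemExit
        return {
            "status": "error",
            "exception": exc,
            "exception_type": type(exc).__name__,
        }
    return {"status": "OK", "result": result}


def raiser():
    raise KeyboardInterrupt("raised inside a task")


# The interrupt is captured as an error result rather than escaping the wrapper.
erred = apply_function_catching_all(raiser)

# A normal function still returns its value.
ok = apply_function_catching_all(lambda x, y: x + y, args=(1, 2))
```

A trade-off of this approach is that a genuine `SystemExit` or `KeyboardInterrupt` aimed at the worker process itself would be indistinguishable from one raised by user code, which is part of why the choice between the two options is not obvious.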

Minimal Complete Verifiable Example:

In [1]: import distributed

In [2]: client = distributed.Client(n_workers=1)

In [3]: def raiser():
   ...:     raise BaseException("this could be a KeyboardInterrupt!")
   ...: 

In [4]: f = client.submit(raiser)

In [5]: Exception in callback IOLoop.add_future.<locals>.<lambda>(<Task finishe...dInterrupt!')>) at /Users/gabe/miniconda3/envs/dask-distributed/lib/python3.9/site-packages/tornado/ioloop.py:688
handle: <Handle IOLoop.add_future.<locals>.<lambda>(<Task finishe...dInterrupt!')>) at /Users/gabe/miniconda3/envs/dask-distributed/lib/python3.9/site-packages/tornado/ioloop.py:688>
Traceback (most recent call last):
  File "/Users/gabe/miniconda3/envs/dask-distributed/lib/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/Users/gabe/miniconda3/envs/dask-distributed/lib/python3.9/site-packages/tornado/ioloop.py", line 688, in <lambda>
    lambda f: self._run_callback(functools.partial(callback, future))
  File "/Users/gabe/miniconda3/envs/dask-distributed/lib/python3.9/site-packages/tornado/ioloop.py", line 741, in _run_callback
    ret = callback()
  File "/Users/gabe/miniconda3/envs/dask-distributed/lib/python3.9/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
    future.result()
  File "/Users/gabe/dev/distributed/distributed/worker.py", line 3504, in execute
    result = await self.loop.run_in_executor(
  File "/Users/gabe/dev/distributed/distributed/_concurrent_futures_thread.py", line 65, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/Users/gabe/dev/distributed/distributed/worker.py", line 4503, in apply_function
    msg = apply_function_simple(function, args, kwargs, time_delay)
  File "/Users/gabe/dev/distributed/distributed/worker.py", line 4525, in apply_function_simple
    result = function(*args, **kwargs)
  File "<ipython-input-3-7f701e80695b>", line 2, in raiser
BaseException: this could be a KeyboardInterrupt!
In [5]: 

In [5]: client.processing()
Out[5]: {'tcp://127.0.0.1:58316': ('raiser-e3b7ab59305f9e4ddb4ecddd75c55f85',)}

In [6]: client.call_stack()
Out[6]: 
{'tcp://127.0.0.1:58316': {'raiser-e3b7ab59305f9e4ddb4ecddd75c55f85': ('  File "/Users/gabe/miniconda3/envs/dask-distributed/lib/python3.9/threading.py", line 912, in _bootstrap\n\tself._bootstrap_inner()\n',
   '  File "/Users/gabe/miniconda3/envs/dask-distributed/lib/python3.9/threading.py", line 954, in _bootstrap_inner\n\tself.run()\n',
   '  File "/Users/gabe/miniconda3/envs/dask-distributed/lib/python3.9/threading.py", line 892, in run\n\tself._target(*self._args, **self._kwargs)\n',
   '  File "/Users/gabe/dev/distributed/distributed/threadpoolexecutor.py", line 51, in _worker\n\ttask = work_queue.get(timeout=1)\n',
   '  File "/Users/gabe/miniconda3/envs/dask-distributed/lib/python3.9/queue.py", line 180, in get\n\tself.not_empty.wait(remaining)\n',
   '  File "/Users/gabe/miniconda3/envs/dask-distributed/lib/python3.9/threading.py", line 316, in wait\n\tgotit = waiter.acquire(True, timeout)\n')}}

In [7]: client.submit(lambda: 1).result(timeout=5)  # the worker still works fine; just that task is stuck now
Out[7]: 1

Anything else we need to know?:

Environment:

  • Dask version: 2022.2.1
  • Python version: 3.9.5
  • Operating System: macOS
  • Install method (conda, pip, source): source

cc @fjetter @graingert @crusaderky

Metadata

Labels

  • bug: Something is broken
  • deadlock: The cluster appears to not make any progress
  • stability: Issue or feature related to cluster stability (e.g. deadlock)
