[SPARK-1104] Worker should not block while killing executors - ASF Jira

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.9.0, 1.0.0
Fix Version/s: 1.0.0
Component/s: Deploy
Labels: None

Description

Sometimes, due to large shuffles, executors take a long time to shut down. In particular, this can happen when large numbers of shuffle files are around (this will be alleviated by SPARK-1103, but nonetheless...).

The symptom is that DEAD workers sit around in the UI while the existing workers keep trying to re-register and fail, because they have already been assumed dead.

If killing the executor happened in its own thread, or if the ExecutorRunner were an actor, this would not be a problem, since the worker would no longer block while the executor shuts down. For 0.9 I'd prefer the former approach since it minimizes code changes.
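
To illustrate the first option (killing the executor in its own thread), here is a minimal, language-neutral sketch in Python. It is not Spark's code: the `Worker` and `ExecutorRunner` classes, the `kill_async` and `send_heartbeat` methods, and the timings are all hypothetical stand-ins, and the slow shutdown is simulated with a sleep. The point is only that the worker hands the slow kill to a separate thread and stays free to do its other work in the meantime.

```python
import threading
import time


class ExecutorRunner:
    """Hypothetical stand-in for a worker-side executor handle (not Spark's class)."""

    def __init__(self, executor_id, shutdown_seconds):
        self.executor_id = executor_id
        self.shutdown_seconds = shutdown_seconds
        self.stopped = threading.Event()

    def _kill_blocking(self):
        # Simulates a slow shutdown, e.g. one delayed by many shuffle files.
        time.sleep(self.shutdown_seconds)
        self.stopped.set()

    def kill_async(self):
        # Run the slow kill on its own thread so the caller returns immediately.
        thread = threading.Thread(
            target=self._kill_blocking,
            name=f"kill-{self.executor_id}",
            daemon=True,
        )
        thread.start()
        return thread


class Worker:
    """Hypothetical worker that must keep doing periodic work (assumed: heartbeats)."""

    def __init__(self):
        self.heartbeats_sent = 0

    def send_heartbeat(self):
        self.heartbeats_sent += 1

    def handle_kill_executor(self, runner):
        # Does not wait for the executor to finish shutting down.
        return runner.kill_async()


def demo():
    worker = Worker()
    runner = ExecutorRunner("exec-0", shutdown_seconds=0.3)

    start = time.monotonic()
    kill_thread = worker.handle_kill_executor(runner)
    dispatch_time = time.monotonic() - start

    # The worker keeps making progress while the kill is still in flight.
    while kill_thread.is_alive():
        worker.send_heartbeat()
        time.sleep(0.05)

    kill_thread.join()
    return dispatch_time, worker.heartbeats_sent, runner.stopped.is_set()


if __name__ == "__main__":
    print(demo())
```

The actor-based alternative would reach the same goal by giving the runner its own message loop; the thread version is shown here because the description prefers it as the smaller change.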

People

Assignee: Nan Zhu

Reporter: Patrick McFadin

Votes: 0

Watchers: 2

Dates

Created: 18/Feb/14 15:56

Updated: 24/Apr/14 22:57

Resolved: 24/Apr/14 22:57