Description
After large shuffles, executors can take a long time to shut down. In particular, this can happen when a large number of shuffle files are around (SPARK-1103 will alleviate this, but it can still happen).
The symptom is that DEAD workers sit around in the UI while the existing workers keep trying to re-register and fail, because they have already been assumed dead.
If the executor were killed in its own thread, or if the ExecutorRunner were an actor, this would not be a problem. For 0.9 I'd prefer the former approach, since it minimizes code changes.
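
A rough sketch of the first approach, written in Java for illustration rather than taken from the actual ExecutorRunner code. `SlowExecutor`, `killBlocking`, and `killAsync` are hypothetical names, and the slow shutdown is simulated with a sleep. The only point is that the caller returns right away while the slow kill finishes on a separate thread:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class AsyncKillSketch {

    /** Hypothetical stand-in for an executor whose shutdown is slow. */
    static class SlowExecutor {
        private final long shutdownMillis;
        private final CountDownLatch stopped = new CountDownLatch(1);

        SlowExecutor(long shutdownMillis) {
            this.shutdownMillis = shutdownMillis;
        }

        /** Blocks the calling thread until the (simulated) shutdown is done. */
        void killBlocking() {
            try {
                Thread.sleep(shutdownMillis); // assumption: stands in for slow cleanup
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            stopped.countDown();
        }

        /** Runs the blocking kill on its own daemon thread and returns immediately. */
        Thread killAsync() {
            Thread t = new Thread(this::killBlocking, "executor-kill");
            t.setDaemon(true);
            t.start();
            return t;
        }

        /** True once the shutdown has completed. */
        boolean isStopped() {
            return stopped.getCount() == 0;
        }

        /** Waits up to timeoutMillis for the shutdown; false on timeout or interrupt. */
        boolean awaitStopped(long timeoutMillis) {
            try {
                return stopped.await(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        SlowExecutor exec = new SlowExecutor(500);
        exec.killAsync();
        // The caller is free to keep handling messages while the kill runs.
        System.out.println("stopped right after killAsync: " + exec.isStopped());
        System.out.println("stopped after waiting: " + exec.awaitStopped(5000));
    }
}
```

With this shape the thread that requested the kill is never blocked by the cleanup, which is why it should need fewer changes than turning the ExecutorRunner into an actor.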