[Bug] [Workflow] Task timeout kill throws an exception (the cancelApplication method is called twice for the shellCommandExecutor task) #17436

@njnu-seafish

Description

Search before asking

  • I have searched the issues and found no similar issues.

What happened

1. Create a shell task and configure the timeout failure strategy

2. Manually kill the task. The logs show that the kill succeeds. (The cancelApplication method is called only once.)

2025-08-15 13:49:33.105 INFO [WorkerRpcServer-methodInvoker-224] - Publish TaskExecutorKillLifecycleEvent: {
"taskInstanceId" : 1081,
"eventCreateTime" : 1755236973105,
"type" : "KILL"
}
2025-08-15 13:49:33.147 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-87] - Begin killing task instance, processId: 749659
2025-08-15 13:49:33.449 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-87] - prepare to parse pid, raw pid string: sudo(749659)---1081.sh(749674)---sleep(749748)

2025-08-15 13:49:34.003 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-87] - Sending SIGINT to process group: 749659 749674 749748, command: sudo -u dolphinscheduler -i kill -s SIGINT 749659 749674 749748
2025-08-15 13:49:44.992 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-87] - Kill command: sudo -u dolphinscheduler -i kill -s SIGINT 749659 749674 749748, timed out, still running PIDs: 749659 749674 749748
2025-08-15 13:49:45.545 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-87] - Sending SIGTERM to process group: 749659 749674 749748, command: sudo -u dolphinscheduler -i kill -s SIGTERM 749659 749674 749748
2025-08-15 13:49:46.253 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-87] - Successfully killed process tree using SIGTERM, processId: 749659
2025-08-15 13:49:46.254 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-87] - Process tree for task: 1081 is killed or already finished, pid: 749659
2025-08-15 13:49:46.254 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-87] - Get appIds from worker 192.168.30.121:1234, taskLogPath: /data01/dolphinscheduler/20250815/149143631011392/1/1015/1081.log
2025-08-15 13:49:46.254 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-87] - Start finding appId in /data01/dolphinscheduler/20250815/149143631011392/1/1015/1081.log, fetch way: log
2025-08-15 13:49:46.254 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-87] - The appId is empty
2025-08-15 13:49:46.254 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-87] - Success fire TaskExecutorKillLifecycleEvent: {
"taskInstanceId" : 1081,
"eventCreateTime" : 1755236973105,
"type" : "KILL"
}
2025-08-15 13:49:46.360 INFO [exclusive-task-executor-container-worker-0] - process has exited. execute path:/data01/dolphinscheduler/exec/process/1081, processId:749659 ,exitStatusCode:143 ,processWaitForStatus:true ,processExitValue:143

3. However, when the task is killed because of a timeout, an exception is thrown. (The cancelApplication method is called twice.)

2025-08-15 16:55:37.289 INFO [WorkerRpcServer-methodInvoker-31] - Publish TaskExecutorKillLifecycleEvent: {
"taskInstanceId" : 1084,
"eventCreateTime" : 1755248137289,
"type" : "KILL"
}
2025-08-15 16:55:37.333 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-0] - Begin killing task instance, processId: 837363
2025-08-15 16:55:37.730 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-0] - prepare to parse pid, raw pid string: sudo(837363)---1084.sh(837379)---sleep(837453)

2025-08-15 16:55:38.316 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-0] - Sending SIGINT to process group: 837363 837379 837453, command: sudo -u dolphinscheduler -i kill -s SIGINT 837363 837379 837453
2025-08-15 16:55:49.325 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-0] - Kill command: sudo -u dolphinscheduler -i kill -s SIGINT 837363 837379 837453, timed out, still running PIDs: 837363 837379 837453
2025-08-15 16:55:49.876 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-0] - Sending SIGTERM to process group: 837363 837379 837453, command: sudo -u dolphinscheduler -i kill -s SIGTERM 837363 837379 837453
2025-08-15 16:55:50.166 ERROR [exclusive-task-executor-container-worker-0] - process has failure, the task timeout configuration value is:60, ready to kill ...
2025-08-15 16:55:50.167 INFO [exclusive-task-executor-container-worker-0] - Begin killing task instance, processId: 837363
2025-08-15 16:55:50.566 INFO [exclusive-task-executor-container-worker-0] - prepare to parse pid, raw pid string:
2025-08-15 16:55:50.567 ERROR [exclusive-task-executor-container-worker-0] - Kill task instance error, processId: 837363
java.lang.NumberFormatException: For input string: ""
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:592)
at java.lang.Integer.parseInt(Integer.java:615)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
at org.apache.dolphinscheduler.plugin.task.api.utils.ProcessUtils.kill(ProcessUtils.java:124)
at org.apache.dolphinscheduler.plugin.task.api.AbstractCommandExecutor.cancelApplication(AbstractCommandExecutor.java:216)
at org.apache.dolphinscheduler.plugin.task.api.AbstractCommandExecutor.run(AbstractCommandExecutor.java:196)
at org.apache.dolphinscheduler.plugin.task.shell.ShellTask.handle(ShellTask.java:85)
at org.apache.dolphinscheduler.server.worker.executor.PhysicalTaskExecutor.doTriggerTaskPlugin(PhysicalTaskExecutor.java:74)
at org.apache.dolphinscheduler.task.executor.AbstractTaskExecutor.start(AbstractTaskExecutor.java:80)
at org.apache.dolphinscheduler.task.executor.worker.TaskExecutorWorker.start(TaskExecutorWorker.java:65)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
2025-08-15 16:55:50.567 ERROR [exclusive-task-executor-container-worker-0] - Failed to kill process tree for task: 1084, pid: 837363
2025-08-15 16:55:50.567 INFO [exclusive-task-executor-container-worker-0] - Get appIds from worker 192.168.30.121:1234, taskLogPath: /data01/dolphinscheduler/20250815/149143631011392/1/1018/1084.log
2025-08-15 16:55:50.567 INFO [exclusive-task-executor-container-worker-0] - Start finding appId in /data01/dolphinscheduler/20250815/149143631011392/1/1018/1084.log, fetch way: log
2025-08-15 16:55:50.567 INFO [exclusive-task-executor-container-worker-0] - The appId is empty
2025-08-15 16:55:50.568 INFO [exclusive-task-executor-container-worker-0] - process has exited. execute path:/data01/dolphinscheduler/exec/process/1084, processId:837363 ,exitStatusCode:-1 ,processWaitForStatus:false ,processExitValue:143
2025-08-15 16:55:50.568 INFO [exclusive-task-executor-container-worker-0] - Publish TaskExecutorFailedLifecycleEvent: {
"taskInstanceId" : 1084,
"eventCreateTime" : 1755248150568,
"type" : "FAILED",
"workflowInstanceId" : 1018,
"workflowInstanceHost" : "192.168.30.11:5678",
"taskInstanceHost" : "192.168.30.121:1234",
"appIds" : "",
"endTime" : 1755248150568,
"latestReportTime" : null
}
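
In the timeout log above, the second kill attempt logs an empty raw pid string, and Integer.parseInt then fails on "" inside ProcessUtils.kill. The following is a minimal sketch of tolerant pid parsing, assuming the raw string has the format shown in the logs (a pid in parentheses after each process name). PidParser and parsePids are hypothetical names used for illustration, not DolphinScheduler's actual code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not the real ProcessUtils implementation.
final class PidParser {

    // Matches the numeric pid inside parentheses, e.g. "sleep(749748)".
    private static final Pattern PID_IN_PARENS = Pattern.compile("\\((\\d+)\\)");

    private PidParser() {
    }

    /**
     * Extracts pids from a string such as
     * "sudo(749659)---1081.sh(749674)---sleep(749748)".
     * Returns an empty list for null, blank, or pid-free input instead of throwing.
     */
    static List<Integer> parsePids(String rawPidString) {
        List<Integer> pids = new ArrayList<>();
        if (rawPidString == null || rawPidString.trim().isEmpty()) {
            // Assumption: an empty string means the process tree is already gone.
            return pids;
        }
        Matcher matcher = PID_IN_PARENS.matcher(rawPidString);
        while (matcher.find()) {
            pids.add(Integer.parseInt(matcher.group(1)));
        }
        return pids;
    }

    public static void main(String[] args) {
        // prints [837363, 837379, 837453]
        System.out.println(parsePids("sudo(837363)---1084.sh(837379)---sleep(837453)"));
        // prints []
        System.out.println(parsePids(""));
    }
}
```

With parsing like this, a caller could treat an empty list as "already finished" and skip sending signals, rather than surfacing a NumberFormatException.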

What you expected to happen

1. Killing a task on timeout should not throw an exception.
2. Ideally, the kill action should be triggered only once.
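
One possible way to make sure the kill action runs only once is an atomic flag that lets the first caller through and turns later calls into no-ops. The sketch below is only an illustration of that idea: KillOnceGuard and killOnce are hypothetical names, and how such a guard would be wired into cancelApplication is an assumption left open here.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical guard for illustration; names do not come from DolphinScheduler.
final class KillOnceGuard {

    private final AtomicBoolean killRequested = new AtomicBoolean(false);
    private final Runnable killAction;

    KillOnceGuard(Runnable killAction) {
        this.killAction = killAction;
    }

    /**
     * Runs the kill action only for the first caller.
     * Returns true if this call performed the kill, false if it was skipped.
     */
    boolean killOnce() {
        // compareAndSet succeeds for exactly one thread, even under contention.
        if (!killRequested.compareAndSet(false, true)) {
            return false;
        }
        killAction.run();
        return true;
    }
}
```

Note that in this sketch a second caller returns immediately, even if the first kill is still in progress. A real fix might instead need the second caller to wait for the first kill to finish.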

How to reproduce

  1. Create a shell task and configure the timeout failure strategy.
  2. Run the workflow and wait for the task to be killed after the timeout.

Anything else

No response

Version

dev

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Labels

backend, bug
