Description
Search before asking
- I have searched the issues and found no similar issues.
What happened
I tried to kill a Spark job running on YARN, but the kill failed.
The worker log shows "yarn: command not found".
After fixing this, the log shows that killing the YARN application still fails with an ExitCodeException. The exit code is 0, but errMsg is not empty.
The first log:
2025-07-22 10:48:27.128 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-36] - kill cmd:sudo -u dolphinscheduler sh /data01/dolphinscheduler/exec/process/147/application_1749462877863_5866.kill
2025-07-22 10:48:27.151 ERROR [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-36] - Kill yarn application [[application_1749462877863_5866]] failed
org.apache.dolphinscheduler.common.shell.AbstractShell$ExitCodeException: /data01/dolphinscheduler/exec/process/147/application_1749462877863_5866.kill: line 10: yarn: command not found
at org.apache.dolphinscheduler.common.shell.AbstractShell.runCommand(AbstractShell.java:205)
at org.apache.dolphinscheduler.common.shell.AbstractShell.run(AbstractShell.java:118)
at org.apache.dolphinscheduler.common.shell.ShellExecutor.execute(ShellExecutor.java:125)
at org.apache.dolphinscheduler.common.shell.ShellExecutor.execCommand(ShellExecutor.java:103)
at org.apache.dolphinscheduler.common.shell.ShellExecutor.execCommand(ShellExecutor.java:86)
at org.apache.dolphinscheduler.common.utils.OSUtils.exeShell(OSUtils.java:342)
at org.apache.dolphinscheduler.common.utils.OSUtils.exeCmd(OSUtils.java:331)
at org.apache.dolphinscheduler.plugin.task.api.am.YarnApplicationManager.execYarnKillCommand(YarnApplicationManager.java:91)
at org.apache.dolphinscheduler.plugin.task.api.am.YarnApplicationManager.killApplication(YarnApplicationManager.java:51)
at org.apache.dolphinscheduler.plugin.task.api.am.YarnApplicationManager.killApplication(YarnApplicationManager.java:38)
at org.apache.dolphinscheduler.plugin.task.api.utils.ProcessUtils.cancelApplication(ProcessUtils.java:345)
at org.apache.dolphinscheduler.plugin.task.api.AbstractCommandExecutor.cancelApplication(AbstractCommandExecutor.java:226)
at org.apache.dolphinscheduler.plugin.task.api.AbstractYarnTask.cancelApplication(AbstractYarnTask.java:91)
at org.apache.dolphinscheduler.plugin.task.api.AbstractRemoteTask.cancel(AbstractRemoteTask.java:39)
at org.apache.dolphinscheduler.server.worker.executor.PhysicalTaskExecutor.kill(PhysicalTaskExecutor.java:102)
at org.apache.dolphinscheduler.task.executor.listener.TaskExecutorLifecycleEventListener.onTaskExecutorKillLifecycleEvent(TaskExecutorLifecycleEventListener.java:88)
at org.apache.dolphinscheduler.task.executor.eventbus.TaskExecutorEventBusCoordinator.doFireTaskExecutorEventBus(TaskExecutorEventBusCoordinator.java:166)
at org.apache.dolphinscheduler.task.executor.eventbus.TaskExecutorEventBusCoordinator.lambda$fireTaskExecutorEventBus$1(TaskExecutorEventBusCoordinator.java:123)
at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)
at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1646)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
2025-07-22 10:48:27.151 ERROR [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-36] - Cancel application failed: /data01/dolphinscheduler/exec/process/147/application_1749462877863_5866.kill: line 10: yarn: command not found
After fixing this, the second log:
2025-07-22 14:45:15.928 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-97] - Successfully killed process tree using SIGTERM, processId: 1219746
2025-07-22 14:45:15.928 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-97] - Process tree for task: 150 is killed or already finished, pid: 1219746
2025-07-22 14:45:15.928 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-97] - Get appIds from worker xxxxx:1234, taskLogPath: /data01/dolphinscheduler/20250722/145403649079392/5/103/150.log
2025-07-22 14:45:15.928 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-97] - Start finding appId in /data01/dolphinscheduler/20250722/145403649079392/5/103/150.log, fetch way: log
2025-07-22 14:45:15.929 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-97] - Find appId: application_1749462877863_5903 from /data01/dolphinscheduler/20250722/145403649079392/5/103/150.log
2025-07-22 14:45:15.930 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-97] - get kerberos init command
2025-07-22 14:45:15.930 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-97] - kerberos init command: export KRB5_CONFIG=/etc/krb5.conf
kinit -k -t /etc/security/keytabs/hdfs.keytab hdfs/xxxxx || true
2025-07-22 14:45:15.930 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-97] - kill cmd:sudo -u dolphinscheduler -i sh /data01/dolphinscheduler/exec/process/150/application_1749462877863_5903.kill
2025-07-22 14:45:17.398 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-97] - exitCode: 0
2025-07-22 14:45:17.399 ERROR [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-97] - Kill yarn application [[application_1749462877863_5903]] failed
org.apache.dolphinscheduler.common.shell.AbstractShell$ExitCodeException: 2025-07-22 14:45:17,383 | INFO | impl.YarnClientImpl | Killed application application_1749462877863_5903
at org.apache.dolphinscheduler.common.shell.AbstractShell.runCommand(AbstractShell.java:206)
at org.apache.dolphinscheduler.common.shell.AbstractShell.run(AbstractShell.java:118)
at org.apache.dolphinscheduler.common.shell.ShellExecutor.execute(ShellExecutor.java:125)
at org.apache.dolphinscheduler.common.shell.ShellExecutor.execCommand(ShellExecutor.java:103)
at org.apache.dolphinscheduler.common.shell.ShellExecutor.execCommand(ShellExecutor.java:86)
at org.apache.dolphinscheduler.common.utils.OSUtils.exeShell(OSUtils.java:343)
at org.apache.dolphinscheduler.common.utils.OSUtils.exeCmd(OSUtils.java:332)
at org.apache.dolphinscheduler.plugin.task.api.am.YarnApplicationManager.execYarnKillCommand(YarnApplicationManager.java:91)
at org.apache.dolphinscheduler.plugin.task.api.am.YarnApplicationManager.killApplication(YarnApplicationManager.java:51)
at org.apache.dolphinscheduler.plugin.task.api.am.YarnApplicationManager.killApplication(YarnApplicationManager.java:38)
at org.apache.dolphinscheduler.plugin.task.api.utils.ProcessUtils.cancelApplication(ProcessUtils.java:345)
at org.apache.dolphinscheduler.plugin.task.api.AbstractCommandExecutor.cancelApplication(AbstractCommandExecutor.java:226)
at org.apache.dolphinscheduler.plugin.task.api.AbstractYarnTask.cancelApplication(AbstractYarnTask.java:91)
at org.apache.dolphinscheduler.plugin.task.api.AbstractRemoteTask.cancel(AbstractRemoteTask.java:39)
at org.apache.dolphinscheduler.server.worker.executor.PhysicalTaskExecutor.kill(PhysicalTaskExecutor.java:102)
at org.apache.dolphinscheduler.task.executor.listener.TaskExecutorLifecycleEventListener.onTaskExecutorKillLifecycleEvent(TaskExecutorLifecycleEventListener.java:88)
at org.apache.dolphinscheduler.task.executor.eventbus.TaskExecutorEventBusCoordinator.doFireTaskExecutorEventBus(TaskExecutorEventBusCoordinator.java:166)
at org.apache.dolphinscheduler.task.executor.eventbus.TaskExecutorEventBusCoordinator.lambda$fireTaskExecutorEventBus$1(TaskExecutorEventBusCoordinator.java:123)
at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)
at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1646)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
2025-07-22 14:45:17.399 ERROR [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-97] - Cancel application failed: 2025-07-22 14:45:17,383 | INFO | impl.YarnClientImpl | Killed application application_1749462877863_5903
What you expected to happen
DolphinScheduler terminates the YARN application successfully.
How to reproduce
I am not sure whether this is specific to my environment, but the kill fails consistently on my side.
Has anyone else encountered a similar problem?
Anything else
For the first problem ("yarn: command not found"), my test is as follows.
[root@xxxxx][~]
sudo -u dolphinscheduler yarn version
sudo: yarn: command not found
[root@xxxxx][~]
sudo -u dolphinscheduler -i yarn version
Hadoop 3.3.3
Source code repository Unknown -r Unknown
Compiled by root on 2023-07-31T01:58Z
Compiled with protoc 3.7.1
From source with checksum 9437955990f3957351278654266784fc
This command was run using /usr/local/hadoop-3.3.3_ccdp3.3.3_1.0.2/share/hadoop/common/hadoop-common-3.3.3.jar
[root@xxxxx][~]
su - dolphinscheduler
[dolphinscheduler@xxxxx][~]
$ yarn version
Hadoop 3.3.3
Source code repository Unknown -r Unknown
Compiled by root on 2023-07-31T01:58Z
Compiled with protoc 3.7.1
From source with checksum 9437955990f3957351278654266784fc
This command was run using /usr/local/hadoop-3.3.3_ccdp3.3.3_1.0.2/share/hadoop/common/hadoop-common-3.3.3.jar
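The difference between the two `sudo` invocations above is presumably the environment: with `-i`, sudo runs the target user's login shell, which reads the login profile files where `yarn` is likely added to `PATH`; without `-i`, that does not happen. A minimal shell sketch of the underlying mechanism, where `resolve` is a hypothetical helper (not part of DolphinScheduler) and the directories are placeholders:

```shell
#!/bin/sh
# Hypothetical helper, not part of DolphinScheduler: report whether command
# NAME can be resolved when PATH is set to PATH_VALUE.
resolve() {
    name=$1
    path_value=$2
    # Use a subshell so the caller's PATH is left untouched.
    if (PATH=$path_value; command -v "$name" >/dev/null 2>&1); then
        echo "found"
    else
        echo "missing"
    fi
}

# Same command name, different PATH, different result. This mirrors
# `sudo -u dolphinscheduler yarn` (PATH without the Hadoop bin directory)
# versus `sudo -u dolphinscheduler -i yarn` (PATH set up by the login profile).
resolve sh /bin:/usr/bin
resolve sh /nonexistent-dir
```

If that is the cause, running the kill script through a login shell (as the second log does with `sudo -u dolphinscheduler -i`) or setting `PATH` explicitly inside the script should make `yarn` resolvable.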
For the second problem (ExitCodeException although the exit code is 0), my test is as follows.
[root@xxxxx][/usr/local/dolphinscheduler]
yarn application -kill application_1749462877863_5866
Killing application application_1749462877863_5866
2025-07-22 14:03:59,361 | INFO | impl.YarnClientImpl | Killed application application_1749462877863_5866
[root@xxxxx][/usr/local/dolphinscheduler]
echo $?
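The manual run above suggests that `yarn application -kill` succeeds while still printing an INFO line, and the second log suggests that this line ends up in errMsg. A process can write to stderr and still exit with status 0, so the exit status is the more reliable success signal. A minimal shell sketch, where `fake_yarn_kill` is a hypothetical stand-in for the real command and `run_kill` is a hypothetical wrapper (neither is DolphinScheduler code):

```shell
#!/bin/sh
# Hypothetical stand-in for `yarn application -kill <appId>`: it prints a
# progress line to stdout, an INFO log line to stderr, and exits with 0.
fake_yarn_kill() {
    echo "Killing application $1"
    echo "INFO | impl.YarnClientImpl | Killed application $1" >&2
    return 0
}

# Hypothetical wrapper: capture stderr for logging, but decide success from
# the exit status alone.
run_kill() {
    # `2>&1 >/dev/null` sends stderr into the capture and discards stdout.
    err_msg=$(fake_yarn_kill "$1" 2>&1 >/dev/null)
    status=$?
    if [ "$status" -eq 0 ]; then
        echo "kill succeeded, stderr output: $err_msg"
    else
        echo "kill failed with exit code $status: $err_msg"
        return "$status"
    fi
}

run_kill application_0000000000000_0001
```

Applied to this issue, the analogous change would be for the shell executor to throw ExitCodeException only when the exit code is non-zero and to log stderr otherwise. Whether that matches the intended design of AbstractShell is for the maintainers to confirm.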
Version
3.3.0-alpha
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct