Skip to content

Conversation

@njnu-seafish
Copy link
Contributor

Purpose of the pull request

close #17436

Brief change log

  1. add string blank check
  2. remove redundant cancelApplication call

Verify this pull request

After fixing the bug, timeout kill logs:

2025-08-21 09:30:25.259 INFO [WorkerRpcServer-methodInvoker-155] - Publish TaskExecutorKillLifecycleEvent: {
"taskInstanceId" : 1341,
"eventCreateTime" : 1755739825259,
"type" : "KILL"
}
2025-08-21 09:30:25.302 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-22] - Begin killing task instance, processId: 1115493
2025-08-21 09:30:25.826 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-22] - prepare to parse pid, raw pid string: sudo(1115493)---1341.sh(1115497)---sleep(1115570)

2025-08-21 09:30:26.370 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-22] - Sending SIGINT to process group: 1115493 1115497 1115570, command: sudo -u dolphinscheduler -i kill -s SIGINT 1115493 1115497 1115570
2025-08-21 09:30:37.408 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-22] - Kill command: sudo -u dolphinscheduler -i kill -s SIGINT 1115493 1115497 1115570, timed out, still running PIDs: 1115493 1115497 1115570
2025-08-21 09:30:37.961 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-22] - Sending SIGTERM to process group: 1115493 1115497 1115570, command: sudo -u dolphinscheduler -i kill -s SIGTERM 1115493 1115497 1115570
2025-08-21 09:30:38.156 ERROR [exclusive-task-executor-container-worker-1] - process has failure due to timeout kill, the timeout value is:60, the taskTimeoutStrategy is:FAILED
2025-08-21 09:30:38.157 INFO [exclusive-task-executor-container-worker-1] - process has killed. execute path:/data01/dolphinscheduler/exec/process/1341, processId:1115493 ,exitStatusCode:-1 ,processWaitForStatus:false ,processExitValue:143
2025-08-21 09:30:38.157 INFO [exclusive-task-executor-container-worker-1] - Publish TaskExecutorFailedLifecycleEvent: {
"taskInstanceId" : 1341,
"eventCreateTime" : 1755739838157,
"type" : "FAILED",
"workflowInstanceId" : 1273,
"workflowInstanceHost" : "192.168.30.11:5678",
"taskInstanceHost" : "192.168.30.121:1234",
"appIds" : null,
"endTime" : 1755739838157,
"latestReportTime" : null
}
2025-08-21 09:30:38.157 INFO [exclusive-task-executor-container-worker-1] - Publish TaskExecutorFinalizeLifecycleEvent: {
"taskInstanceId" : 1341,
"eventCreateTime" : 1755739838157,
"type" : "FINALIZE"
}
2025-08-21 09:30:38.676 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-22] - Successfully killed process tree using SIGTERM, processId: 1115493
2025-08-21 09:30:38.676 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-22] - Process tree for task: 1341 is killed or already finished, pid: 1115493
2025-08-21 09:30:38.677 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-22] - Get appIds from worker 192.168.30.121:1234, taskLogPath: /data01/dolphinscheduler/20250821/149143631011392/7/1273/1341.log
2025-08-21 09:30:38.677 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-22] - Start finding appId in /data01/dolphinscheduler/20250821/149143631011392/7/1273/1341.log, fetch way: log
2025-08-21 09:30:38.677 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-22] - The appId is empty
2025-08-21 09:30:38.677 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-22] - Success fire TaskExecutorKillLifecycleEvent: {
"taskInstanceId" : 1341,
"eventCreateTime" : 1755739825259,
"type" : "KILL"
}
2025-08-21 09:30:38.718 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-20] - Success fire TaskExecutorFailedLifecycleEvent: {
"taskInstanceId" : 1341,
"eventCreateTime" : 1755739838157,
"type" : "FAILED",
"workflowInstanceId" : 1273,
"workflowInstanceHost" : "192.168.30.11:5678",
"taskInstanceHost" : "192.168.30.121:1234",
"appIds" : null,
"endTime" : 1755739838157,
"latestReportTime" : null
}
2025-08-21 09:30:38.768 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-21] -


********************************* Finalize task instance ************************************


2025-08-21 09:30:38.768 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-21] - FINALIZE_SESSION
2025-08-21 09:30:38.768 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-21] - Success fire TaskExecutorFinalizeLifecycleEvent: {
"taskInstanceId" : 1341,
"eventCreateTime" : 1755739838157,
"type" : "FINALIZE"
}

Pull Request Notice

Pull Request Notice

If your pull request contains incompatible change, you should also add it to docs/docs/en/guide/upgrade/incompatible.md

@njnu-seafish njnu-seafish changed the title [Fix 17436][Workflow]Task timeout kill throw exception [Fix-17436][Workflow]Task timeout kill throw exception Aug 21, 2025
@SbloodyS SbloodyS added the bug Something isn't working label Aug 21, 2025
@SbloodyS SbloodyS added this to the 3.3.1 milestone Aug 21, 2025
@SbloodyS SbloodyS modified the milestones: 3.3.1, 3.3.2 Aug 25, 2025
@SbloodyS SbloodyS requested a review from ruanwenjun August 28, 2025 06:23
Comment on lines +199 to +200
String exitLogMessage = (EXIT_CODE_KILL == exitCode || EXIT_CODE_HARD_KILL == exitCode) ? "process has killed."
: "process has exited.";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to check from status in taskExecutionContext rather then check from the exit code, we don't care about the exit, only kill from ds then we think it's kill.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to check from status in taskExecutionContext rather then check from the exit code, we don't care about the exit, only kill from ds then we think it's kill.

Here it simply outputs the final execution status of the process, indicating whether it exited normally or was killed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

String pids = getPidsStr(processId);
String[] pidArray = PID_PATTERN.split(pids);
if (pidArray.length == 0) {
if (StringUtils.isBlank(pids) || pidArray.length == 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ifStringUtils.isEmpty(pids) then we don't need to execute line 117.

Comment on lines 124 to 134
List<Integer> pidList = Arrays.stream(pidArray).filter(StringUtils::isNotBlank)
.map(s -> {
try {
return Integer.parseInt(s.trim());
} catch (NumberFormatException e) {
log.warn("Invalid PID string ignored: {}", s);
return null;
}
})
.filter(Objects::nonNull)
.collect(Collectors.toList());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to change getPidsStr return a validate pid list, rather than parse the pid in up layer.

Comment on lines 127 to 130
return Integer.parseInt(s.trim());
} catch (NumberFormatException e) {
log.warn("Invalid PID string ignored: {}", s);
return null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Throw exception here, catch the exception will hide the bug.

Copy link
Member

@ruanwenjun ruanwenjun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty good.

@njnu-seafish
Copy link
Contributor Author

@SbloodyS Could u plz help review when available? Thanks.

Copy link
Member

@SbloodyS SbloodyS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@sonarqubecloud
Copy link

sonarqubecloud bot commented Sep 9, 2025

Quality Gate Failed Quality Gate failed

Failed conditions
42.9% Coverage on New Code (required ≥ 60%)

See analysis details on SonarQube Cloud

@SbloodyS SbloodyS merged commit bfb9f17 into apache:dev Sep 9, 2025
70 of 71 checks passed
davidzollo pushed a commit to davidzollo/dolphinscheduler that referenced this pull request Oct 27, 2025
LeonYoah added a commit to LeonYoah/dolphinscheduler that referenced this pull request Nov 1, 2025
…hod (apache#17005)

   [Fix-17316][Task-API] Add check process status after killing task (apache#17320)
   [Fix-17436][Workflow]Task timeout kill throw exception (apache#17437  改造支持seatunnel集群模式停止任务
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend bug Something isn't working test

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] [Workflow] Task timeout kill throw exception(The cancelApplication method was called twice for the shellCommandExecutor task)

3 participants