[Fix-17316][Task-API] Add check process status after killing task #17320

njnu-seafish · 2025-07-03T13:00:12Z

Purpose of the pull request

close #17316

Brief change log

After killing the process, check the process status to confirm it has actually terminated.

Verify this pull request

This pull request is code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(or)

Pull Request Notice

If your pull request contains incompatible change, you should also add it to docs/docs/en/guide/upgrade/incompatible.md

SbloodyS

I tested issue #17316 based on your reproduction steps on macOS and Ubuntu 24.04 and could not reproduce it. Please describe your specific reproduction environment. @njnu-seafish

njnu-seafish · 2025-07-04T15:16:14Z

Sorry, I was very busy during the day and didn't have time to reply.
My environment is CentOS 7.
Below is an example I ran directly on my machine.
`
[dolphinscheduler@xxxxx][~]
$ more test.sh
echo ${JAVE_HOME};
sleep 10m

[dolphinscheduler@xxxxx][~]
$ date
Fri Jul 4 22:59:27 CST 2025

[dolphinscheduler@xxxxx][~]
$ sudo -u dolphinscheduler -i sh test.sh

[dolphinscheduler@xxxxx][~]
$ ps -ef | grep -i test.sh
root 1259617 1258564 0 22:59 pts/5 00:00:00 sudo -u dolphinscheduler -i sh test.sh

[dolphinscheduler@xxxxx][~]
$ pstree -p 1259617
sudo(1259617)───sh(1259618)───sleep(1259696)

// kill -s SIGINT returns 0, but the processes are not actually killed.
[dolphinscheduler@xxxxx][~]
$ sudo -u dolphinscheduler kill -s SIGINT 1259617 1259618 1259696

[dolphinscheduler@xxxxx][~]
$ echo $?
0

// waiting 5m
[dolphinscheduler@xxxxx][~]
$ date
Fri Jul 4 23:04:19 CST 2025

[dolphinscheduler@xxxxx][~]
$ ps -ef | grep -i test.sh
root 1259617 1258564 0 22:59 pts/5 00:00:00 sudo -u dolphinscheduler -i sh test.sh

[dolphinscheduler@xxxxx][~]
$ pstree -p 1259617
sudo(1259617)───sh(1259618)───sleep(1259696)

`

njnu-seafish · 2025-07-04T15:26:52Z

After fixing the bug according to the code above, the log of the task instance is as follows.
`
Final Shell file is:

****************************** Script Content *****************************************************************

#!/bin/bash

BASEDIR=$(cd `dirname $0`; pwd)

cd $BASEDIR

source /usr/local/dolphinscheduler/bin/env/dolphinscheduler_env.sh

echo ${JAVE_HOME}

sleep 10m

****************************** Script Content *****************************************************************

Executing shell command : sudo -u dolphinscheduler -i /data01/dolphinscheduler/exec/process/130/130.sh

process start, process id is: 643464

Begin killing task instance, processId: 643464

prepare to parse pid, raw pid string: sudo(643464)---130.sh(643479)---sleep(643559)

Kill command: kill -s SIGINT 643464 643479 643559, trying to terminate process

Kill command: sudo -u dolphinscheduler -i kill -s SIGINT 643464 643479 643559, kill failed, the process: 643464 is still running

Kill command: kill -s SIGTERM 643464 643479 643559, trying to terminate process

Kill command: sudo -u dolphinscheduler -i kill -s SIGTERM 643464 643479 643559, kill succeeded

Successfully killed process tree using SIGTERM, processId: 643464

Process tree for task: 130 is killed or already finished, pid: 643464

`
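The log above shows the escalation this fix performs: SIGINT is sent, the process is found still running, and SIGTERM is tried next. A minimal sketch of that flow follows. The `signalSender` and `aliveChecker` hooks and the class name are hypothetical stand-ins for the real `ProcessUtils` internals, which this page does not show, and SIGKILL as a final step is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;
import java.util.function.IntPredicate;

/** Illustrative only: names and structure are assumptions, not the PR's code. */
final class KillEscalationSketch {

    // Signals tried in order; SIGKILL as the last resort is an assumption.
    static final List<String> SIGNALS = List.of("SIGINT", "SIGTERM", "SIGKILL");

    /**
     * Sends each signal in turn to the PIDs that are still alive.
     * Returns the first signal after which no PID is reported alive,
     * or null if every signal failed.
     */
    static String killProcessTree(List<Integer> pids,
                                  BiConsumer<String, List<Integer>> signalSender,
                                  IntPredicate aliveChecker) {
        List<Integer> alive = new ArrayList<>(pids);
        for (String signal : SIGNALS) {
            signalSender.accept(signal, List.copyOf(alive));
            // A zero exit code from kill only means the signal was sent,
            // so liveness is verified explicitly instead of trusting it.
            alive.removeIf(pid -> !aliveChecker.test(pid));
            if (alive.isEmpty()) {
                return signal;
            }
        }
        return null;
    }
}
```

In real use the check would run after a waiting period rather than immediately, since a process may need time to exit after receiving a signal.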

...r-task-api/src/main/java/org/apache/dolphinscheduler/plugin/task/api/utils/ProcessUtils.java

dolphinscheduler-common/src/main/resources/common.properties

...eduler-e2e/dolphinscheduler-e2e-case/src/test/resources/docker/file-manage/common.properties

...r-task-api/src/main/java/org/apache/dolphinscheduler/plugin/task/api/utils/ProcessUtils.java

SbloodyS · 2025-07-08T02:40:46Z

Please run mvn spotless:apply to format the code. @njnu-seafish

...r-task-api/src/main/java/org/apache/dolphinscheduler/plugin/task/api/utils/ProcessUtils.java

ruanwenjun

LGTM

ruanwenjun · 2025-07-17T08:15:21Z

deploy/kubernetes/dolphinscheduler/Chart.yaml

 # This is the chart version. This version number should be incremented each time you make changes
 # to the chart and its templates, including the app version.
-version: 3.1.0
+version: 3.1.1


Why change this?

Copilot

Pull Request Overview

This pull request enhances the process termination mechanism in DolphinScheduler's Task API by adding proper process status verification after sending kill signals. The enhancement introduces a configurable timeout mechanism that waits for processes to terminate gracefully before escalating to more forceful signals.

- Adds process status verification using kill -0 to check whether processes have actually terminated after kill signals are sent
- Introduces a configurable timeout property shell.kill.wait.timeout (default 10 seconds) for waiting between signal escalations
- Adds test coverage for the new process-killing logic across different scenarios
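For context, `kill -0` sends no signal. It only performs the existence and permission checks, and its exit status reports whether the PID could be signalled. A minimal sketch of such a check, assuming a POSIX system with `sh` available. The class name and the single-argument signature are hypothetical, whereas the real `isProcessAlive` in this PR also takes a tenant code and goes through `OSUtils.exeCmd`.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.concurrent.TimeUnit;

/** Illustrative only: not the PR's implementation. */
final class ProcessAliveCheckSketch {

    /** Returns true if `kill -0 <pid>` exits with status 0. */
    static boolean isProcessAlive(long pid) {
        try {
            // Use the shell builtin so the sketch does not depend on a /bin/kill binary.
            Process check = new ProcessBuilder("sh", "-c", "kill -0 " + pid)
                    .redirectErrorStream(true)
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            if (!check.waitFor(5, TimeUnit.SECONDS)) {
                check.destroyForcibly();
                // The check itself hung; conservatively assume the target is alive.
                return true;
            }
            return check.exitValue() == 0;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return true; // Unknown state; conservatively assume alive.
        }
    }
}
```

Note that a non-zero exit can also mean the caller lacks permission to signal the process, which is one reason the real code runs the command as the tenant user.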

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

Show a summary per file

| File | Description |
| --- | --- |
| ProcessUtils.java | Core logic enhancement with process status checking and timeout-based waiting mechanism |
| ProcessUtilsTest.java | New test cases covering success scenarios with no alive PIDs, SIGINT success, and kill failure cases |
| Constants.java | Added new constant for shell kill wait timeout configuration |
| common.properties (multiple) | Configuration files updated with the new timeout property |
| values.yaml | Kubernetes deployment configuration updated with new timeout setting |
| README.md | Documentation updated to include new configuration parameter |
| Chart.yaml | Version bump from 3.1.0 to 3.1.1 |
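As a rough illustration of how the timeout property described in the overview might be read with a fallback, here is a sketch using plain `java.util.Properties`. The key name and the default of 10 come from the overview above; the class name, the helper method, and the assumption that the value is expressed in seconds are hypothetical, and DolphinScheduler's own property utilities are not shown on this page.

```java
import java.time.Duration;
import java.util.Properties;

/** Illustrative only: a plain-Properties stand-in for the project's config lookup. */
final class KillWaitTimeoutConfigSketch {

    static final String KEY = "shell.kill.wait.timeout";
    static final long DEFAULT_SECONDS = 10; // default stated in the overview

    /** Returns the configured wait timeout, or the default when the key is absent or blank. */
    static Duration killWaitTimeout(Properties props) {
        String raw = props.getProperty(KEY);
        if (raw == null || raw.isBlank()) {
            return Duration.ofSeconds(DEFAULT_SECONDS);
        }
        // Assumes the value is a whole number of seconds.
        return Duration.ofSeconds(Long.parseLong(raw.trim()));
    }
}
```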

Copilot · 2025-07-17T08:41:31Z

...r-task-api/src/main/java/org/apache/dolphinscheduler/plugin/task/api/utils/ProcessUtils.java

+            while (!alivePidList.isEmpty() && (System.currentTimeMillis() - startTime < timeoutMillis)) {
+                // Remove if process is no longer alive
+                alivePidList.removeIf(pid -> !isProcessAlive(pid, tenantCode));
+                if (!alivePidList.isEmpty()) {
+                    // Wait for a short interval before checking process statuses again, to avoid excessive CPU usage
+                    // from tight-loop polling.
+                    ThreadUtils.sleep(SLEEP_TIME_MILLIS);


[nitpick] The polling mechanism uses a fixed sleep interval which may not be optimal. Consider using exponential backoff or a configurable polling interval to balance responsiveness with CPU usage.

Suggested change

-            while (!alivePidList.isEmpty() && (System.currentTimeMillis() - startTime < timeoutMillis)) {
-                // Remove if process is no longer alive
-                alivePidList.removeIf(pid -> !isProcessAlive(pid, tenantCode));
-                if (!alivePidList.isEmpty()) {
-                    // Wait for a short interval before checking process statuses again, to avoid excessive CPU usage
-                    // from tight-loop polling.
-                    ThreadUtils.sleep(SLEEP_TIME_MILLIS);
+            long initialSleepMillis = 100; // Start with 100ms
+            long maxSleepMillis = 2000; // Cap the sleep interval at 2000ms
+            long currentSleepMillis = initialSleepMillis;
+            while (!alivePidList.isEmpty() && (System.currentTimeMillis() - startTime < timeoutMillis)) {
+                // Remove if process is no longer alive
+                alivePidList.removeIf(pid -> !isProcessAlive(pid, tenantCode));
+                if (!alivePidList.isEmpty()) {
+                    // Wait for a dynamically adjusted interval before checking process statuses again
+                    ThreadUtils.sleep(currentSleepMillis);
+                    // Increase the sleep interval exponentially, up to the maximum limit
+                    currentSleepMillis = Math.min(currentSleepMillis * 2, maxSleepMillis);
+                } else {
+                    // Reset the sleep interval if all processes are terminated
+                    currentSleepMillis = initialSleepMillis;
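The suggestion above is only a fragment. A self-contained sketch of the same idea, polling a liveness check with capped exponential backoff until a deadline, might look like the following. The class name and the `IntPredicate` hook are hypothetical, the 100 ms and 2000 ms values are taken from the suggestion, and this is not the merged implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

/** Illustrative only: backoff polling in isolation from ProcessUtils. */
final class BackoffPollSketch {

    /**
     * Polls until every PID is gone or the timeout elapses.
     * Returns the PIDs still alive when polling stopped.
     */
    static List<Integer> waitForExit(List<Integer> pids, IntPredicate isAlive, long timeoutMillis) {
        List<Integer> alive = new ArrayList<>(pids);
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        long sleepMillis = 100;            // initial interval
        final long maxSleepMillis = 2000;  // cap on the interval
        while (true) {
            // Drop PIDs that are no longer alive.
            alive.removeIf(pid -> !isAlive.test(pid));
            long remainingMillis = (deadline - System.nanoTime()) / 1_000_000L;
            if (alive.isEmpty() || remainingMillis <= 0) {
                return alive;
            }
            try {
                // Never sleep past the deadline.
                Thread.sleep(Math.min(sleepMillis, remainingMillis));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return alive;
            }
            // Double the interval, up to the cap.
            sleepMillis = Math.min(sleepMillis * 2, maxSleepMillis);
        }
    }
}
```

With a short overall timeout such as the 10 seconds mentioned in the overview, the backoff saves only a handful of checks, which may be why the reviewer flagged it as a nitpick.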

Copilot · 2025-07-17T08:41:31Z

...sk-api/src/test/java/org/apache/dolphinscheduler/plugin/task/api/utils/ProcessUtilsTest.java

+        // Initialize a counter to track how many times the method is invoked
+        AtomicInteger callCount = new AtomicInteger(0);
+        // Mock the static method OSUtils.exeCmd that matches "kill -0" command
+        mockedOSUtils.when(() -> OSUtils.exeCmd(Mockito.matches(".*kill -0.*")))
+                .thenAnswer(invocation -> {
+                    int count = callCount.incrementAndGet();
+                    // these calls will succeed (simulate process is alive)
+                    if (count == 1 || count == 2) {
+                        return "";
+                    } else {
+                        throw new RuntimeException("Command failed");
+                    }
+                });


The AtomicInteger with hardcoded call count logic (count == 1 || count == 2) makes the test brittle and difficult to understand. Consider using a more explicit mock setup with predefined responses.

Suggested change

-        // Initialize a counter to track how many times the method is invoked
-        AtomicInteger callCount = new AtomicInteger(0);
-        // Mock the static method OSUtils.exeCmd that matches "kill -0" command
-        mockedOSUtils.when(() -> OSUtils.exeCmd(Mockito.matches(".*kill -0.*")))
-                .thenAnswer(invocation -> {
-                    int count = callCount.incrementAndGet();
-                    // these calls will succeed (simulate process is alive)
-                    if (count == 1 || count == 2) {
-                        return "";
-                    } else {
-                        throw new RuntimeException("Command failed");
-                    }
-                });
+        // Mock the static method OSUtils.exeCmd that matches "kill -0" command
+        mockedOSUtils.when(() -> OSUtils.exeCmd(Mockito.matches(".*kill -0.*")))
+                .thenReturn("") // First invocation succeeds (process is alive)
+                .thenReturn("") // Second invocation succeeds (process is alive)
+                .thenThrow(new RuntimeException("Command failed")); // Subsequent invocations fail (process is no longer alive)

Copilot · 2025-07-17T08:41:31Z

...r-task-api/src/main/java/org/apache/dolphinscheduler/plugin/task/api/utils/ProcessUtils.java

+                // Remove if process is no longer alive
+                alivePidList.removeIf(pid -> !isProcessAlive(pid, tenantCode));


[nitpick] The while loop modifies alivePidList during iteration with removeIf(). While this is safe for ArrayList, it could be made more explicit by collecting results separately to avoid potential confusion.

Suggested change

-                // Remove if process is no longer alive
-                alivePidList.removeIf(pid -> !isProcessAlive(pid, tenantCode));
+                // Collect PIDs that are still alive
+                List<Integer> stillAlivePids = alivePidList.stream()
+                        .filter(pid -> isProcessAlive(pid, tenantCode))
+                        .collect(Collectors.toList());
+                alivePidList = stillAlivePids;

njnu-seafish · 2025-07-18T03:13:54Z

Unit test results varied across different operating systems (e.g., Linux, macOS, Windows), primarily due to:

mockedOSUtils.when(() -> OSUtils.exeCmd(Mockito.matches(".*pstree.*12345"))).thenReturn("1234 12345");

Should be updated to:

mockedOSUtils.when(() -> OSUtils.exeCmd(Mockito.matches(".*pstree.*12345"))).thenReturn("sudo(12345)---86.sh(1234)");
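The corrected mock uses the Linux `pstree -p` layout, where each PID appears in parentheses after the command name. A minimal sketch of extracting PIDs from such a string with a regex and a stream follows. The class and method names are hypothetical, and the actual parsing in `ProcessUtils` may differ, for example to handle the different `pstree` layout on macOS.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Illustrative only: parses the Linux-style "cmd(pid)---cmd(pid)" layout. */
final class PstreePidParserSketch {

    // Matches a run of digits enclosed in parentheses, e.g. "(12345)".
    private static final Pattern PID_IN_PARENS = Pattern.compile("\\((\\d+)\\)");

    /** Returns the PIDs in the order they appear in the pstree output. */
    static List<Integer> parsePids(String pstreeOutput) {
        return PID_IN_PARENS.matcher(pstreeOutput)
                .results()
                .map(match -> Integer.parseInt(match.group(1)))
                .collect(Collectors.toList());
    }
}
```

Applied to the string from the updated mock, `parsePids("sudo(12345)---86.sh(1234)")` yields the list `[12345, 1234]`. A command name that itself contains digits in parentheses would be misread by this pattern.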

njnu-seafish · 2025-07-18T05:48:46Z

Quality Gate passed

Issues 2 New issues 0 Accepted issues

Measures 0 Security Hotspots 91.4% Coverage on New Code 0.0% Duplication on New Code

See analysis details on SonarQube Cloud

I will remove the "throws Exception" clause from the test.

sonarqubecloud · 2025-07-18T07:03:01Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
91.4% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

ruanwenjun

LGTM

…ache#17320)

…hod (apache#17005) [Fix-17316][Task-API] Add check process status after killing task (apache#17320) [Fix-17436][Workflow]Task timeout kill throw exception (apache#17437 Rework to support stopping tasks in SeaTunnel cluster mode

苏义超 added 2 commits July 3, 2025 20:35

add process status check after kill

a31e3f7

add log

8899394

njnu-seafish requested review from Gallardot, SbloodyS and caishunfeng as code owners July 3, 2025 13:00

github-actions bot added the backend label Jul 3, 2025

github-actions bot assigned njnu-seafish Jul 3, 2025

SbloodyS reviewed Jul 4, 2025

njnu-seafish requested a review from SbloodyS July 4, 2025 15:29

SbloodyS reviewed Jul 7, 2025

...r-task-api/src/main/java/org/apache/dolphinscheduler/plugin/task/api/utils/ProcessUtils.java (outdated)

SbloodyS added the bug Something isn't working label Jul 7, 2025

SbloodyS added this to the 3.3.1 milestone Jul 7, 2025

SbloodyS added the first time contributor First-time contributor label Jul 7, 2025

njnu-seafish and others added 3 commits July 7, 2025 13:53

Merge branch 'apache:dev' into Fix-17316

cf8b24c

add process.status.check.delay in the configuration file

cffb80b

Merge remote-tracking branch 'origin/Fix-17316' into Fix-17316

1805ed5

github-actions bot added kubernetes test e2e e2e test document labels Jul 7, 2025

苏义超 added 2 commits July 7, 2025 15:08

update README.md

a7ccbb6

update README.md

bb1777a

SbloodyS reviewed Jul 7, 2025

update log

455f681

njnu-seafish requested a review from SbloodyS July 7, 2025 12:25

njnu-seafish changed the title ~~[Fix-17316][Task-API] Add check process status after kill~~ [Fix-17316][Task-API] Add check process status after killing task Jul 7, 2025

SbloodyS reviewed Jul 8, 2025

...r-task-api/src/main/java/org/apache/dolphinscheduler/plugin/task/api/utils/ProcessUtils.java

This comment was marked as outdated.

苏义超 added 2 commits July 16, 2025 09:19

Merge remote-tracking branch 'origin/Fix-17316' into Fix-17316

110813f

simplify PID string parsing by using streams

9467f46

njnu-seafish dismissed SbloodyS’s stale review via 9467f46 July 16, 2025 01:45

苏义超 added 2 commits July 16, 2025 11:36

update test

0ad1c1c

update test

51c8405

njnu-seafish requested review from SbloodyS and ruanwenjun July 16, 2025 03:46

update test

9ea5d5c

github-advanced-security bot found potential problems Jul 17, 2025

...r-task-api/src/main/java/org/apache/dolphinscheduler/plugin/task/api/utils/ProcessUtils.java (dismissed)

ruanwenjun previously approved these changes Jul 17, 2025

ruanwenjun requested a review from Copilot July 17, 2025 08:40

Copilot AI reviewed Jul 17, 2025

update test

37d5f5f

njnu-seafish dismissed ruanwenjun’s stale review via 37d5f5f July 17, 2025 12:20

update test

555422f

njnu-seafish requested a review from ruanwenjun July 18, 2025 02:48

苏义超 added 2 commits July 18, 2025 10:58

update test by copilot

a2364ba

update chart

efd7fc9

remove the throws Exception in the test

61a755b

ruanwenjun approved these changes Jul 18, 2025

SbloodyS approved these changes Jul 18, 2025

SbloodyS merged commit e8f8a8c into apache:dev Jul 18, 2025
73 of 96 checks passed

eco8848 pushed a commit to eco8848/dolphinscheduler that referenced this pull request Aug 8, 2025

[Fix-17316][Task-API] Add check process status after killing task (ap…

b753495

…ache#17320)

davidzollo pushed a commit to davidzollo/dolphinscheduler that referenced this pull request Oct 27, 2025

[Fix-17316][Task-API] Add check process status after killing task (ap…

c4d696b

…ache#17320)

ruanwenjun mentioned this pull request Dec 15, 2025

[Bug] [SeaTunnel] SeaTunnel streaming job status incorrect & task cannot be stopped from DolphinScheduler #17786

Open

3 tasks
