Skip to content

Conversation

@njnu-seafish
Copy link
Contributor

Purpose of the pull request

close #17316

Brief change log

after kill the process, and check process status

Verify this pull request

This pull request is code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(or)

Pull Request Notice

Pull Request Notice

If your pull request contains incompatible change, you should also add it to docs/docs/en/guide/upgrade/incompatible.md

Copy link
Member

@SbloodyS SbloodyS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tested this issue #17316 base on your reproduction steps in macos and ubuntu 24.04 and found this issue does not exist, please describe the specific reproduction environment. @njnu-seafish

@njnu-seafish
Copy link
Contributor Author

I've tested this issue #17316 base on your reproduction steps in macos and ubuntu 24.04 and found this issue does not exist, please describe the specific reproduction environment. @njnu-seafish

Sorry, I was very busy during the day and didn't have time to reply.
My environment is centos7.
Below is an example I tested directly on my machine.
`
[dolphinscheduler@xxxxx][~]
$ more test.sh
echo ${JAVE_HOME};
sleep 10m

[dolphinscheduler@xxxxx][~]
$ date
Fri Jul 4 22:59:27 CST 2025

[dolphinscheduler@xxxxx][~]
$ sudo -u dolphinscheduler -i sh test.sh

[dolphinscheduler@xxxxx][~]
$ ps -ef | grep -i test.sh
root 1259617 1258564 0 22:59 pts/5 00:00:00 sudo -u dolphinscheduler -i sh test.sh

[dolphinscheduler@xxxxx][~]
$ pstree -p 1259617
sudo(1259617)───sh(1259618)───sleep(1259696)

[dolphinscheduler@xxxxx][~]
$ pstree -p 1259617
sudo(1259617)───sh(1259618)───sleep(1259696)

// kill -s SIGINT return 0. However, the process was not actually killed.
[dolphinscheduler@xxxxx][~]
$ sudo -u dolphinscheduler kill -s SIGINT 1259617 1259618 1259696

[dolphinscheduler@xxxxx][~]
$ echo $?
0

// wating 5m
[dolphinscheduler@xxxxx][~]
$ date
Fri Jul 4 23:04:19 CST 2025

[dolphinscheduler@xxxxx][~]
$ ps -ef | grep -i test.sh
root 1259617 1258564 0 22:59 pts/5 00:00:00 sudo -u dolphinscheduler -i sh test.sh

[dolphinscheduler@xxxxx][~]
$ pstree -p 1259617
sudo(1259617)───sh(1259618)───sleep(1259696)

`

@njnu-seafish
Copy link
Contributor Author

njnu-seafish commented Jul 4, 2025

I've tested this issue #17316 base on your reproduction steps in macos and ubuntu 24.04 and found this issue does not exist, please describe the specific reproduction environment. @njnu-seafish

After fixing the bug according to the code above, the log of the task instance is as follows.
`
Final Shell file is:

****************************** Script Content *****************************************************************

#!/bin/bash

BASEDIR=$(cd dirname $0; pwd)

cd $BASEDIR

source /usr/local/dolphinscheduler/bin/env/dolphinscheduler_env.sh

echo ${JAVE_HOME}

sleep 10m

****************************** Script Content *****************************************************************

Executing shell command : sudo -u dolphinscheduler -i /data01/dolphinscheduler/exec/process/130/130.sh

process start, process id is: 643464

Begin killing task instance, processId: 643464

prepare to parse pid, raw pid string: sudo(643464)---130.sh(643479)---sleep(643559)

Kill command: kill -s SIGINT 643464 643479 643559, trying to terminate process

Kill command: sudo -u dolphinscheduler -i kill -s SIGINT 643464 643479 643559, kill failed, the process: 643464 is still running

Kill command: kill -s SIGTERM 643464 643479 643559, trying to terminate process

Kill command: sudo -u dolphinscheduler -i kill -s SIGTERM 643464 643479 643559, kill succeeded

Successfully killed process tree using SIGTERM, processId: 643464

Process tree for task: 130 is killed or already finished, pid: 643464

`

@njnu-seafish njnu-seafish requested a review from SbloodyS July 4, 2025 15:29
@SbloodyS SbloodyS added the bug Something isn't working label Jul 7, 2025
@SbloodyS SbloodyS added this to the 3.3.1 milestone Jul 7, 2025
@SbloodyS SbloodyS added the first time contributor First-time contributor label Jul 7, 2025
@njnu-seafish njnu-seafish requested a review from SbloodyS July 7, 2025 12:25
@njnu-seafish njnu-seafish changed the title [Fix-17316][Task-API] Add check process status after kill [Fix-17316][Task-API] Add check process status after killing task Jul 7, 2025
@SbloodyS
Copy link
Member

SbloodyS commented Jul 8, 2025

Please run mvn spotless:apply to format the code. @njnu-seafish

This comment was marked as outdated.

ruanwenjun
ruanwenjun previously approved these changes Jul 17, 2025
Copy link
Member

@ruanwenjun ruanwenjun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

# This is the chart version. This version number should be incremented each time you make changes
# to the chart and its templates, including the app version.
version: 3.1.0
version: 3.1.1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why change this?

@ruanwenjun ruanwenjun requested a review from Copilot July 17, 2025 08:40
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This pull request enhances the process termination mechanism in DolphinScheduler's Task API by adding proper process status verification after sending kill signals. The enhancement introduces a configurable timeout mechanism that waits for processes to terminate gracefully before escalating to more forceful signals.

  • Adds process status verification using kill -0 to check if processes are actually terminated after sending kill signals
  • Introduces a configurable timeout property shell.kill.wait.timeout (default 10 seconds) for waiting between signal escalation
  • Implements improved test coverage for the new process killing logic with different scenarios

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
ProcessUtils.java Core logic enhancement with process status checking and timeout-based waiting mechanism
ProcessUtilsTest.java New test cases covering success scenarios with no alive PIDs, SIGINT success, and kill failure cases
Constants.java Added new constant for shell kill wait timeout configuration
common.properties (multiple) Configuration files updated with the new timeout property
values.yaml Kubernetes deployment configuration updated with new timeout setting
README.md Documentation updated to include new configuration parameter
Chart.yaml Version bump from 3.1.0 to 3.1.1

Comment on lines +190 to +196
while (!alivePidList.isEmpty() && (System.currentTimeMillis() - startTime < timeoutMillis)) {
// Remove if process is no longer alive
alivePidList.removeIf(pid -> !isProcessAlive(pid, tenantCode));
if (!alivePidList.isEmpty()) {
// Wait for a short interval before checking process statuses again, to avoid excessive CPU usage
// from tight-loop polling.
ThreadUtils.sleep(SLEEP_TIME_MILLIS);
Copy link

Copilot AI Jul 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The polling mechanism uses a fixed sleep interval which may not be optimal. Consider using exponential backoff or a configurable polling interval to balance responsiveness with CPU usage.

Suggested change
while (!alivePidList.isEmpty() && (System.currentTimeMillis() - startTime < timeoutMillis)) {
// Remove if process is no longer alive
alivePidList.removeIf(pid -> !isProcessAlive(pid, tenantCode));
if (!alivePidList.isEmpty()) {
// Wait for a short interval before checking process statuses again, to avoid excessive CPU usage
// from tight-loop polling.
ThreadUtils.sleep(SLEEP_TIME_MILLIS);
long initialSleepMillis = 100; // Start with 100ms
long maxSleepMillis = 2000; // Cap the sleep interval at 2000ms
long currentSleepMillis = initialSleepMillis;
while (!alivePidList.isEmpty() && (System.currentTimeMillis() - startTime < timeoutMillis)) {
// Remove if process is no longer alive
alivePidList.removeIf(pid -> !isProcessAlive(pid, tenantCode));
if (!alivePidList.isEmpty()) {
// Wait for a dynamically adjusted interval before checking process statuses again
ThreadUtils.sleep(currentSleepMillis);
// Increase the sleep interval exponentially, up to the maximum limit
currentSleepMillis = Math.min(currentSleepMillis * 2, maxSleepMillis);
} else {
// Reset the sleep interval if all processes are terminated
currentSleepMillis = initialSleepMillis;

Copilot uses AI. Check for mistakes.
Comment on lines 161 to 173
// Initialize a counter to track how many times the method is invoked
AtomicInteger callCount = new AtomicInteger(0);
// Mock the static method OSUtils.exeCmd that matches "kill -0" command
mockedOSUtils.when(() -> OSUtils.exeCmd(Mockito.matches(".*kill -0.*")))
.thenAnswer(invocation -> {
int count = callCount.incrementAndGet();
// these calls will succeed (simulate process is alive)
if (count == 1 || count == 2) {
return "";
} else {
throw new RuntimeException("Command failed");
}
});
Copy link

Copilot AI Jul 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The AtomicInteger with hardcoded call count logic (count == 1 || count == 2) makes the test brittle and difficult to understand. Consider using a more explicit mock setup with predefined responses.

Suggested change
// Initialize a counter to track how many times the method is invoked
AtomicInteger callCount = new AtomicInteger(0);
// Mock the static method OSUtils.exeCmd that matches "kill -0" command
mockedOSUtils.when(() -> OSUtils.exeCmd(Mockito.matches(".*kill -0.*")))
.thenAnswer(invocation -> {
int count = callCount.incrementAndGet();
// these calls will succeed (simulate process is alive)
if (count == 1 || count == 2) {
return "";
} else {
throw new RuntimeException("Command failed");
}
});
// Mock the static method OSUtils.exeCmd that matches "kill -0" command
mockedOSUtils.when(() -> OSUtils.exeCmd(Mockito.matches(".*kill -0.*")))
.thenReturn("") // First invocation succeeds (process is alive)
.thenReturn("") // Second invocation succeeds (process is alive)
.thenThrow(new RuntimeException("Command failed")); // Subsequent invocations fail (process is no longer alive)

Copilot uses AI. Check for mistakes.
Comment on lines +191 to +192
// Remove if process is no longer alive
alivePidList.removeIf(pid -> !isProcessAlive(pid, tenantCode));
Copy link

Copilot AI Jul 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The while loop modifies alivePidList during iteration with removeIf(). While this is safe for ArrayList, it could be made more explicit by collecting results separately to avoid potential confusion.

Suggested change
// Remove if process is no longer alive
alivePidList.removeIf(pid -> !isProcessAlive(pid, tenantCode));
// Collect PIDs that are still alive
List<Integer> stillAlivePids = alivePidList.stream()
.filter(pid -> isProcessAlive(pid, tenantCode))
.collect(Collectors.toList());
alivePidList = stillAlivePids;

Copilot uses AI. Check for mistakes.
@njnu-seafish njnu-seafish requested a review from ruanwenjun July 18, 2025 02:48
@njnu-seafish
Copy link
Contributor Author

njnu-seafish commented Jul 18, 2025

Unit test results varied across different operating systems (e.g., Linux, macOS, Windows), primarily due to:

mockedOSUtils.when(() -> OSUtils.exeCmd(Mockito.matches(".*pstree.*12345"))).thenReturn("1234 12345");

Should be updated to:

mockedOSUtils.when(() -> OSUtils.exeCmd(Mockito.matches(".*pstree.*12345"))).thenReturn("sudo(12345)---86.sh(1234)");

@njnu-seafish
Copy link
Contributor Author

Quality Gate Passed Quality Gate passed

Issues 2 New issues 0 Accepted issues

Measures 0 Security Hotspots 91.4% Coverage on New Code 0.0% Duplication on New Code

See analysis details on SonarQube Cloud

I will remove the "throws Exception" in the test

@sonarqubecloud
Copy link

Copy link
Member

@ruanwenjun ruanwenjun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@SbloodyS SbloodyS merged commit e8f8a8c into apache:dev Jul 18, 2025
73 of 96 checks passed
eco8848 pushed a commit to eco8848/dolphinscheduler that referenced this pull request Aug 8, 2025
davidzollo pushed a commit to davidzollo/dolphinscheduler that referenced this pull request Oct 27, 2025
LeonYoah added a commit to LeonYoah/dolphinscheduler that referenced this pull request Nov 1, 2025
…hod (apache#17005)

   [Fix-17316][Task-API] Add check process status after killing task (apache#17320)
   [Fix-17436][Workflow]Task timeout kill throw exception (apache#17437  改造支持seatunnel集群模式停止任务
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend bug Something isn't working document e2e e2e test first time contributor First-time contributor kubernetes test

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] [Task-API] When the task is Kill, the task instance shows that the Kill is normal, but the task process is not Killed

4 participants