-
Notifications
You must be signed in to change notification settings - Fork 5k
[Fix-17316][Task-API] Add check process status after killing task #17320
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
SbloodyS
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've tested this issue #17316 base on your reproduction steps in macos and ubuntu 24.04 and found this issue does not exist, please describe the specific reproduction environment. @njnu-seafish
Sorry, I was very busy during the day and didn't have time to reply. [dolphinscheduler@xxxxx][~] [dolphinscheduler@xxxxx][~] [dolphinscheduler@xxxxx][~] [dolphinscheduler@xxxxx][~] [dolphinscheduler@xxxxx][~] // kill -s SIGINT return 0. However, the process was not actually killed. [dolphinscheduler@xxxxx][~] // wating 5m [dolphinscheduler@xxxxx][~] [dolphinscheduler@xxxxx][~] ` |
After fixing the bug according to the code above, the log of the task instance is as follows. ****************************** Script Content ***************************************************************** #!/bin/bash BASEDIR=$(cd cd $BASEDIR source /usr/local/dolphinscheduler/bin/env/dolphinscheduler_env.sh echo ${JAVE_HOME} sleep 10m ****************************** Script Content ***************************************************************** Executing shell command : sudo -u dolphinscheduler -i /data01/dolphinscheduler/exec/process/130/130.sh process start, process id is: 643464 Begin killing task instance, processId: 643464 prepare to parse pid, raw pid string: sudo(643464)---130.sh(643479)---sleep(643559) Kill command: kill -s SIGINT 643464 643479 643559, trying to terminate process Kill command: sudo -u dolphinscheduler -i kill -s SIGINT 643464 643479 643559, kill failed, the process: 643464 is still running Kill command: kill -s SIGTERM 643464 643479 643559, trying to terminate process Kill command: sudo -u dolphinscheduler -i kill -s SIGTERM 643464 643479 643559, kill succeeded Successfully killed process tree using SIGTERM, processId: 643464 Process tree for task: 130 is killed or already finished, pid: 643464 ` |
...r-task-api/src/main/java/org/apache/dolphinscheduler/plugin/task/api/utils/ProcessUtils.java
Outdated
Show resolved
Hide resolved
...eduler-e2e/dolphinscheduler-e2e-case/src/test/resources/docker/file-manage/common.properties
Outdated
Show resolved
Hide resolved
...r-task-api/src/main/java/org/apache/dolphinscheduler/plugin/task/api/utils/ProcessUtils.java
Show resolved
Hide resolved
...r-task-api/src/main/java/org/apache/dolphinscheduler/plugin/task/api/utils/ProcessUtils.java
Outdated
Show resolved
Hide resolved
...r-task-api/src/main/java/org/apache/dolphinscheduler/plugin/task/api/utils/ProcessUtils.java
Outdated
Show resolved
Hide resolved
...r-task-api/src/main/java/org/apache/dolphinscheduler/plugin/task/api/utils/ProcessUtils.java
Outdated
Show resolved
Hide resolved
...r-task-api/src/main/java/org/apache/dolphinscheduler/plugin/task/api/utils/ProcessUtils.java
Show resolved
Hide resolved
|
Please run |
...r-task-api/src/main/java/org/apache/dolphinscheduler/plugin/task/api/utils/ProcessUtils.java
Dismissed
Show dismissed
Hide dismissed
ruanwenjun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
| # This is the chart version. This version number should be incremented each time you make changes | ||
| # to the chart and its templates, including the app version. | ||
| version: 3.1.0 | ||
| version: 3.1.1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why change this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This pull request enhances the process termination mechanism in DolphinScheduler's Task API by adding proper process status verification after sending kill signals. The enhancement introduces a configurable timeout mechanism that waits for processes to terminate gracefully before escalating to more forceful signals.
- Adds process status verification using
kill -0to check if processes are actually terminated after sending kill signals - Introduces a configurable timeout property
shell.kill.wait.timeout(default 10 seconds) for waiting between signal escalation - Implements improved test coverage for the new process killing logic with different scenarios
Reviewed Changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| ProcessUtils.java | Core logic enhancement with process status checking and timeout-based waiting mechanism |
| ProcessUtilsTest.java | New test cases covering success scenarios with no alive PIDs, SIGINT success, and kill failure cases |
| Constants.java | Added new constant for shell kill wait timeout configuration |
| common.properties (multiple) | Configuration files updated with the new timeout property |
| values.yaml | Kubernetes deployment configuration updated with new timeout setting |
| README.md | Documentation updated to include new configuration parameter |
| Chart.yaml | Version bump from 3.1.0 to 3.1.1 |
| while (!alivePidList.isEmpty() && (System.currentTimeMillis() - startTime < timeoutMillis)) { | ||
| // Remove if process is no longer alive | ||
| alivePidList.removeIf(pid -> !isProcessAlive(pid, tenantCode)); | ||
| if (!alivePidList.isEmpty()) { | ||
| // Wait for a short interval before checking process statuses again, to avoid excessive CPU usage | ||
| // from tight-loop polling. | ||
| ThreadUtils.sleep(SLEEP_TIME_MILLIS); |
Copilot
AI
Jul 17, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The polling mechanism uses a fixed sleep interval which may not be optimal. Consider using exponential backoff or a configurable polling interval to balance responsiveness with CPU usage.
| while (!alivePidList.isEmpty() && (System.currentTimeMillis() - startTime < timeoutMillis)) { | |
| // Remove if process is no longer alive | |
| alivePidList.removeIf(pid -> !isProcessAlive(pid, tenantCode)); | |
| if (!alivePidList.isEmpty()) { | |
| // Wait for a short interval before checking process statuses again, to avoid excessive CPU usage | |
| // from tight-loop polling. | |
| ThreadUtils.sleep(SLEEP_TIME_MILLIS); | |
| long initialSleepMillis = 100; // Start with 100ms | |
| long maxSleepMillis = 2000; // Cap the sleep interval at 2000ms | |
| long currentSleepMillis = initialSleepMillis; | |
| while (!alivePidList.isEmpty() && (System.currentTimeMillis() - startTime < timeoutMillis)) { | |
| // Remove if process is no longer alive | |
| alivePidList.removeIf(pid -> !isProcessAlive(pid, tenantCode)); | |
| if (!alivePidList.isEmpty()) { | |
| // Wait for a dynamically adjusted interval before checking process statuses again | |
| ThreadUtils.sleep(currentSleepMillis); | |
| // Increase the sleep interval exponentially, up to the maximum limit | |
| currentSleepMillis = Math.min(currentSleepMillis * 2, maxSleepMillis); | |
| } else { | |
| // Reset the sleep interval if all processes are terminated | |
| currentSleepMillis = initialSleepMillis; |
| // Initialize a counter to track how many times the method is invoked | ||
| AtomicInteger callCount = new AtomicInteger(0); | ||
| // Mock the static method OSUtils.exeCmd that matches "kill -0" command | ||
| mockedOSUtils.when(() -> OSUtils.exeCmd(Mockito.matches(".*kill -0.*"))) | ||
| .thenAnswer(invocation -> { | ||
| int count = callCount.incrementAndGet(); | ||
| // these calls will succeed (simulate process is alive) | ||
| if (count == 1 || count == 2) { | ||
| return ""; | ||
| } else { | ||
| throw new RuntimeException("Command failed"); | ||
| } | ||
| }); |
Copilot
AI
Jul 17, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The AtomicInteger with hardcoded call count logic (count == 1 || count == 2) makes the test brittle and difficult to understand. Consider using a more explicit mock setup with predefined responses.
| // Initialize a counter to track how many times the method is invoked | |
| AtomicInteger callCount = new AtomicInteger(0); | |
| // Mock the static method OSUtils.exeCmd that matches "kill -0" command | |
| mockedOSUtils.when(() -> OSUtils.exeCmd(Mockito.matches(".*kill -0.*"))) | |
| .thenAnswer(invocation -> { | |
| int count = callCount.incrementAndGet(); | |
| // these calls will succeed (simulate process is alive) | |
| if (count == 1 || count == 2) { | |
| return ""; | |
| } else { | |
| throw new RuntimeException("Command failed"); | |
| } | |
| }); | |
| // Mock the static method OSUtils.exeCmd that matches "kill -0" command | |
| mockedOSUtils.when(() -> OSUtils.exeCmd(Mockito.matches(".*kill -0.*"))) | |
| .thenReturn("") // First invocation succeeds (process is alive) | |
| .thenReturn("") // Second invocation succeeds (process is alive) | |
| .thenThrow(new RuntimeException("Command failed")); // Subsequent invocations fail (process is no longer alive) |
| // Remove if process is no longer alive | ||
| alivePidList.removeIf(pid -> !isProcessAlive(pid, tenantCode)); |
Copilot
AI
Jul 17, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The while loop modifies alivePidList during iteration with removeIf(). While this is safe for ArrayList, it could be made more explicit by collecting results separately to avoid potential confusion.
| // Remove if process is no longer alive | |
| alivePidList.removeIf(pid -> !isProcessAlive(pid, tenantCode)); | |
| // Collect PIDs that are still alive | |
| List<Integer> stillAlivePids = alivePidList.stream() | |
| .filter(pid -> isProcessAlive(pid, tenantCode)) | |
| .collect(Collectors.toList()); | |
| alivePidList = stillAlivePids; |
|
Unit test results varied across different operating systems (e.g., Linux, macOS, Windows), primarily due to: mockedOSUtils.when(() -> OSUtils.exeCmd(Mockito.matches(".*pstree.*12345"))).thenReturn("1234 12345"); Should be updated to: mockedOSUtils.when(() -> OSUtils.exeCmd(Mockito.matches(".*pstree.*12345"))).thenReturn("sudo(12345)---86.sh(1234)"); |
I will remove the "throws Exception" in the test |
|
ruanwenjun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
…hod (apache#17005) [Fix-17316][Task-API] Add check process status after killing task (apache#17320) [Fix-17436][Workflow]Task timeout kill throw exception (apache#17437 改造支持seatunnel集群模式停止任务



Purpose of the pull request
close #17316
Brief change log
after kill the process, and check process status
Verify this pull request
This pull request is code cleanup without any test coverage.
(or)
This pull request is already covered by existing tests, such as (please describe tests).
(or)
This change added tests and can be verified as follows:
(or)
Pull Request Notice
Pull Request Notice
If your pull request contains incompatible change, you should also add it to
docs/docs/en/guide/upgrade/incompatible.md