Fix conditions related to os kill in local cluster status so its not marked complete/down while running by deep1401 · Pull Request #1402 · transformerlab/transformerlab-app

deep1401 · 2026-02-25T20:39:51Z

Summary by CodeRabbit

Bug Fixes
- Improved cluster status detection to better identify zombie processes and prevent false "running" status reports.
- Changed default cluster status to "Unknown" when starting (instead of "Down") for more accurate startup state reporting.
Chores
- Removed verbose debug logging statements for cleaner console output.

…te/down while running

coderabbitai · 2026-02-25T20:40:14Z

📝 Walkthrough

Walkthrough

The changes improve process lifecycle management by adding zombie process detection in the local compute provider, changing default cluster status from DOWN to UNKNOWN when no pid file exists, and reducing debug logging noise across the codebase.

Changes

Cohort / File(s)	Summary
Process Lifecycle & Zombie Detection `api/transformerlab/compute_providers/local.py`	Introduces `_is_process_zombie(pid)` helper using psutil to detect defunct processes. Updates `get_cluster_status` to verify both process existence and zombie status via `os.kill(pid, 0)` before marking UP. Changes default status from DOWN to UNKNOWN when no pid file exists, with message indicating potential startup in progress.
Debug Logging Cleanup `api/transformerlab/compute_providers/local.py`, `api/transformerlab/services/local_provider_queue.py`	Removes multiple noisy DEBUG print statements around cluster launch, setup execution, and command execution. Reduces logging verbosity without altering functional behavior.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰✨ A zombie process hopped our way,
But our psutil friend saves the day!
Now we detect the undead with care,
And UNKNOWN states float in the air—
Cleaner logs and smarter status, hooray! 🎉

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly relates to the main changes in the pull request. The updates to get_cluster_status, zombie process detection, and improved status handling address the core issue of ensuring the cluster is not incorrectly marked as complete/down while running.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings (stacked PR)
📝 Generate docstrings (commit on current branch)

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch fix/os-kill-condition

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@api/transformerlab/compute_providers/local.py`:
- Around line 269-276: The current liveness check only tests PID existence
(os.kill(pid, 0) and _is_process_zombie) which is vulnerable to PID reuse;
update the check in the status path that returns ClusterStatus/ClusterState.UP
to validate process identity by comparing a persistent launch fingerprint (e.g.,
store launch metadata when starting the cluster such as pid and
create_time/session id) with the live process info (use
psutil.Process(pid).create_time() or equivalent) before returning UP; if the
create_time/session metadata does not match, treat the cluster as not running
and return an appropriate non-UP status. Ensure this validation is added
alongside the existing os.kill/_is_process_zombie checks and that the stored
launch fingerprint is written when the cluster is created.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4aebbb7 and 4cadb89.

📒 Files selected for processing (2)

api/transformerlab/compute_providers/local.py
api/transformerlab/services/local_provider_queue.py

💤 Files with no reviewable changes (1)

api/transformerlab/services/local_provider_queue.py

coderabbitai · 2026-02-25T20:43:25Z

api/transformerlab/compute_providers/local.py

+            os.kill(pid, 0)
+            if _is_process_zombie(pid):
+                raise ProcessLookupError("Process is zombie/defunct")
+            return ClusterStatus(
+                cluster_name=cluster_name,
+                state=ClusterState.UP,
+                status_message="Process running",
+            )


⚠️ Potential issue | 🟠 Major

PID reuse can still cause false UP status

On Line 269 and Line 272, liveness is inferred from PID existence. If the original process exits and that PID gets reused, an unrelated process can be reported as this cluster.
Please persist and validate process identity (e.g., launch-time fingerprint such as create time/session metadata) before returning ClusterState.UP.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@api/transformerlab/compute_providers/local.py` around lines 269 - 276, The current liveness check only tests PID existence (os.kill(pid, 0) and _is_process_zombie) which is vulnerable to PID reuse; update the check in the status path that returns ClusterStatus/ClusterState.UP to validate process identity by comparing a persistent launch fingerprint (e.g., store launch metadata when starting the cluster such as pid and create_time/session id) with the live process info (use psutil.Process(pid).create_time() or equivalent) before returning UP; if the create_time/session metadata does not match, treat the cluster as not running and return an appropriate non-UP status. Ensure this validation is added alongside the existing os.kill/_is_process_zombie checks and that the stored launch fingerprint is written when the cluster is created.

sentry · 2026-02-25T20:44:31Z

Codecov Report

❌ Patch coverage is 5.26316% with 18 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
api/transformerlab/compute_providers/local.py	5.26%	18 Missing ⚠️

📢 Thoughts on this report? Let us know!

fix conditions related to os kill in run job so its not marked comple…

4cadb89

…te/down while running

dadmobile approved these changes Feb 25, 2026

View reviewed changes

deep1401 merged commit 1395486 into main Feb 25, 2026
8 of 9 checks passed

coderabbitai bot reviewed Feb 25, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix conditions related to os kill in local cluster status so its not marked complete/down while running#1402

Fix conditions related to os kill in local cluster status so its not marked complete/down while running#1402
deep1401 merged 1 commit intomainfrom
fix/os-kill-condition

deep1401 commented Feb 25, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Feb 25, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Feb 25, 2026

Uh oh!

sentry bot commented Feb 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

deep1401 commented Feb 25, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Feb 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 25, 2026

Choose a reason for hiding this comment

Uh oh!

sentry bot commented Feb 25, 2026

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

deep1401 commented Feb 25, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Feb 25, 2026 •

edited

Loading