Skip to content

Fix conditions related to os kill in local cluster status so its not marked complete/down while running#1402

Merged
deep1401 merged 1 commit intomainfrom
fix/os-kill-condition
Feb 25, 2026
Merged

Fix conditions related to os kill in local cluster status so its not marked complete/down while running#1402
deep1401 merged 1 commit intomainfrom
fix/os-kill-condition

Conversation

@deep1401
Copy link
Copy Markdown
Member

@deep1401 deep1401 commented Feb 25, 2026

Summary by CodeRabbit

  • Bug Fixes

    • Improved cluster status detection to better identify zombie processes and prevent false "running" status reports.
    • Changed default cluster status to "Unknown" when starting (instead of "Down") for more accurate startup state reporting.
  • Chores

    • Removed verbose debug logging statements for cleaner console output.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Feb 25, 2026

📝 Walkthrough

Walkthrough

The changes improve process lifecycle management by adding zombie process detection in the local compute provider, changing default cluster status from DOWN to UNKNOWN when no pid file exists, and reducing debug logging noise across the codebase.

Changes

Cohort / File(s) Summary
Process Lifecycle & Zombie Detection
api/transformerlab/compute_providers/local.py
Introduces _is_process_zombie(pid) helper using psutil to detect defunct processes. Updates get_cluster_status to verify both process existence and zombie status via os.kill(pid, 0) before marking UP. Changes default status from DOWN to UNKNOWN when no pid file exists, with message indicating potential startup in progress.
Debug Logging Cleanup
api/transformerlab/compute_providers/local.py, api/transformerlab/services/local_provider_queue.py
Removes multiple noisy DEBUG print statements around cluster launch, setup execution, and command execution. Reduces logging verbosity without altering functional behavior.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰✨ A zombie process hopped our way,
But our psutil friend saves the day!
Now we detect the undead with care,
And UNKNOWN states float in the air—
Cleaner logs and smarter status, hooray! 🎉

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly relates to the main changes in the pull request. The updates to get_cluster_status, zombie process detection, and improved status handling address the core issue of ensuring the cluster is not incorrectly marked as complete/down while running.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix/os-kill-condition

Comment @coderabbitai help to get the list of available commands and usage tips.

@deep1401 deep1401 merged commit 1395486 into main Feb 25, 2026
8 of 9 checks passed
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@api/transformerlab/compute_providers/local.py`:
- Around line 269-276: The current liveness check only tests PID existence
(os.kill(pid, 0) and _is_process_zombie) which is vulnerable to PID reuse;
update the check in the status path that returns ClusterStatus/ClusterState.UP
to validate process identity by comparing a persistent launch fingerprint (e.g.,
store launch metadata when starting the cluster such as pid and
create_time/session id) with the live process info (use
psutil.Process(pid).create_time() or equivalent) before returning UP; if the
create_time/session metadata does not match, treat the cluster as not running
and return an appropriate non-UP status. Ensure this validation is added
alongside the existing os.kill/_is_process_zombie checks and that the stored
launch fingerprint is written when the cluster is created.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4aebbb7 and 4cadb89.

📒 Files selected for processing (2)
  • api/transformerlab/compute_providers/local.py
  • api/transformerlab/services/local_provider_queue.py
💤 Files with no reviewable changes (1)
  • api/transformerlab/services/local_provider_queue.py

Comment on lines +269 to +276
os.kill(pid, 0)
if _is_process_zombie(pid):
raise ProcessLookupError("Process is zombie/defunct")
return ClusterStatus(
cluster_name=cluster_name,
state=ClusterState.UP,
status_message="Process running",
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

PID reuse can still cause false UP status

On Line 269 and Line 272, liveness is inferred from PID existence. If the original process exits and that PID gets reused, an unrelated process can be reported as this cluster.
Please persist and validate process identity (e.g., launch-time fingerprint such as create time/session metadata) before returning ClusterState.UP.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/transformerlab/compute_providers/local.py` around lines 269 - 276, The
current liveness check only tests PID existence (os.kill(pid, 0) and
_is_process_zombie) which is vulnerable to PID reuse; update the check in the
status path that returns ClusterStatus/ClusterState.UP to validate process
identity by comparing a persistent launch fingerprint (e.g., store launch
metadata when starting the cluster such as pid and create_time/session id) with
the live process info (use psutil.Process(pid).create_time() or equivalent)
before returning UP; if the create_time/session metadata does not match, treat
the cluster as not running and return an appropriate non-UP status. Ensure this
validation is added alongside the existing os.kill/_is_process_zombie checks and
that the stored launch fingerprint is written when the cluster is created.

@sentry
Copy link
Copy Markdown

sentry bot commented Feb 25, 2026

Codecov Report

❌ Patch coverage is 5.26316% with 18 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
api/transformerlab/compute_providers/local.py 5.26% 18 Missing ⚠️

📢 Thoughts on this report? Let us know!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants