Skip to content

Update job status to FAILED for a crash detected by tfl remote trap#1432

Merged
deep1401 merged 7 commits intomainfrom
fix/remote-crash-running-state
Mar 3, 2026
Merged

Update job status to FAILED for a crash detected by tfl remote trap#1432
deep1401 merged 7 commits intomainfrom
fix/remote-crash-running-state

Conversation

@deep1401
Copy link
Copy Markdown
Member

@deep1401 deep1401 commented Mar 3, 2026

Summary by CodeRabbit

  • Chores

    • Project version updated to 0.0.86
    • Updated transformerlab dependency to 0.0.86
  • Bug Fixes

    • Improved live status handling for crashed jobs so failures are mirrored to job status and marked as FAILED

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Mar 3, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8f1ccfd and 220ab14.

📒 Files selected for processing (2)
  • api/pyproject.toml
  • lab-sdk/pyproject.toml
🚧 Files skipped from review as they are similar to previous changes (2)
  • api/pyproject.toml
  • lab-sdk/pyproject.toml

📝 Walkthrough

Walkthrough

Version bumps: api and lab-sdk bumped from 0.0.85 → 0.0.86. Remote trap update: _set_live_status_async docstring expanded and when live_status == "crashed" it now also sets job status to "FAILED".

Changes

Cohort / File(s) Summary
Version files
api/pyproject.toml, lab-sdk/pyproject.toml
Project/dependency version incremented from 0.0.85 to 0.0.86.
Remote Trap Logic
lab-sdk/src/lab/remote_trap.py
Docstring expanded for _set_live_status_async and logic added to mark job status as "FAILED" when live_status is "crashed".

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

Suggested labels

mode:multiuser

Suggested reviewers

  • aliasaria
🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: updating job status to FAILED when a crash is detected by the remote trap, which is reflected in the lab-sdk/src/lab/remote_trap.py changes.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix/remote-crash-running-state

Comment @coderabbitai help to get the list of available commands and usage tips.

@paragon-review
Copy link
Copy Markdown

paragon-review bot commented Mar 3, 2026

Paragon Summary

This pull request review identified 1 issue across 1 category in 3 files. The review analyzed code changes, potential bugs, security vulnerabilities, performance issues, and code quality concerns using automated analysis tools.

This PR enhances crash detection by updating job status to FAILED when crashes are caught by the tfl remote trap mechanism, with updates to the SDK's remote_trap implementation and version bumps in both the API and SDK packages.

Key changes:

  • Updates job status to FAILED when crashes are detected by tfl remote trap
  • Modifies lab-sdk/src/lab/remote_trap.py for crash detection logic
  • Updates dependencies in both api/pyproject.toml and lab-sdk/pyproject.toml

Confidence score: 5/5

  • This PR has low risk with no critical or high-priority issues identified
  • Score reflects clean code review with only minor suggestions or no issues found
  • Code quality checks passed - safe to proceed with merge

3 files reviewed, 1 comment

Severity breakdown: Low: 1


Tip: @paragon-run <instructions> to chat with our agent or push fixes!

Dashboard

job = await Job.get(job_id)
if job is None:
return
await job.update_job_data_field("live_status", status)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Crash status writes can partially fail silently

Crash status writes can partially fail silently. Job may show live_status=crashed without FAILED status. Consider writing update_status before update_job_data_field.

View Details

Location: lab-sdk/src/lab/remote_trap.py (lines 18)

Analysis

Crash status writes can partially fail silently. Job may show live_status=crashed without FAILED sta

What fails If update_job_data_field succeeds but update_status('FAILED') throws, the blanket except silently returns — leaving live_status=crashed without a matching FAILED job status.
Result Job has live_status=crashed but top-level job status is not set to FAILED, creating inconsistent state.
Expected Both live_status and job status should reflect the crash consistently, or the more important status (FAILED) should be written first.
Impact Minor: Under the existing best-effort design this is acceptable, but the job could appear stuck in monitoring UIs that key on job status rather than live_status.
How to reproduce
1. Trigger a remote job crash (nonzero exit code)
2. Simulate update_status raising an exception (e.g., network or storage error)
3. Observe job state
AI Fix Prompt
Fix this issue: Crash status writes can partially fail silently. Job may show live_status=crashed without FAILED status. Consider writing update_status before update_job_data_field.

Location: lab-sdk/src/lab/remote_trap.py (lines 18)
Problem: If update_job_data_field succeeds but update_status('FAILED') throws, the blanket except silently returns — leaving live_status=crashed without a matching FAILED job status.
Current behavior: Job has live_status=crashed but top-level job status is not set to FAILED, creating inconsistent state.
Expected: Both live_status and job status should reflect the crash consistently, or the more important status (FAILED) should be written first.
Steps to reproduce: 1. Trigger a remote job crash (nonzero exit code)
2. Simulate update_status raising an exception (e.g., network or storage error)
3. Observe job state

Provide a code fix.


Tip: Reply with @paragon-run to automatically fix this issue

@sentry
Copy link
Copy Markdown

sentry bot commented Mar 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@deep1401 deep1401 merged commit 52c9dbd into main Mar 3, 2026
10 of 11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants