Remote trap to indicate live status of a command running irrespective of lab-sdk usage #1305
Conversation
Paragon Review Unavailable: Hi @deep1401! To enable Paragon reviews on this repository, please register at https://home.polarity.cc. Once registered, connect your GitHub account and Paragon will automatically review your pull requests.
Codecov Report: ❌ Patch coverage is …
What's a good way to test this?
@aliasaria How to test:
📝 Walkthrough

This PR introduces remote execution tracing and live status monitoring for distributed job execution. A new `tfl-remote-trap` wrapper reports a running command's live status ("started", "finished", or "crashed") back to the job database.

Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant API as Compute Provider API
    participant Wrapper as tfl-remote-trap
    participant Command as Remote Command
    participant DB as Job Database
    participant UI as Job Progress UI
    User->>API: Launch remote job
    API->>DB: Create Job with live_status=null
    API->>Wrapper: Execute wrapped command (via tfl-remote-trap -- ...)
    Wrapper->>DB: Update live_status="started"
    DB-->>UI: Status change detected
    UI->>UI: Render "Job started" subtitle
    Wrapper->>Command: Execute actual command
    Command-->>Wrapper: Returns exit code
    alt Command Success (exit 0)
        Wrapper->>DB: Update live_status="finished"
    else Command Failed (exit ≠ 0)
        Wrapper->>DB: Update live_status="crashed"
    end
    DB-->>UI: Status change detected
    UI->>UI: Render final status subtitle
    API->>DB: Check job status, handle crashed state
    API-->>User: Job result with final status
```
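The wrapper's control flow in the diagram above can be sketched as follows. This is a minimal sketch only: the real `tfl-remote-trap` lives in `lab-sdk/src/lab/remote_trap.py` and reports status through the SDK's job database calls, whereas `set_live_status` here is a hypothetical stand-in.

```python
import subprocess
import sys


def set_live_status(job_id: str, status: str) -> None:
    # Hypothetical stand-in for the SDK call that writes live_status
    # ("started" / "finished" / "crashed") to the job database.
    print(f"[job {job_id}] live_status={status}")


def run_with_trap(job_id: str, argv: list[str]) -> int:
    """Run argv, mirroring the trap's started/finished/crashed transitions."""
    set_live_status(job_id, "started")
    try:
        exit_code = subprocess.run(argv).returncode
    except Exception:
        # The command could not even be launched: report a crash.
        set_live_status(job_id, "crashed")
        raise
    set_live_status(job_id, "finished" if exit_code == 0 else "crashed")
    return exit_code


if __name__ == "__main__":
    # Usage mirrors `tfl-remote-trap -- <command...>`
    code = run_with_trap("job-1", [sys.executable, "-c", "print('hello')"])
    print("exit:", code)
```

The wrapper's exit code is the wrapped command's exit code, so the compute provider still sees the command's real success or failure.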
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
api/transformerlab/routers/compute_provider.py (2)
1157-1171: ⚠️ Potential issue | 🟡 Minor

Missing `tfl-remote-trap` wrapper for sweep child jobs.

The sweep child jobs are also REMOTE jobs, but the command is not wrapped with `tfl-remote-trap`, so sweep child jobs won't have live_status tracking. For consistency with the main launch path (lines 1533-1536), sweep jobs should also wrap their commands.
Proposed fix in _launch_sweep_jobs
Add wrapping logic before ClusterConfig creation around line 1155:
```diff
+        # Wrap command for remote providers (same as main launch path)
+        wrapped_command = command_with_secrets
+        if provider.type != ProviderType.LOCAL.value:
+            wrapped_command = f"tfl-remote-trap -- {command_with_secrets}"
+
         cluster_config = ClusterConfig(
             cluster_name=formatted_cluster_name,
             provider_name=provider_display_name,
             provider_id=provider.id,
-            command=command_with_secrets,
+            command=wrapped_command,
             setup=final_setup,
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/transformerlab/routers/compute_provider.py` around lines 1157 - 1171, In _launch_sweep_jobs, the sweep child job commands are not wrapped with the tfl-remote-trap wrapper like the main launch path, so update the logic before creating the ClusterConfig to wrap command_with_secrets (and/or final_setup if used) with the same wrapper used at the main launch (the tfl-remote-trap wrapping applied at lines ~1533–1536); ensure you construct wrapped_command = f"tfl-remote-trap {original_command}" (preserving any secret injection already in command_with_secrets) and pass wrapped_command into ClusterConfig (instead of command_with_secrets) so ClusterConfig.command carries the trap wrapper for live_status tracking in _launch_sweep_jobs.
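Since this review proposes the same `tfl-remote-trap` prefixing in several launch paths, one option is to centralize it in a small shared helper. This is a sketch only: `wrap_remote_command` is a hypothetical name, and `ProviderType` is assumed to match the enum referenced in the proposed diff.

```python
from enum import Enum


class ProviderType(Enum):
    # Assumed shape; the real enum lives in the API codebase.
    LOCAL = "local"
    REMOTE = "remote"


def wrap_remote_command(command: str, provider_type: str) -> str:
    """Prefix non-local commands with tfl-remote-trap so live_status is tracked."""
    if provider_type == ProviderType.LOCAL.value:
        return command
    return f"tfl-remote-trap -- {command}"
```

Each launch path (main launch, sweep children, checkpoint resume) would then pass `wrap_remote_command(command, provider.type)` into `ClusterConfig`, keeping the call sites consistent.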
2422-2436: ⚠️ Potential issue | 🟡 Minor

Missing `tfl-remote-trap` wrapper for resumed checkpoint jobs.

Similar to sweep jobs, resumed checkpoint jobs are REMOTE jobs but don't wrap the command with `tfl-remote-trap`. This will result in inconsistent live_status tracking.

Proposed fix in resume_from_checkpoint
Add wrapping logic before ClusterConfig creation around line 2420:
```diff
+        # Wrap command for remote providers
+        wrapped_command = command
+        if provider.type != ProviderType.LOCAL.value:
+            wrapped_command = f"tfl-remote-trap -- {command}"
+
         cluster_config = ClusterConfig(
             cluster_name=formatted_cluster_name,
             provider_name=provider_display_name,
             provider_id=provider.id,
-            command=command,
+            command=wrapped_command,
             setup=final_setup,
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/transformerlab/routers/compute_provider.py` around lines 2422 - 2436, Resume-from-checkpoint jobs are not wrapping the job command with the tfl-remote-trap wrapper, causing inconsistent live_status; in the resume_from_checkpoint flow before creating ClusterConfig (where ClusterConfig is constructed with variables like formatted_cluster_name, provider_display_name, provider.id, command, final_setup, env_vars, etc.) apply the same wrapping logic used for sweep jobs: detect the resumed checkpoint branch and prepend/wrap the existing command variable with the tfl-remote-trap wrapper (or call the existing helper that does this) so the command passed into ClusterConfig is the wrapped command.
🧹 Nitpick comments (3)
src/renderer/components/Experiment/Tasks/JobProgress.tsx (1)
111-141: Well-implemented live status display with clear state handling.The function correctly handles the three live status states with appropriate colors (neutral for started/finished, danger for crashed). The null check ensures graceful fallback when live_status is unavailable.
Consider adding `live_status` to the `JobData` interface for better type safety:

```diff
 interface JobData {
   start_time?: string;
   end_time?: string;
   completion_status?: string;
   completion_details?: string;
+  live_status?: 'started' | 'crashed' | 'finished';
   [key: string]: any;
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/renderer/components/Experiment/Tasks/JobProgress.tsx` around lines 111 - 141: Add the missing live_status property to the JobData type so TypeScript can type-check usages like job.job_data.live_status (used in renderLiveStatusSubtitle); update the JobData interface/typedef to include live_status with an appropriate union type (e.g., 'started' | 'crashed' | 'finished' | undefined/null) and adjust any places that construct or parse job.job_data to satisfy the new property type.

lab-sdk/src/lab/remote_trap.py (1)
30-43: Consider using `asyncio.get_running_loop()` for Python 3.10+ compatibility.

`asyncio.get_event_loop()` emits a DeprecationWarning when called without a running event loop in Python 3.10+. The current code works but may generate warnings.

Proposed modernization
```diff
     try:
         asyncio.run(_set_live_status_async(job_id, status))
     except RuntimeError:
-        # Fallback in case an event loop already exists.
         try:
-            loop = asyncio.get_event_loop()
-            if loop.is_running():
-                # In the unlikely case we're already in an event loop, schedule the task
-                # but don't wait on it (best-effort update).
-                loop.create_task(_set_live_status_async(job_id, status))
-            else:
-                loop.run_until_complete(_set_live_status_async(job_id, status))
+            loop = asyncio.get_running_loop()
+            # Already in an event loop, schedule the task but don't wait (best-effort).
+            loop.create_task(_set_live_status_async(job_id, status))
         except Exception:
             return
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@lab-sdk/src/lab/remote_trap.py` around lines 30 - 43: The current fallback uses asyncio.get_event_loop() which can emit DeprecationWarnings on Python 3.10+; update the fallback in the block that calls asyncio.run(_set_live_status_async(job_id, status)) to prefer asyncio.get_running_loop() and only call asyncio.get_event_loop() as a last resort: attempt to retrieve the running loop with asyncio.get_running_loop(), if it exists and is running schedule the task with loop.create_task(_set_live_status_async(job_id, status)), otherwise use loop.run_until_complete(_set_live_status_async(job_id, status)) (or if no running loop is available, create one safely); keep the outer try/except behavior and ensure exceptions still result in the existing silent return.

api/transformerlab/routers/compute_provider.py (1)
1673-1702: Crash detection logic is correct, but consider extracting to a service.The live_status crash handling correctly marks the job as FAILED with proper end_time recording. However, this adds business logic directly in the router.
Per coding guidelines, business logic should be placed in `api/transformerlab/services/` using the Service pattern. The existing code in this file follows a similar inline pattern, so this is consistent, but future refactoring could extract job status update logic to `job_service`.

Potential service extraction
Consider adding a helper in `job_service.py`:

```python
async def mark_job_as_crashed(job_id: str, experiment_id: str, session: AsyncSession) -> dict:
    """Mark a job as FAILED due to remote command crash."""
    end_time_str = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime())
    await job_update_job_data_insert_key_value(job_id, "end_time", end_time_str, experiment_id)
    await job_update_status(job_id, "FAILED", experiment_id=experiment_id, session=session)
    return {"status": "FAILED", "end_time": end_time_str}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/transformerlab/routers/compute_provider.py` around lines 1673 - 1702, Extract the inline crash-handling block into a new async helper in job_service (e.g., async def mark_job_as_crashed(job_id: str, experiment_id: str, session: AsyncSession) -> dict) that performs the end_time formatting, calls job_update_job_data_insert_key_value(job_id, "end_time", end_time_str, experiment_id) and job_update_status(job_id, "FAILED", experiment_id=experiment_id, session=session), commits the session and returns a standardized result (status, end_time); then replace the router's live_status == "crashed" branch to call job_service.mark_job_as_crashed(job_id, job.get("experiment_id"), session), handle exceptions similar to current code (logging the exception and returning the same error dict), and ensure the router returns the same success payload when the service reports success.
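As an illustration of the router branch this helper would replace, here is a simplified synchronous sketch. The names (`JOBS`, `handle_live_status`) and the in-memory store are hypothetical stand-ins for the async job service and database described in the comment above.

```python
import time

# In-memory stand-in for the job database, for illustration only.
JOBS: dict[str, dict] = {"job-1": {"status": "RUNNING", "job_data": {}}}


def mark_job_as_crashed(job_id: str) -> dict:
    """Simplified synchronous analogue of the proposed job_service helper."""
    end_time_str = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime())
    job = JOBS[job_id]
    job["job_data"]["end_time"] = end_time_str
    job["status"] = "FAILED"
    return {"status": "FAILED", "end_time": end_time_str}


def handle_live_status(job_id: str, live_status: str) -> dict:
    # Router branch: only the crashed state needs special handling;
    # other live_status values pass the current job status through.
    if live_status == "crashed":
        return mark_job_as_crashed(job_id)
    return {"status": JOBS[job_id]["status"]}
```

With the service extraction, the router's `live_status == "crashed"` branch reduces to a single call plus error handling, keeping the endpoint thin.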
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@api/pyproject.toml`:
- Line 37: The dependency "transformerlab==0.0.78" in the project and the
corresponding lab-sdk version must be released together; either (A) coordinate
and publish transformerlab 0.0.78 to PyPI and cut the matching lab-sdk release
so the existing "transformerlab==0.0.78" pin is valid, or (B) if you cannot
release both now, revert the dependency to the last published transformerlab
version (e.g., change "transformerlab==0.0.78" back to "transformerlab==0.0.77")
and update the lab-sdk reference accordingly, then raise a follow-up to bump
both to 0.0.78 once published.
---
Outside diff comments:
In `@api/transformerlab/routers/compute_provider.py`:
- Around line 1157-1171: In _launch_sweep_jobs, the sweep child job commands are
not wrapped with the tfl-remote-trap wrapper like the main launch path, so
update the logic before creating the ClusterConfig to wrap command_with_secrets
(and/or final_setup if used) with the same wrapper used at the main launch (the
tfl-remote-trap wrapping applied at lines ~1533–1536); ensure you construct
wrapped_command = f"tfl-remote-trap {original_command}" (preserving any secret
injection already in command_with_secrets) and pass wrapped_command into
ClusterConfig (instead of command_with_secrets) so ClusterConfig.command carries
the trap wrapper for live_status tracking in _launch_sweep_jobs.
- Around line 2422-2436: Resume-from-checkpoint jobs are not wrapping the job
command with the tfl-remote-trap wrapper, causing inconsistent live_status; in
the resume_from_checkpoint flow before creating ClusterConfig (where
ClusterConfig is constructed with variables like formatted_cluster_name,
provider_display_name, provider.id, command, final_setup, env_vars, etc.) apply
the same wrapping logic used for sweep jobs: detect the resumed checkpoint
branch and prepend/wrap the existing command variable with the tfl-remote-trap
wrapper (or call the existing helper that does this) so the command passed into
ClusterConfig is the wrapped command.
---
Nitpick comments:
In `@api/transformerlab/routers/compute_provider.py`:
- Around line 1673-1702: Extract the inline crash-handling block into a new
async helper in job_service (e.g., async def mark_job_as_crashed(job_id: str,
experiment_id: str, session: AsyncSession) -> dict) that performs the end_time
formatting, calls job_update_job_data_insert_key_value(job_id, "end_time",
end_time_str, experiment_id) and job_update_status(job_id, "FAILED",
experiment_id=experiment_id, session=session), commits the session and returns a
standardized result (status, end_time); then replace the router's live_status ==
"crashed" branch to call job_service.mark_job_as_crashed(job_id,
job.get("experiment_id"), session), handle exceptions similar to current code
(logging the exception and returning the same error dict), and ensure the router
returns the same success payload when the service reports success.
In `@lab-sdk/src/lab/remote_trap.py`:
- Around line 30-43: The current fallback uses asyncio.get_event_loop() which
can emit DeprecationWarnings on Python 3.10+; update the fallback in the block
that calls asyncio.run(_set_live_status_async(job_id, status)) to prefer
asyncio.get_running_loop() and only call asyncio.get_event_loop() as a last
resort: attempt to retrieve the running loop with asyncio.get_running_loop(), if
it exists and is running schedule the task with
loop.create_task(_set_live_status_async(job_id, status)), otherwise use
loop.run_until_complete(_set_live_status_async(job_id, status)) (or if no
running loop is available, create one safely); keep the outer try/except
behavior and ensure exceptions still result in the existing silent return.
In `@src/renderer/components/Experiment/Tasks/JobProgress.tsx`:
- Around line 111-141: Add the missing live_status property to the JobData type
so TypeScript can type-check usages like job.job_data.live_status (used in
renderLiveStatusSubtitle); update the JobData interface/typedef to include
live_status with an appropriate union type (e.g., 'started' | 'crashed' |
'finished' | undefined/null) and adjust any places that construct or parse
job.job_data to satisfy the new property type.
Summary by CodeRabbit
Release Notes
New Features
Chores