
Remote trap to indicate live status of a command running irrespective of lab-sdk usage#1305

Merged
deep1401 merged 12 commits into main from add/remote-trap on Feb 19, 2026

Conversation

@deep1401
Member

@deep1401 deep1401 commented Feb 6, 2026

To test:

  • Run a task, with or without lab-sdk usage
  • Install the lab-sdk version from this branch
  • Live status (including failures and when the command executes) is shown below Job Progress

Summary by CodeRabbit

Release Notes

  • New Features

    • Job progress now displays live status updates, indicating whether jobs are started, running, or have crashed.
    • Remote job execution includes automatic status tracking for improved visibility.
  • Chores

    • Bumped version to 0.0.78.

@paragon-review

paragon-review bot commented Feb 6, 2026

Paragon Review Unavailable

Hi @deep1401! To enable Paragon reviews on this repository, please register at https://home.polarity.cc

Once registered, connect your GitHub account and Paragon will automatically review your pull requests.

@sentry

sentry bot commented Feb 6, 2026

Codecov Report

❌ Patch coverage is 0% with 16 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
api/transformerlab/routers/compute_provider.py | 0.00% | 16 Missing ⚠️


@deep1401 deep1401 marked this pull request as ready for review February 6, 2026 14:10
@aliasaria
Member

What's a good way to test this?

@aliasaria
Member

@aliasaria to test

How to test:

  • Run a job and it should show extra progress data
  • Try to run a job that crashes and see if it captures it

@coderabbitai
Contributor

coderabbitai bot commented Feb 18, 2026

📝 Walkthrough

Walkthrough

This PR introduces remote execution tracing and live status monitoring for distributed job execution. A new tfl-remote-trap wrapper script is added to track remote command execution, update job status through started/finished/crashed states, and display live status in the UI. Remote providers now ensure the transformerlab SDK is installed and wrap commands with execution monitoring.

Changes

  • Version Updates and Script Entry (api/pyproject.toml, lab-sdk/pyproject.toml): Version bump from 0.0.77 to 0.0.78; new CLI script entry tfl-remote-trap added to lab-sdk.
  • Remote Execution Wrapper (lab-sdk/src/lab/remote_trap.py): New module implementing tfl-remote-trap: wraps remote commands, manages job status transitions (started→finished/crashed), updates Job.live_status via async database operations, and handles event loop fallback logic.
  • Remote Provider Setup and Execution (api/transformerlab/routers/compute_provider.py): Integrates remote command wrapping for non-local providers, appends a pip install step for the SDK dependency, wraps launch commands with tfl-remote-trap, extends the status checker with a crash detection path, and ensures consistent status tracking across job launch and sweep scenarios.
  • Live Status UI Display (src/renderer/components/Experiment/Tasks/JobProgress.tsx): Adds a renderLiveStatusSubtitle() helper to display job live_status (started, crashed, finished) as a Typography subtitle in the LAUNCHING/RUNNING job progress sections.
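The wrapper's behavior can be sketched as a minimal standalone script (a sketch only, not the actual remote_trap.py — the real module updates Job.live_status through async database operations rather than printing):

```python
import subprocess
import sys


def set_live_status(status: str) -> None:
    # Placeholder for the SDK's real async database update of Job.live_status.
    print(f"[remote-trap] live_status={status}", file=sys.stderr)


def final_status(exit_code: int) -> str:
    # Map the wrapped command's exit code to a live_status value.
    return "finished" if exit_code == 0 else "crashed"


def main() -> int:
    # Everything after "--" is the wrapped command, e.g.
    #   tfl-remote-trap -- python train.py --epochs 3
    try:
        sep = sys.argv.index("--")
    except ValueError:
        print("usage: tfl-remote-trap -- <command> [args...]", file=sys.stderr)
        return 2
    command = sys.argv[sep + 1:]
    set_live_status("started")
    try:
        exit_code = subprocess.run(command).returncode
    except Exception:
        set_live_status("crashed")
        raise
    set_live_status(final_status(exit_code))
    return exit_code


if __name__ == "__main__":
    sys.exit(main())
```

The wrapper is transparent to the job itself: it forwards the exit code, so provider-side status checks keep working unchanged.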

Sequence Diagram

sequenceDiagram
    participant User
    participant API as Compute Provider API
    participant Wrapper as tfl-remote-trap
    participant Command as Remote Command
    participant DB as Job Database
    participant UI as Job Progress UI

    User->>API: Launch remote job
    API->>DB: Create Job with live_status=null
    API->>Wrapper: Execute wrapped command (via tfl-remote-trap -- ...)
    
    Wrapper->>DB: Update live_status="started"
    DB-->>UI: Status change detected
    UI->>UI: Render "Job started" subtitle
    
    Wrapper->>Command: Execute actual command
    Command-->>Wrapper: Returns exit code
    
    alt Command Success (exit 0)
        Wrapper->>DB: Update live_status="finished"
    else Command Failed (exit ≠ 0)
        Wrapper->>DB: Update live_status="crashed"
    end
    
    DB-->>UI: Status change detected
    UI->>UI: Render final status subtitle
    
    API->>DB: Check job status, handle crashed state
    API-->>User: Job result with final status

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 Hops of joy across the wire,
Remote commands climb ever higher,
With trap-wrapped traces and status gleams,
Tracking jobs through digital streams,
From start to finish, no crashes unseen! 🎯

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check — ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: The title accurately describes the main change: introducing a remote trap mechanism to track the live status of remotely running commands, independent of lab-sdk usage. This aligns with the substantial changes across multiple files, including the new remote_trap.py module and its integration into the compute provider flow.
  • Docstring Coverage — ✅ Passed: No functions found in the changed files to evaluate docstring coverage; skipping the docstring coverage check.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
api/transformerlab/routers/compute_provider.py (2)

1157-1171: ⚠️ Potential issue | 🟡 Minor

Missing tfl-remote-trap wrapper for sweep child jobs.

The sweep child jobs are also REMOTE jobs but the command is not wrapped with tfl-remote-trap. This means sweep child jobs won't have live_status tracking.

For consistency with the main launch path (lines 1533-1536), sweep jobs should also wrap their commands.

Proposed fix in _launch_sweep_jobs

Add wrapping logic before ClusterConfig creation around line 1155:

+                # Wrap command for remote providers (same as main launch path)
+                wrapped_command = command_with_secrets
+                if provider.type != ProviderType.LOCAL.value:
+                    wrapped_command = f"tfl-remote-trap -- {command_with_secrets}"
+
                 cluster_config = ClusterConfig(
                     cluster_name=formatted_cluster_name,
                     provider_name=provider_display_name,
                     provider_id=provider.id,
-                    command=command_with_secrets,
+                    command=wrapped_command,
                     setup=final_setup,
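Since the same wrapping decision now appears in several launch paths (main launch, sweeps, checkpoint resume), a tiny shared helper would keep them consistent. The name wrap_for_remote is hypothetical; the provider-type check mirrors the diff above:

```python
def wrap_for_remote(command: str, provider_type: str, local_type: str = "local") -> str:
    """Prepend the tfl-remote-trap wrapper for non-local providers.

    Local providers run the command as-is; remote providers get live_status
    tracking via the wrapper.
    """
    if provider_type == local_type:
        return command
    return f"tfl-remote-trap -- {command}"
```

Each launch path would then pass wrap_for_remote(command_with_secrets, provider.type) into ClusterConfig instead of duplicating the conditional.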
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/transformerlab/routers/compute_provider.py` around lines 1157 - 1171, In
_launch_sweep_jobs, the sweep child job commands are not wrapped with the
tfl-remote-trap wrapper like the main launch path, so update the logic before
creating the ClusterConfig to wrap command_with_secrets (and/or final_setup if
used) with the same wrapper used at the main launch (the tfl-remote-trap
wrapping applied at lines ~1533–1536); ensure you construct wrapped_command =
f"tfl-remote-trap {original_command}" (preserving any secret injection already
in command_with_secrets) and pass wrapped_command into ClusterConfig (instead of
command_with_secrets) so ClusterConfig.command carries the trap wrapper for
live_status tracking in _launch_sweep_jobs.

2422-2436: ⚠️ Potential issue | 🟡 Minor

Missing tfl-remote-trap wrapper for resumed checkpoint jobs.

Similar to sweep jobs, resumed checkpoint jobs are REMOTE jobs but don't wrap the command with tfl-remote-trap. This will result in inconsistent live_status tracking.

Proposed fix in resume_from_checkpoint

Add wrapping logic before ClusterConfig creation around line 2420:

+    # Wrap command for remote providers
+    wrapped_command = command
+    if provider.type != ProviderType.LOCAL.value:
+        wrapped_command = f"tfl-remote-trap -- {command}"
+
     cluster_config = ClusterConfig(
         cluster_name=formatted_cluster_name,
         provider_name=provider_display_name,
         provider_id=provider.id,
-        command=command,
+        command=wrapped_command,
         setup=final_setup,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/transformerlab/routers/compute_provider.py` around lines 2422 - 2436,
Resume-from-checkpoint jobs are not wrapping the job command with the
tfl-remote-trap wrapper, causing inconsistent live_status; in the
resume_from_checkpoint flow before creating ClusterConfig (where ClusterConfig
is constructed with variables like formatted_cluster_name,
provider_display_name, provider.id, command, final_setup, env_vars, etc.) apply
the same wrapping logic used for sweep jobs: detect the resumed checkpoint
branch and prepend/wrap the existing command variable with the tfl-remote-trap
wrapper (or call the existing helper that does this) so the command passed into
ClusterConfig is the wrapped command.
🧹 Nitpick comments (3)
src/renderer/components/Experiment/Tasks/JobProgress.tsx (1)

111-141: Well-implemented live status display with clear state handling.

The function correctly handles the three live status states with appropriate colors (neutral for started/finished, danger for crashed). The null check ensures graceful fallback when live_status is unavailable.

Consider adding live_status to the JobData interface for better type safety:

 interface JobData {
   start_time?: string;
   end_time?: string;
   completion_status?: string;
   completion_details?: string;
+  live_status?: 'started' | 'crashed' | 'finished';
   [key: string]: any;
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/renderer/components/Experiment/Tasks/JobProgress.tsx` around lines 111 -
141, Add the missing live_status property to the JobData type so TypeScript can
type-check usages like job.job_data.live_status (used in
renderLiveStatusSubtitle); update the JobData interface/typedef to include
live_status with an appropriate union type (e.g., 'started' | 'crashed' |
'finished' | undefined/null) and adjust any places that construct or parse
job.job_data to satisfy the new property type.
lab-sdk/src/lab/remote_trap.py (1)

30-43: Consider using asyncio.get_running_loop() for Python 3.10+ compatibility.

asyncio.get_event_loop() emits a DeprecationWarning when called without a running event loop in Python 3.10+. The current code works but may generate warnings.

Proposed modernization
     try:
         asyncio.run(_set_live_status_async(job_id, status))
     except RuntimeError:
-        # Fallback in case an event loop already exists.
         try:
-            loop = asyncio.get_event_loop()
-            if loop.is_running():
-                # In the unlikely case we're already in an event loop, schedule the task
-                # but don't wait on it (best-effort update).
+            loop = asyncio.get_running_loop()
+            # Already in an event loop, schedule the task but don't wait (best-effort).
-                loop.create_task(_set_live_status_async(job_id, status))
-            else:
-                loop.run_until_complete(_set_live_status_async(job_id, status))
+            loop.create_task(_set_live_status_async(job_id, status))
         except Exception:
             return
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lab-sdk/src/lab/remote_trap.py` around lines 30 - 43, The current fallback
uses asyncio.get_event_loop() which can emit DeprecationWarnings on Python
3.10+; update the fallback in the block that calls
asyncio.run(_set_live_status_async(job_id, status)) to prefer
asyncio.get_running_loop() and only call asyncio.get_event_loop() as a last
resort: attempt to retrieve the running loop with asyncio.get_running_loop(), if
it exists and is running schedule the task with
loop.create_task(_set_live_status_async(job_id, status)), otherwise use
loop.run_until_complete(_set_live_status_async(job_id, status)) (or if no
running loop is available, create one safely); keep the outer try/except
behavior and ensure exceptions still result in the existing silent return.
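Put together, the suggested fallback reads roughly as follows (a sketch; _set_live_status_async stands in for the SDK's real coroutine):

```python
import asyncio


async def _set_live_status_async(job_id: str, status: str) -> None:
    # Stand-in for the SDK's real async database update.
    await asyncio.sleep(0)


def set_live_status(job_id: str, status: str) -> None:
    """Best-effort status update that works with or without a running loop."""
    try:
        # Normal case: no loop is running, so asyncio.run creates one.
        asyncio.run(_set_live_status_async(job_id, status))
    except RuntimeError:
        # asyncio.run() refuses to nest inside an already-running loop.
        try:
            loop = asyncio.get_running_loop()
        except RuntimeError:
            return  # No usable loop at all: give up silently (best-effort).
        # Already inside a loop: schedule the update without awaiting it.
        loop.create_task(_set_live_status_async(job_id, status))
```

This avoids asyncio.get_event_loop() entirely, so no DeprecationWarning is emitted on Python 3.10+.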
api/transformerlab/routers/compute_provider.py (1)

1673-1702: Crash detection logic is correct, but consider extracting to a service.

The live_status crash handling correctly marks the job as FAILED with proper end_time recording. However, this adds business logic directly in the router.

Per coding guidelines, business logic should be placed in api/transformerlab/services/ using the Service pattern. The existing code in this file follows a similar inline pattern, so this is consistent, but future refactoring could extract job status update logic to job_service.

Potential service extraction

Consider adding a helper in job_service.py:

async def mark_job_as_crashed(job_id: str, experiment_id: str, session: AsyncSession) -> dict:
    """Mark a job as FAILED due to remote command crash."""
    end_time_str = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime())
    await job_update_job_data_insert_key_value(job_id, "end_time", end_time_str, experiment_id)
    await job_update_status(job_id, "FAILED", experiment_id=experiment_id, session=session)
    return {"status": "FAILED", "end_time": end_time_str}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/transformerlab/routers/compute_provider.py` around lines 1673 - 1702,
Extract the inline crash-handling block into a new async helper in job_service
(e.g., async def mark_job_as_crashed(job_id: str, experiment_id: str, session:
AsyncSession) -> dict) that performs the end_time formatting, calls
job_update_job_data_insert_key_value(job_id, "end_time", end_time_str,
experiment_id) and job_update_status(job_id, "FAILED",
experiment_id=experiment_id, session=session), commits the session and returns a
standardized result (status, end_time); then replace the router's live_status ==
"crashed" branch to call job_service.mark_job_as_crashed(job_id,
job.get("experiment_id"), session), handle exceptions similar to current code
(logging the exception and returning the same error dict), and ensure the router
returns the same success payload when the service reports success.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@api/pyproject.toml`:
- Line 37: The dependency "transformerlab==0.0.78" in the project and the
corresponding lab-sdk version must be released together; either (A) coordinate
and publish transformerlab 0.0.78 to PyPI and cut the matching lab-sdk release
so the existing "transformerlab==0.0.78" pin is valid, or (B) if you cannot
release both now, revert the dependency to the last published transformerlab
version (e.g., change "transformerlab==0.0.78" back to "transformerlab==0.0.77")
and update the lab-sdk reference accordingly, then raise a follow-up to bump
both to 0.0.78 once published.


@deep1401 deep1401 merged commit 97879c0 into main Feb 19, 2026
11 checks passed
