Remote trap to indicate live status of a command running irrespective of lab-sdk usage #1305
Conversation
Paragon Review Unavailable: Hi @deep1401! To enable Paragon reviews on this repository, please register at https://home.polarity.cc. Once registered, connect your GitHub account and Paragon will automatically review your pull requests.
Codecov Report: ❌ Patch coverage is …
What's a good way to test this?
@aliasaria How to test:
📝 Walkthrough

This PR introduces remote execution tracing and live status monitoring for distributed job execution. A new `tfl-remote-trap` wrapper reports a running command's live status ("started", "finished", or "crashed") back to the job database.

Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant API as Compute Provider API
    participant Wrapper as tfl-remote-trap
    participant Command as Remote Command
    participant DB as Job Database
    participant UI as Job Progress UI
    User->>API: Launch remote job
    API->>DB: Create Job with live_status=null
    API->>Wrapper: Execute wrapped command (via tfl-remote-trap -- ...)
    Wrapper->>DB: Update live_status="started"
    DB-->>UI: Status change detected
    UI->>UI: Render "Job started" subtitle
    Wrapper->>Command: Execute actual command
    Command-->>Wrapper: Returns exit code
    alt Command Success (exit 0)
        Wrapper->>DB: Update live_status="finished"
    else Command Failed (exit ≠ 0)
        Wrapper->>DB: Update live_status="crashed"
    end
    DB-->>UI: Status change detected
    UI->>UI: Render final status subtitle
    API->>DB: Check job status, handle crashed state
    API-->>User: Job result with final status
```
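The wrapper's control flow in the diagram above can be sketched as follows. This is a minimal sketch only: the real `tfl-remote-trap` lives in `lab-sdk/src/lab/remote_trap.py` and reports status through the SDK's job database calls, whereas `set_live_status` here is a hypothetical stand-in.

```python
import subprocess
import sys


def set_live_status(job_id: str, status: str) -> None:
    # Hypothetical stand-in for the SDK call that writes live_status
    # ("started" / "finished" / "crashed") to the job database.
    print(f"[job {job_id}] live_status={status}")


def run_with_trap(job_id: str, argv: list[str]) -> int:
    """Run argv, mirroring the trap's started/finished/crashed transitions."""
    set_live_status(job_id, "started")
    try:
        exit_code = subprocess.run(argv).returncode
    except Exception:
        # The command could not even be launched: report a crash.
        set_live_status(job_id, "crashed")
        raise
    set_live_status(job_id, "finished" if exit_code == 0 else "crashed")
    return exit_code


if __name__ == "__main__":
    # Usage mirrors `tfl-remote-trap -- <command...>`
    code = run_with_trap("job-1", [sys.executable, "-c", "print('hello')"])
    print("exit:", code)
```

The wrapper's exit code is the wrapped command's exit code, so the compute provider still sees the command's real success or failure.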
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
api/transformerlab/routers/compute_provider.py (2)
1157-1171: ⚠️ Potential issue | 🟡 Minor

Missing `tfl-remote-trap` wrapper for sweep child jobs.

The sweep child jobs are also REMOTE jobs, but the command is not wrapped with `tfl-remote-trap`, so sweep child jobs won't have live_status tracking. For consistency with the main launch path (lines 1533-1536), sweep jobs should also wrap their commands.
Proposed fix in _launch_sweep_jobs
Add wrapping logic before ClusterConfig creation around line 1155:
```diff
+        # Wrap command for remote providers (same as main launch path)
+        wrapped_command = command_with_secrets
+        if provider.type != ProviderType.LOCAL.value:
+            wrapped_command = f"tfl-remote-trap -- {command_with_secrets}"
+
         cluster_config = ClusterConfig(
             cluster_name=formatted_cluster_name,
             provider_name=provider_display_name,
             provider_id=provider.id,
-            command=command_with_secrets,
+            command=wrapped_command,
             setup=final_setup,
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/transformerlab/routers/compute_provider.py` around lines 1157 - 1171, In _launch_sweep_jobs, the sweep child job commands are not wrapped with the tfl-remote-trap wrapper like the main launch path, so update the logic before creating the ClusterConfig to wrap command_with_secrets (and/or final_setup if used) with the same wrapper used at the main launch (the tfl-remote-trap wrapping applied at lines ~1533–1536); ensure you construct wrapped_command = f"tfl-remote-trap {original_command}" (preserving any secret injection already in command_with_secrets) and pass wrapped_command into ClusterConfig (instead of command_with_secrets) so ClusterConfig.command carries the trap wrapper for live_status tracking in _launch_sweep_jobs.
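Since this review proposes the same `tfl-remote-trap` prefixing in several launch paths, one option is to centralize it in a small shared helper. This is a sketch only: `wrap_remote_command` is a hypothetical name, and `ProviderType` is assumed to match the enum referenced in the proposed diff.

```python
from enum import Enum


class ProviderType(Enum):
    # Assumed shape; the real enum lives in the API codebase.
    LOCAL = "local"
    REMOTE = "remote"


def wrap_remote_command(command: str, provider_type: str) -> str:
    """Prefix non-local commands with tfl-remote-trap so live_status is tracked."""
    if provider_type == ProviderType.LOCAL.value:
        return command
    return f"tfl-remote-trap -- {command}"
```

Each launch path (main launch, sweep children, checkpoint resume) would then pass `wrap_remote_command(command, provider.type)` into `ClusterConfig`, keeping the call sites consistent.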
2422-2436: ⚠️ Potential issue | 🟡 Minor

Missing `tfl-remote-trap` wrapper for resumed checkpoint jobs.

Similar to sweep jobs, resumed checkpoint jobs are REMOTE jobs but don't wrap the command with `tfl-remote-trap`. This will result in inconsistent live_status tracking.

Proposed fix in resume_from_checkpoint
Add wrapping logic before ClusterConfig creation around line 2420:
```diff
+        # Wrap command for remote providers
+        wrapped_command = command
+        if provider.type != ProviderType.LOCAL.value:
+            wrapped_command = f"tfl-remote-trap -- {command}"
+
         cluster_config = ClusterConfig(
             cluster_name=formatted_cluster_name,
             provider_name=provider_display_name,
             provider_id=provider.id,
-            command=command,
+            command=wrapped_command,
             setup=final_setup,
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/transformerlab/routers/compute_provider.py` around lines 2422 - 2436, Resume-from-checkpoint jobs are not wrapping the job command with the tfl-remote-trap wrapper, causing inconsistent live_status; in the resume_from_checkpoint flow before creating ClusterConfig (where ClusterConfig is constructed with variables like formatted_cluster_name, provider_display_name, provider.id, command, final_setup, env_vars, etc.) apply the same wrapping logic used for sweep jobs: detect the resumed checkpoint branch and prepend/wrap the existing command variable with the tfl-remote-trap wrapper (or call the existing helper that does this) so the command passed into ClusterConfig is the wrapped command.
🧹 Nitpick comments (3)
src/renderer/components/Experiment/Tasks/JobProgress.tsx (1)
111-141: Well-implemented live status display with clear state handling.The function correctly handles the three live status states with appropriate colors (neutral for started/finished, danger for crashed). The null check ensures graceful fallback when live_status is unavailable.
Consider adding `live_status` to the `JobData` interface for better type safety:

```diff
 interface JobData {
   start_time?: string;
   end_time?: string;
   completion_status?: string;
   completion_details?: string;
+  live_status?: 'started' | 'crashed' | 'finished';
   [key: string]: any;
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/renderer/components/Experiment/Tasks/JobProgress.tsx` around lines 111 - 141: Add the missing live_status property to the JobData type so TypeScript can type-check usages like job.job_data.live_status (used in renderLiveStatusSubtitle); update the JobData interface/typedef to include live_status with an appropriate union type (e.g., 'started' | 'crashed' | 'finished' | undefined/null) and adjust any places that construct or parse job.job_data to satisfy the new property type.

lab-sdk/src/lab/remote_trap.py (1)
30-43: Consider using `asyncio.get_running_loop()` for Python 3.10+ compatibility.

`asyncio.get_event_loop()` emits a DeprecationWarning when called without a running event loop in Python 3.10+. The current code works but may generate warnings.

Proposed modernization
```diff
     try:
         asyncio.run(_set_live_status_async(job_id, status))
     except RuntimeError:
-        # Fallback in case an event loop already exists.
         try:
-            loop = asyncio.get_event_loop()
-            if loop.is_running():
-                # In the unlikely case we're already in an event loop, schedule the task
-                # but don't wait on it (best-effort update).
-                loop.create_task(_set_live_status_async(job_id, status))
-            else:
-                loop.run_until_complete(_set_live_status_async(job_id, status))
+            loop = asyncio.get_running_loop()
+            # Already in an event loop, schedule the task but don't wait (best-effort).
+            loop.create_task(_set_live_status_async(job_id, status))
         except Exception:
             return
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@lab-sdk/src/lab/remote_trap.py` around lines 30 - 43: The current fallback uses asyncio.get_event_loop() which can emit DeprecationWarnings on Python 3.10+; update the fallback in the block that calls asyncio.run(_set_live_status_async(job_id, status)) to prefer asyncio.get_running_loop() and only call asyncio.get_event_loop() as a last resort: attempt to retrieve the running loop with asyncio.get_running_loop(), if it exists and is running schedule the task with loop.create_task(_set_live_status_async(job_id, status)), otherwise use loop.run_until_complete(_set_live_status_async(job_id, status)) (or if no running loop is available, create one safely); keep the outer try/except behavior and ensure exceptions still result in the existing silent return.

api/transformerlab/routers/compute_provider.py (1)
1673-1702: Crash detection logic is correct, but consider extracting to a service.The live_status crash handling correctly marks the job as FAILED with proper end_time recording. However, this adds business logic directly in the router.
Per coding guidelines, business logic should be placed in `api/transformerlab/services/` using the Service pattern. The existing code in this file follows a similar inline pattern, so this is consistent, but future refactoring could extract job status update logic to `job_service`.

Potential service extraction
Consider adding a helper in `job_service.py`:

```python
async def mark_job_as_crashed(job_id: str, experiment_id: str, session: AsyncSession) -> dict:
    """Mark a job as FAILED due to remote command crash."""
    end_time_str = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime())
    await job_update_job_data_insert_key_value(job_id, "end_time", end_time_str, experiment_id)
    await job_update_status(job_id, "FAILED", experiment_id=experiment_id, session=session)
    return {"status": "FAILED", "end_time": end_time_str}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/transformerlab/routers/compute_provider.py` around lines 1673 - 1702, Extract the inline crash-handling block into a new async helper in job_service (e.g., async def mark_job_as_crashed(job_id: str, experiment_id: str, session: AsyncSession) -> dict) that performs the end_time formatting, calls job_update_job_data_insert_key_value(job_id, "end_time", end_time_str, experiment_id) and job_update_status(job_id, "FAILED", experiment_id=experiment_id, session=session), commits the session and returns a standardized result (status, end_time); then replace the router's live_status == "crashed" branch to call job_service.mark_job_as_crashed(job_id, job.get("experiment_id"), session), handle exceptions similar to current code (logging the exception and returning the same error dict), and ensure the router returns the same success payload when the service reports success.
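As an illustration of the router branch this helper would replace, here is a simplified synchronous sketch. The names (`JOBS`, `handle_live_status`) and the in-memory store are hypothetical stand-ins for the async job service and database described in the comment above.

```python
import time

# In-memory stand-in for the job database, for illustration only.
JOBS: dict[str, dict] = {"job-1": {"status": "RUNNING", "job_data": {}}}


def mark_job_as_crashed(job_id: str) -> dict:
    """Simplified synchronous analogue of the proposed job_service helper."""
    end_time_str = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime())
    job = JOBS[job_id]
    job["job_data"]["end_time"] = end_time_str
    job["status"] = "FAILED"
    return {"status": "FAILED", "end_time": end_time_str}


def handle_live_status(job_id: str, live_status: str) -> dict:
    # Router branch: only the crashed state needs special handling;
    # other live_status values pass the current job status through.
    if live_status == "crashed":
        return mark_job_as_crashed(job_id)
    return {"status": JOBS[job_id]["status"]}
```

With the service extraction, the router's `live_status == "crashed"` branch reduces to a single call plus error handling, keeping the endpoint thin.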
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@api/pyproject.toml`:
- Line 37: The dependency "transformerlab==0.0.78" in the project and the
corresponding lab-sdk version must be released together; either (A) coordinate
and publish transformerlab 0.0.78 to PyPI and cut the matching lab-sdk release
so the existing "transformerlab==0.0.78" pin is valid, or (B) if you cannot
release both now, revert the dependency to the last published transformerlab
version (e.g., change "transformerlab==0.0.78" back to "transformerlab==0.0.77")
and update the lab-sdk reference accordingly, then raise a follow-up to bump
both to 0.0.78 once published.
---
Outside diff comments:
In `@api/transformerlab/routers/compute_provider.py`:
- Around line 1157-1171: In _launch_sweep_jobs, the sweep child job commands are
not wrapped with the tfl-remote-trap wrapper like the main launch path, so
update the logic before creating the ClusterConfig to wrap command_with_secrets
(and/or final_setup if used) with the same wrapper used at the main launch (the
tfl-remote-trap wrapping applied at lines ~1533–1536); ensure you construct
wrapped_command = f"tfl-remote-trap {original_command}" (preserving any secret
injection already in command_with_secrets) and pass wrapped_command into
ClusterConfig (instead of command_with_secrets) so ClusterConfig.command carries
the trap wrapper for live_status tracking in _launch_sweep_jobs.
- Around line 2422-2436: Resume-from-checkpoint jobs are not wrapping the job
command with the tfl-remote-trap wrapper, causing inconsistent live_status; in
the resume_from_checkpoint flow before creating ClusterConfig (where
ClusterConfig is constructed with variables like formatted_cluster_name,
provider_display_name, provider.id, command, final_setup, env_vars, etc.) apply
the same wrapping logic used for sweep jobs: detect the resumed checkpoint
branch and prepend/wrap the existing command variable with the tfl-remote-trap
wrapper (or call the existing helper that does this) so the command passed into
ClusterConfig is the wrapped command.
---
Nitpick comments:
In `@api/transformerlab/routers/compute_provider.py`:
- Around line 1673-1702: Extract the inline crash-handling block into a new
async helper in job_service (e.g., async def mark_job_as_crashed(job_id: str,
experiment_id: str, session: AsyncSession) -> dict) that performs the end_time
formatting, calls job_update_job_data_insert_key_value(job_id, "end_time",
end_time_str, experiment_id) and job_update_status(job_id, "FAILED",
experiment_id=experiment_id, session=session), commits the session and returns a
standardized result (status, end_time); then replace the router's live_status ==
"crashed" branch to call job_service.mark_job_as_crashed(job_id,
job.get("experiment_id"), session), handle exceptions similar to current code
(logging the exception and returning the same error dict), and ensure the router
returns the same success payload when the service reports success.
In `@lab-sdk/src/lab/remote_trap.py`:
- Around line 30-43: The current fallback uses asyncio.get_event_loop() which
can emit DeprecationWarnings on Python 3.10+; update the fallback in the block
that calls asyncio.run(_set_live_status_async(job_id, status)) to prefer
asyncio.get_running_loop() and only call asyncio.get_event_loop() as a last
resort: attempt to retrieve the running loop with asyncio.get_running_loop(), if
it exists and is running schedule the task with
loop.create_task(_set_live_status_async(job_id, status)), otherwise use
loop.run_until_complete(_set_live_status_async(job_id, status)) (or if no
running loop is available, create one safely); keep the outer try/except
behavior and ensure exceptions still result in the existing silent return.
In `@src/renderer/components/Experiment/Tasks/JobProgress.tsx`:
- Around line 111-141: Add the missing live_status property to the JobData type
so TypeScript can type-check usages like job.job_data.live_status (used in
renderLiveStatusSubtitle); update the JobData interface/typedef to include
live_status with an appropriate union type (e.g., 'started' | 'crashed' |
'finished' | undefined/null) and adjust any places that construct or parse
job.job_data to satisfy the new property type.
Summary by CodeRabbit
Release Notes
New Features
Chores