Fix job not found errors when running with local providers #1397
Conversation
📝 Walkthrough

Introduces team-scoped context propagation across the local provider cluster launch pipeline. The changes add per-job workspace HOME setup, refactor workspace directory injection to eliminate organization-context mutation, add `team_id` parameter forwarding through the sweep and enqueue paths, and enhance the local launch worker with organization context management, provider validation, and critical-section execution.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Router as Compute Provider Router
    participant Queue as Local Provider Queue
    participant Worker as Queue Worker
    participant OrgCtx as Organization Context
    participant Provider as LocalProvider
    Router->>Queue: enqueue_local_launch(..., team_id)
    Queue->>Queue: Create LocalLaunchWorkItem(team_id)
    Worker->>OrgCtx: set_current_org_id(item.team_id)
    Worker->>OrgCtx: lab_dirs.set_organization_id(item.team_id)
    Worker->>Provider: Fetch provider by ID
    alt Provider Exists
        Worker->>Provider: launch_cluster (within critical section)
        Provider-->>Worker: launch_result
        Worker->>Worker: Store provider_launch_result & orchestrator_request_id
    else Provider Missing
        Worker->>Worker: Mark job as FAILED
        Worker->>Worker: Release quota hold if present
    end
    Worker->>OrgCtx: Clear organization context (finally block)
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Codecov Report: ❌ Patch coverage is
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
api/transformerlab/routers/compute_provider.py (2)
2497-2505: ⚠️ Potential issue | 🟠 Major

`resume_from_checkpoint` inconsistently mutates org context and bypasses the serialization queue for the LOCAL provider. Two issues:

1. This block still uses the old `set_organization_id(team_id)`/`try`/`finally`/`set_organization_id(None)` pattern that was deliberately removed from `launch_template_on_provider` (Lines 1508-1515). Since the middleware has already set org context for this request, the mutation is unnecessary and inconsistent.
2. For the LOCAL provider, `resume_from_checkpoint` calls `provider_instance.launch_cluster` directly via `asyncio.to_thread` (Line 2584) instead of going through `enqueue_local_launch`. This bypasses `_worker_lock` and the serialization queue, potentially running multiple local launches concurrently, contradicting the queue's invariant.

♻️ Proposed fix: align with the `launch_template_on_provider` pattern
```diff
-    # For local provider, set TFL_WORKSPACE_DIR so the lab SDK in the subprocess finds the job dir
-    if provider.type == ProviderType.LOCAL.value and team_id:
-        set_organization_id(team_id)
-        try:
-            workspace_dir = await get_workspace_dir()
-            if workspace_dir and not storage.is_remote_path(workspace_dir):
-                env_vars["TFL_WORKSPACE_DIR"] = workspace_dir
-        finally:
-            set_organization_id(None)
+    # For local provider, set TFL_WORKSPACE_DIR so the lab SDK in the subprocess finds the job dir.
+    # Org context is already set by authentication middleware.
+    if provider.type == ProviderType.LOCAL.value and team_id:
+        workspace_dir = await get_workspace_dir()
+        if workspace_dir and not storage.is_remote_path(workspace_dir):
+            env_vars["TFL_WORKSPACE_DIR"] = workspace_dir
```

For the queue bypass, route through `enqueue_local_launch` for the LOCAL provider (similar to `launch_template_on_provider` Lines 1657-1679) instead of calling `launch_cluster` directly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/transformerlab/routers/compute_provider.py` around lines 2497 - 2505, The resume_from_checkpoint path mutates org context with set_organization_id(team_id)/try/finally and bypasses the local-launch serialization by calling provider_instance.launch_cluster via asyncio.to_thread; remove the unnecessary set_organization_id(...) calls (the request middleware already sets org context) so resume_from_checkpoint matches launch_template_on_provider, and for ProviderType.LOCAL.value change the direct launch to use the existing enqueue_local_launch flow (instead of asyncio.to_thread(provider_instance.launch_cluster...)) so the call is serialized under _worker_lock/serialization queue just like launch_template_on_provider; update any surrounding logic to pass the same arguments used by enqueue_local_launch.
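The serialization invariant the reviewer cites can be illustrated with a minimal asyncio sketch. The names `enqueue_local_launch` and the worker lock mirror the review's terminology, but the implementation below is hypothetical, not the project's actual code:

```python
import asyncio

launch_order = []

async def enqueue_local_launch(lock: asyncio.Lock, job_id: str) -> None:
    # Only one launch proceeds at a time; the rest queue on the lock.
    # Bypassing this (e.g. via asyncio.to_thread on the provider directly)
    # is exactly what lets local launches run concurrently.
    async with lock:
        launch_order.append(job_id)
        await asyncio.sleep(0)  # stand-in for the real launch work

async def main() -> None:
    worker_lock = asyncio.Lock()
    await asyncio.gather(*(enqueue_local_launch(worker_lock, f"job-{i}") for i in range(3)))

asyncio.run(main())
print(launch_order)
```

Because every caller awaits the same lock, the critical sections never interleave, which is the invariant the review says `resume_from_checkpoint` currently violates.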
1069-1076: ⚠️ Potential issue | 🟠 Major

`_launch_sweep_jobs` redundantly re-sets org context, and LOCAL sweep child jobs bypass the serialization queue.

`set_current_org_id(team_id)` and `lab_set_org_id(team_id)` are already called at Lines 979-981 at the top of `_launch_sweep_jobs`. The `set_organization_id(team_id)` block at Line 1070 redundantly re-sets the same value and then immediately clears it in `finally`, transiently resetting the context to `None` mid-loop if an exception is thrown in `get_workspace_dir`.

More critically, LOCAL provider sweep child jobs call `provider_instance.launch_cluster` directly via `asyncio.to_thread` (Line 1209) rather than `enqueue_local_launch`, bypassing `_worker_lock` and allowing concurrent local launches that the queue was introduced to serialize.

♻️ Proposed fix
```diff
-    if provider.type == ProviderType.LOCAL.value and team_id:
-        set_organization_id(team_id)
-        try:
-            workspace_dir = await get_workspace_dir()
-            if workspace_dir and not storage.is_remote_path(workspace_dir):
-                env_vars["TFL_WORKSPACE_DIR"] = workspace_dir
-        finally:
-            set_organization_id(None)
+    if provider.type == ProviderType.LOCAL.value and team_id:
+        workspace_dir = await get_workspace_dir()
+        if workspace_dir and not storage.is_remote_path(workspace_dir):
+            env_vars["TFL_WORKSPACE_DIR"] = workspace_dir
```

For the queue bypass, LOCAL provider child sweep jobs should be routed through `enqueue_local_launch` to respect the serialization invariant.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/transformerlab/routers/compute_provider.py` around lines 1069 - 1076, Remove the redundant org-context toggling in _launch_sweep_jobs by deleting the set_organization_id(team_id)/finally block around get_workspace_dir (since set_current_org_id(team_id) and lab_set_org_id(team_id) are already applied at the top); instead, ensure get_workspace_dir reads the already-set context. Also fix the LOCAL provider child job path so it uses enqueue_local_launch rather than calling provider_instance.launch_cluster via asyncio.to_thread (which bypasses _worker_lock); replace the direct asyncio.to_thread(provider_instance.launch_cluster, ...) call with a call that enqueues the launch through enqueue_local_launch so local launches are serialized by _worker_lock.

api/transformerlab/compute_providers/local.py (2)
244-259: ⚠️ Potential issue | 🔴 Critical

Pre-existing logic inversion in `get_cluster_status` defeats the PR's fix.

`os.kill(pid, 0)` returns `None` on success (process alive) and raises an exception when the process is gone. The current branch `if os_killed is not None` is therefore always `False`, making `ClusterState.UP` dead code and every running local job appear as `DOWN`. This causes check-status to immediately mark all local jobs as `COMPLETE` regardless of actual process state.

🐛 Proposed fix
```diff
-        try:
-            pid = int(pid_file.read_text().strip())
-            os_killed = os.kill(pid, 0)
-            # Return up only if the process is not running
-            if os_killed is not None:
-                return ClusterStatus(
-                    cluster_name=cluster_name,
-                    state=ClusterState.UP,
-                    status_message="Process running",
-                )
-            else:
-                return ClusterStatus(
-                    cluster_name=cluster_name,
-                    state=ClusterState.DOWN,
-                    status_message="Process not running",
-                )
+        try:
+            pid = int(pid_file.read_text().strip())
+            os.kill(pid, 0)  # raises if process is gone
+            return ClusterStatus(
+                cluster_name=cluster_name,
+                state=ClusterState.UP,
+                status_message="Process running",
+            )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/transformerlab/compute_providers/local.py` around lines 244 - 259, In get_cluster_status, os.kill(pid, 0) returns None on success (process exists) and raises on failure, so the current if os_killed is not None branch is inverted and marks running processes as DOWN; change the logic to call os.kill(pid, 0) inside a try and treat the no-exception path as ClusterState.UP (status_message "Process running"), catch ProcessLookupError (and/or OSError with errno ESRCH) to return ClusterState.DOWN ("Process not running"), treat PermissionError as UP (process exists but not permitted), and handle any other exceptions by logging and returning an appropriate DOWN status; update references to pid_file.read_text(), ClusterStatus and ClusterState.UP/DOWN accordingly.
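The `os.kill(pid, 0)` semantics behind this fix can be sketched as a standalone liveness check. The helper name and the `PermissionError` handling follow the review's suggestion and are illustrative, not code from the repository:

```python
import errno
import os

def process_alive(pid: int) -> bool:
    # os.kill(pid, 0) sends no signal: it returns None when the pid
    # exists and raises when it does not, so liveness must be inferred
    # from the *absence* of an exception -- the inversion the review flags.
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True   # process exists but belongs to another user
    except OSError as exc:
        return exc.errno != errno.ESRCH
    return True

print(process_alive(os.getpid()))  # the current process is always alive
```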
185-192: ⚠️ Potential issue | 🟠 Major

File descriptor leak in `Popen` call.

The `open()` handles for `stdout.log` and `stderr.log` are passed directly and never closed. The parent process will hold these file descriptors open until they are garbage-collected.

🔒 Proposed fix
```diff
-        proc = subprocess.Popen(
-            ["/bin/bash", "-c", config.command or "true"],
-            cwd=str(job_dir),
-            env=env,
-            stdout=open(job_dir / "stdout.log", "w"),
-            stderr=open(job_dir / "stderr.log", "w"),
-            start_new_session=True,
-        )
+        stdout_log = open(job_dir / "stdout.log", "w")
+        stderr_log = open(job_dir / "stderr.log", "w")
+        proc = subprocess.Popen(
+            ["/bin/bash", "-c", config.command or "true"],
+            cwd=str(job_dir),
+            env=env,
+            stdout=stdout_log,
+            stderr=stderr_log,
+            start_new_session=True,
+        )
+        stdout_log.close()
+        stderr_log.close()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/transformerlab/compute_providers/local.py` around lines 185 - 192, The Popen call leaks file descriptors because open(job_dir / "stdout.log", "w") and stderr are passed directly; fix by opening those files into local variables (e.g., stdout_f, stderr_f) before calling subprocess.Popen, call proc = subprocess.Popen(..., stdout=stdout_f, stderr=stderr_f, ...), and then close the parent-side file objects in a finally block (or immediately after Popen returns) to ensure they are closed even on exceptions; reference the subprocess.Popen invocation and the job_dir/stdout.log and stderr.log file opens when making the change.
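A minimal sketch of the descriptor-safe pattern, with illustrative paths and command; a `with` block is an alternative to the explicit `close()` calls in the proposed diff and also covers the case where `Popen` itself raises:

```python
import subprocess
import tempfile
from pathlib import Path

job_dir = Path(tempfile.mkdtemp())

# Opening the log files in a with-block guarantees the parent's
# descriptors are closed even if Popen raises.
with open(job_dir / "stdout.log", "w") as out, open(job_dir / "stderr.log", "w") as err:
    proc = subprocess.Popen(
        ["/bin/sh", "-c", "echo hello"],
        cwd=str(job_dir),
        stdout=out,
        stderr=err,
        start_new_session=True,
    )

# The child holds its own copies of the descriptors, so closing the
# parent's handles does not interfere with the running process.
proc.wait()
print((job_dir / "stdout.log").read_text().strip())  # → hello
```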
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- api/transformerlab/compute_providers/local.py
- api/transformerlab/routers/compute_provider.py
- api/transformerlab/services/local_provider_queue.py
Paragon Summary

This pull request review identified 2 issues across 2 categories in 3 files. The review analyzed code changes, potential bugs, security vulnerabilities, performance issues, and code quality concerns using automated analysis tools. This PR fixes "job not found" errors occurring when running with local providers by correcting job path resolution and setting the home directory to be within the job workspace.
Confidence score: 2/5
3 files reviewed, 2 comments. Severity breakdown: Critical: 1, High: 1.
```python
                job_id=str(job_id),
                experiment_id=str(experiment_id),
                provider_id=str(provider_id),
                team_id=str(team_id),
```
Bug: Stringifying a null team ID creates an invalid organization scope
Stringifying a null team ID creates an invalid organization scope. This completely breaks operations for personal workspaces. Make the team ID optional and pass it without stringification.
View Details
Location: api/transformerlab/services/local_provider_queue.py (lines 49)
Analysis
Stringifying a null team ID creates an invalid organization scope
| | |
| --- | --- |
| What fails | Personal workspaces will be scoped to a literal `'None'` organization. |
| Result | The worker calls `set_organization_id('None')`, forcing all filesystem operations into `orgs/None/workspace`. |
| Expected | `team_id` should be `Optional[str]` and passed without stringification. |
| Impact | Breaks personal workspaces completely by misrouting file operations to `orgs/None/workspace`. |
How to reproduce
Launch a job without a team (`team_id=None`).

Patch Details
```diff
-                team_id=str(team_id),
+                team_id=team_id,
```

AI Fix Prompt
Fix this issue: Stringifying a null team ID creates an invalid organization scope. This completely breaks operations for personal workspaces. Make the team ID optional and pass it without stringification.
Location: api/transformerlab/services/local_provider_queue.py (lines 49)
Problem: Personal workspaces will be scoped to a literal 'None' organization.
Current behavior: The worker calls set_organization_id('None'), forcing all filesystem operations into orgs/None/workspace.
Expected: team_id should be Optional[str] and passed without stringification.
Steps to reproduce: Launch a job without a team (team_id=None).
Provide a code fix.
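The pitfall is easy to demonstrate in isolation; `make_item` and `make_item_fixed` below are hypothetical stand-ins for the `LocalLaunchWorkItem` construction:

```python
from typing import Optional

def make_item(team_id) -> dict:
    # Buggy: str() turns None into the literal string 'None',
    # which then looks like a valid organization id downstream.
    return {"team_id": str(team_id)}

def make_item_fixed(team_id: Optional[str]) -> dict:
    # Fixed: pass the optional value through unchanged.
    return {"team_id": team_id}

print(repr(make_item(None)["team_id"]))        # → 'None'  (a string)
print(repr(make_item_fixed(None)["team_id"]))  # → None    (the real None)
```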
```python
        env = os.environ.copy()
        env.update(config.env_vars or {})
        env["PATH"] = f"{venv_bin}{os.pathsep}{env.get('PATH', '')}"
        env["HOME"] = str(workspace_home)
```
Bug: Overriding HOME breaks shared model and package caches
Overriding HOME breaks shared model and package caches. Ephemeral jobs will waste time redownloading gigabytes of data. Explicitly set host cache paths in the subprocess environment.
View Details
Location: api/transformerlab/compute_providers/local.py (lines 164)
Analysis
Overriding HOME breaks shared model and package caches
| | |
| --- | --- |
| What fails | Shared models and packages will not be reused across jobs. |
| Result | Models are downloaded to the ephemeral workspace `HOME` instead of the shared `~/.cache` directory. |
| Expected | Preserve default cache directories if they aren't explicitly set in the environment. |
| Impact | Local jobs will waste disk space and time re-downloading gigabytes of models from scratch. |
How to reproduce
Launch a local job that downloads Hugging Face models.

Patch Details
```diff
-        env["HOME"] = str(workspace_home)
+        if "HF_HOME" not in env:
+            env["HF_HOME"] = str(Path.home() / ".cache" / "huggingface")
+        env["HOME"] = str(workspace_home)
```

AI Fix Prompt
Fix this issue: Overriding HOME breaks shared model and package caches. Ephemeral jobs will waste time redownloading gigabytes of data. Explicitly set host cache paths in the subprocess environment.
Location: api/transformerlab/compute_providers/local.py (lines 164)
Problem: Shared models and packages will not be reused across jobs.
Current behavior: Models are downloaded to the ephemeral workspace HOME instead of the shared ~/.cache directory.
Expected: Preserve default cache directories if they aren't explicitly set in the environment.
Steps to reproduce: Launch a local job that downloads Hugging Face models.
Provide a code fix.
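A sketch of the suggested mitigation, generalized to pip as well. `build_job_env` is a hypothetical helper; `HF_HOME` and `PIP_CACHE_DIR` are the real environment variables Hugging Face and pip honor, and the workspace path is illustrative:

```python
import os
from pathlib import Path

def build_job_env(workspace_home: Path) -> dict:
    env = os.environ.copy()
    real_home = Path.home()
    # Pin cache locations to the host defaults *before* HOME changes,
    # unless the caller already set them explicitly.
    env.setdefault("HF_HOME", str(real_home / ".cache" / "huggingface"))
    env.setdefault("PIP_CACHE_DIR", str(real_home / ".cache" / "pip"))
    env["HOME"] = str(workspace_home)  # job-scoped HOME, set last
    return env

env = build_job_env(Path("/tmp/job-123"))
print(env["HOME"])  # → /tmp/job-123
```

Tools that derive their cache path from `HOME` keep hitting the shared host cache, so each ephemeral job avoids re-downloading models.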