
Fix provider logs for runpod provider #1296

Merged
deep1401 merged 35 commits into main from fix/logs-runpod-provider
Feb 19, 2026

Conversation

@deep1401
Member

@deep1401 deep1401 commented Feb 4, 2026

  • Also fixes the case where, if a job fails, RunPod keeps restarting the Docker container

Summary by CodeRabbit

  • New Features

    • SSH-based log access for RunPod pods with persistent container log files
    • Configurable workspace AWS credentials location propagated to remote hosts
    • Persistent setup-home and centralized PATH for installed tools
  • Bug Fixes

    • More tolerant, non-root-friendly remote setup across galleries
    • Safer SSH host-key handling and hardened key/credentials file operations
    • Improved installer resilience and fallbacks for tool installation
  • Chores

    • Removed legacy per-user virtualenv guidance and consolidated SSH policy handling

@paragon-review

paragon-review bot commented Feb 4, 2026

Paragon Review Unavailable

Hi @deep1401! To enable Paragon reviews on this repository, please register at https://home.polarity.cc

Once registered, connect your GitHub account and Paragon will automatically review your pull requests.

Comment thread api/transformerlab/routers/experiment/jobs.py Fixed
@sentry

sentry bot commented Feb 4, 2026

@aliasaria
Member

@aliasaria to review.

Instructions: set up a RunPod provider with a RunPod API key, then launch any task and check whether you can see provider logs. Provider logs are necessary for interactive tasks.

pod_data["dockerStartCmd"] = ["sh", "-c", combined_cmd]
# mkdir -p /workspace works with or without a network volume; tee writes run logs there
# Run command, tee output to log file, then always sleep to keep container alive
wrapped_cmd = f"mkdir -p /workspace && (({combined_cmd}) 2>&1 | tee {RUNPOD_RUN_LOGS_PATH}); sleep infinity"
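The wrapping above can be sketched as a small helper. This is a sketch only: the helper name `wrap_run_command` is hypothetical, and the log path `/workspace/run_logs.txt` is taken from the log-reading code later in this thread.

```python
# Assumed value of the RUNPOD_RUN_LOGS_PATH constant, per the SSH log reader
# shown later in this PR, which cats /workspace/run_logs.txt.
RUNPOD_RUN_LOGS_PATH = "/workspace/run_logs.txt"


def wrap_run_command(combined_cmd: str) -> str:
    # mkdir -p /workspace works with or without a network volume; the subshell
    # plus `2>&1 | tee` captures stdout and stderr into the log file, and
    # `sleep infinity` keeps the container alive so it is not restarted.
    return (
        f"mkdir -p /workspace && (({combined_cmd}) 2>&1 "
        f"| tee {RUNPOD_RUN_LOGS_PATH}); sleep infinity"
    )
```

The wrapper deliberately swallows the inner command's exit status (the trailing `; sleep infinity` runs regardless), which is what stops RunPod's restart loop on non-zero exits.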
Member

Super dumb question so I know how this works: I assume you are doing sleep infinity so the pod is still alive when you are getting logs? I can't figure out how we make sure it gets stopped?

Member Author

The sleep infinity is because RunPod has a quirk where, if you launch like this with their REST API, it keeps repeating the same command again and again, killing the Docker container and restarting it once it exits with a non-zero exit code

Member

Yeah the part I don't understand is how do we make sure the pod is terminated at the end?

Member Author

Okay, so for that part I am going to modify check-status. Right now, if your code running on RunPod does not use lab-sdk, check-status can stop the pod once it's done, but using it with lab-sdk doesn't allow that. I can fix it separately. SkyPilot is the only provider where we can set down=True so it kills the pod once things are done; that isn't available for our other providers

Member

Ok, all this makes me uncomfortable! I like how RunPod works like Docker: it will just execute your command and then shut down the machine. But I understand that then we lose logs. Can't we solve that with a trap or some other solution? Doesn't RunPod show logs itself?

Member Author

No, this is the major issue. RunPod doesn't provide a direct way to access logs. You either access them through the GUI, or you follow something like SkyPilot, which starts a gRPC service to keep fetching logs. This was the simplest solution to fetch logs.
Also, after execution it won't shut down the machine; it keeps restarting the container.
The current method is easier.
I have an approach that I was discussing with @dadmobile where I just do the pod shutdown through check-status in a separate PR

Member Author

Just clarifying that the sleep infinity part is not to preserve provider logs but to prevent the Docker container from restarting again and again. Since we launch with the REST API, the pod gets into a weird state: once the job finishes, it keeps restarting the container, since that might be managed by something on RunPod's own GUI. So we use sleep infinity to work around that

Member Author

Also opened a PR here to stop a pod on its own once it's in a terminal state, so the issue of the RunPod pod not being killed would be solved by this: #1364

Member Author

As per discussion, we added runpodctl remove pod $RUNPOD_POD_ID at the end instead of sleep infinity, so the pod is killed immediately after execution.
We found that the issue is with how RunPod creates pods; multiple people complained about the same thing on the RunPod Discord
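The final approach described above can be sketched as follows. This is a minimal sketch assuming the same tee-based logging as the earlier diff; the helper name `wrap_run_command_with_teardown` is hypothetical, while `runpodctl remove pod $RUNPOD_POD_ID` is the teardown command named in the comment (RunPod sets `RUNPOD_POD_ID` in the pod's environment).

```python
def wrap_run_command_with_teardown(combined_cmd: str) -> str:
    # Same tee-based capture as before, but instead of `sleep infinity`
    # the pod removes itself once the command finishes, so no separate
    # check-status teardown is needed for the restart-loop problem.
    return (
        f"mkdir -p /workspace && (({combined_cmd}) 2>&1 "
        "| tee /workspace/run_logs.txt); "
        "runpodctl remove pod $RUNPOD_POD_ID"
    )
```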

@coderabbitai
Contributor

coderabbitai bot commented Feb 19, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough


Replaces per-user venv setup with a system-Python + SETUP_HOME model, centralizes SSH host-key policy, adds RunPod SSH-accessible logging with container keep‑alive, and moves AWS/SSH credential paths to a workspace-aware layout while updating routers and gallery scripts.

Changes

Cohort / File(s): Summary

  • Remote setup & gallery (api/scripts/remote_setup.sh, api/transformerlab/galleries/interactive-gallery.json): Remove venv-based flow; introduce SETUP_HOME for persistent setup-time paths; centralize PATH exports (SETUP_HOME/.local/bin, SETUP_HOME/.cargo/bin); prefer system Python/uv pip with fallbacks; move AWS/SSH/dotfiles/copy-file-mounts under SETUP_HOME; add sudo-aware wrappers and silence non-fatal installer errors.
  • RunPod provider & logging (api/transformerlab/compute_providers/runpod.py): Add RUNPOD_RUN_LOGS_PATH; expose SSH port (22/tcp); wrap launched commands to tee stdout/stderr to fixed log path and keep container alive; add fetch_runpod_provider_logs async SSH reader that tails/cats the log file using org SSH key + host-key policy.
  • AWS creds & router plumbing (api/transformerlab/routers/compute_provider.py): Add RUNPOD_AWS_CREDENTIALS_DIR; extend _generate_aws_credentials_setup to accept aws_credentials_dir; write/trim credentials into provided dir, set permissions, propagate AWS_SHARED_CREDENTIALS_FILE for RunPod flows; adjust SSH public-key handling for RunPod.
  • Job routing & log retrieval (api/transformerlab/routers/experiment/jobs.py): Enhance provider_job_id discovery with multiple fallbacks and job listing; route RunPod log/tunnel retrieval to fetch_runpod_provider_logs; add guards for unsupported list_jobs and fallback behaviors.
  • SSH policy consolidation (api/transformerlab/compute_providers/slurm.py, api/transformerlab/shared/ssh_policy.py): Extract local AddIfVerified missing-host-key policy into shared/ssh_policy.py as get_add_if_verified_policy; replace SLURM local implementation with shared import to unify host-key handling.
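The credentials plumbing described for compute_provider.py can be sketched as below. This is an illustrative sketch only: the walkthrough names _generate_aws_credentials_setup and AWS_SHARED_CREDENTIALS_FILE, but the exact script contents and parameter names here are assumptions.

```python
def generate_aws_credentials_setup(
    aws_access_key_id: str,
    aws_secret_access_key: str,
    aws_credentials_dir: str,
) -> str:
    """Hypothetical sketch: emit a bash snippet that writes an INI-style
    credentials file into the given directory, locks down permissions,
    and points AWS_SHARED_CREDENTIALS_FILE at it."""
    creds_file = f"{aws_credentials_dir}/credentials"
    return "\n".join(
        [
            f"mkdir -p {aws_credentials_dir}",
            f"cat > {creds_file} <<'EOF'",
            "[default]",
            f"aws_access_key_id = {aws_access_key_id}",
            f"aws_secret_access_key = {aws_secret_access_key}",
            "EOF",
            f"chmod 600 {creds_file}",
            f"export AWS_SHARED_CREDENTIALS_FILE={creds_file}",
        ]
    )
```

Passing aws_credentials_dir lets RunPod flows place the file under the workspace mount rather than the ephemeral home directory.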

Sequence Diagram(s)

sequenceDiagram
    participant Client as API Client
    participant Router as Experiment Router
    participant JobDB as Job Store
    participant SSHKey as Org SSH Key Store
    participant Pod as RunPod Pod
    participant Logs as RUNPOD_RUN_LOGS_PATH

    Client->>Router: request provider logs / tunnel info
    Router->>JobDB: resolve provider_job_id / list jobs
    JobDB-->>Router: return job candidate (host / cluster_name)
    Router->>SSHKey: load org SSH private key (from RUNPOD_AWS_CREDENTIALS_DIR)
    SSHKey-->>Router: return private key
    Router->>Pod: SSH connect (host:22) using AddIfVerified policy
    Router->>Pod: run tail/cat on Logs path
    Pod-->>Router: return log content
    Router-->>Client: return logs payload
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 From burrows of venv I hopped away,

I planted SETUP_HOME where credentials stay,
RunPods hum and logs I tend,
SSH welcomes keys I send,
A small rabbit cheers—hops on display! 🥕

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check (✅ Passed): Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title check (✅ Passed): The title 'Fix provider logs for runpod provider' accurately reflects the primary objective of the PR, which focuses on fixing RunPod provider logs handling and preventing container restarts on job failures.
  • Docstring Coverage (✅ Passed): Docstring coverage is 80.00%, which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix/logs-runpod-provider

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
api/transformerlab/routers/compute_provider.py (1)

787-832: 🛠️ Refactor suggestion | 🟠 Major

Move credential setup script generation into the service layer.

This router now embeds provider-specific credential logic. To keep routers thin, please move _generate_aws_credentials_setup and related setup logic into api/transformerlab/services/ and call it from the router.

As per coding guidelines, "api/transformerlab/routers/*.py: Place business logic in api/transformerlab/services/ using the Service pattern; routers should only handle HTTP request/response validation and call services".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/transformerlab/routers/compute_provider.py` around lines 787 - 832,
Extract the _generate_aws_credentials_setup function from the router and move it
into a new or existing Service class under api/transformerlab/services (e.g.,
AwsCredentialsService.generate_setup_script or aws_credentials.py with a
generate_aws_credentials_setup function); keep the same signature and behavior
but remove any router-specific imports, add unit tests for the service, and
export it for reuse. In the router (compute_provider.py) replace the internal
function with an import of the service and call the service method to get the
bash script (pass aws_access_key_id, aws_secret_access_key, aws_profile,
aws_credentials_dir) so the router only handles request/response and delegates
credential creation to the service.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@api/transformerlab/routers/experiment/jobs.py`:
- Around line 46-131: The _fetch_runpod_provider_logs helper contains
provider-side SSH and log-retrieval business logic and should be moved to a
service under api/transformerlab/services/ (e.g.,
services/runpod_log_service.py); move the entire async function
_fetch_runpod_provider_logs there, keep its signature (provider_instance,
cluster_name, team_id, tail_lines), import RUNPOD_RUN_LOGS_PATH,
get_org_ssh_private_key, and get_add_if_verified_policy in the new module, and
ensure it still returns the same strings/errors; in the router replace the
in-file function with a simple call to the new service function and update
imports (from services.runpod_log_service import _fetch_runpod_provider_logs or
a public name) so the router only handles request/response while all SSH,
paramiko, io, and asyncio.to_thread usage lives in the service.
- Around line 46-51: The helper _fetch_runpod_provider_logs is missing a type
annotation for the provider_instance parameter; add the appropriate type (e.g.,
RunpodProvider, ProviderInstance, or a suitable protocol/ABC used in the
project) to the function signature and import that type at the top of the module
so the signature becomes async def
_fetch_runpod_provider_logs(provider_instance: <ProviderType>, cluster_name:
str, team_id: str, tail_lines: int) -> str, ensuring the chosen type matches the
methods/attributes used from provider_instance inside the function.

In `@api/transformerlab/shared/ssh_policy.py`:
- Around line 4-13: Add type hints for get_add_if_verified_policy and
AddIfVerified.missing_host_key without importing paramiko at runtime: inside the
module add a TYPE_CHECKING guard and import paramiko types for typing only (from
typing import TYPE_CHECKING, Any, Type; if TYPE_CHECKING: import paramiko).
Annotate get_add_if_verified_policy to return
Type["paramiko.MissingHostKeyPolicy"] (or Type[Any] with a string forward ref to
the MissingHostKeyPolicy) and annotate missing_host_key signature as (self,
client: "paramiko.SSHClient", hostname: str, key: "paramiko.PKey") -> None using
string forward references so runtime import is avoided; reference the
AddIfVerified class and missing_host_key method when making these changes.
- Around line 4-5: The import of paramiko inside get_add_if_verified_policy() is
not declared as a dependency, causing runtime failures; update the project's
dependency manifest by adding "paramiko" to the dependencies array in
api/pyproject.toml so the package is installed at runtime and the import in
get_add_if_verified_policy() succeeds.

---

Outside diff comments:
In `@api/transformerlab/routers/compute_provider.py`:
- Around line 787-832: Extract the _generate_aws_credentials_setup function from
the router and move it into a new or existing Service class under
api/transformerlab/services (e.g., AwsCredentialsService.generate_setup_script
or aws_credentials.py with a generate_aws_credentials_setup function); keep the
same signature and behavior but remove any router-specific imports, add unit
tests for the service, and export it for reuse. In the router
(compute_provider.py) replace the internal function with an import of the
service and call the service method to get the bash script (pass
aws_access_key_id, aws_secret_access_key, aws_profile, aws_credentials_dir) so
the router only handles request/response and delegates credential creation to
the service.

Comment on lines +46 to +131
async def _fetch_runpod_provider_logs(
    provider_instance,
    cluster_name: str,
    team_id: str,
    tail_lines: int,
) -> str:
    """
    For RunPod: get pod public IP, SSH with org key, read /workspace/run_logs.txt and return.
    """
    from transformerlab.compute_providers.runpod import RUNPOD_RUN_LOGS_PATH
    from transformerlab.services.ssh_key_service import get_org_ssh_private_key

    status = provider_instance.get_cluster_status(cluster_name)
    provider_data = getattr(status, "provider_data", None) or {}
    # RunPod API returns publicIp directly on the pod object, not nested under runtime
    # Handle empty string case - pod might still be starting
    host = provider_data.get("publicIp") or provider_data.get("public_ip")
    if not host or host == "":
        runtime = provider_data.get("runtime") or {}
        host = runtime.get("publicIp") or runtime.get("public_ip")
    # If still no host, check if pod is still starting (desiredStatus might indicate this)
    if not host or host == "":
        desired_status = provider_data.get("desiredStatus", "")
        if desired_status == "RUNNING":
            return "Pod is starting up - public IP not yet assigned. Please wait a moment and try again."
        return "Pod has no public IP yet (still starting or not available)."

    # Get SSH port from portMappings - RunPod maps internal port 22 to an external port
    port_mappings = provider_data.get("portMappings") or {}
    ssh_port = port_mappings.get("22") or port_mappings.get(22)  # Try both string and int key
    if not ssh_port:
        # Fallback to port 22 if no mapping found (shouldn't happen if port 22 is exposed)
        ssh_port = 22

    try:
        key_bytes = await get_org_ssh_private_key(team_id)
    except Exception as e:
        return f"Cannot load SSH key for provider logs: {e}"

    def _ssh_cat_log() -> str:
        import paramiko

        try:
            pkey = None
            # Decode key bytes to string - paramiko's from_private_key expects string content
            key_str = key_bytes.decode("utf-8")
            key_file = io.StringIO(key_str)

            # Try Ed25519 first (modern key type), then RSA
            try:
                pkey = paramiko.Ed25519Key.from_private_key(key_file)
            except (paramiko.ssh_exception.SSHException, ValueError, TypeError, AttributeError):
                # Reset file pointer and try RSA
                key_file.seek(0)
                try:
                    pkey = paramiko.RSAKey.from_private_key(key_file)
                except (paramiko.ssh_exception.SSHException, ValueError, TypeError, AttributeError) as e:
                    return f"Failed to load SSH key as Ed25519 or RSA: {e}"
        except Exception as e:
            return f"Failed to load SSH key: {e}"

        ssh = paramiko.SSHClient()
        ssh.set_missing_host_key_policy(get_add_if_verified_policy())
        try:
            ssh.connect(
                hostname=host,
                port=ssh_port,
                username="root",
                pkey=pkey,
                timeout=15,
                banner_timeout=15,
            )
            # Prefer tail to limit size; fallback to cat if tail fails
            cmd = f"tail -n {tail_lines} {RUNPOD_RUN_LOGS_PATH} 2>/dev/null || cat {RUNPOD_RUN_LOGS_PATH} 2>/dev/null || echo 'No log file found.'"
            stdin, stdout, stderr = ssh.exec_command(cmd, timeout=10)
            out = stdout.read().decode("utf-8", errors="replace")
            err = stderr.read().decode("utf-8", errors="replace")
            if err and "No such file" not in err:
                out = out or err
            return out.strip() or "No log output yet."
        except Exception as e:
            return f"SSH failed: {e}"
        finally:
            ssh.close()

    return await asyncio.to_thread(_ssh_cat_log)
Contributor

🛠️ Refactor suggestion | 🟠 Major

Move RunPod SSH log retrieval into the service layer.

This helper performs provider-side SSH and log retrieval, which is business logic. Please move it to api/transformerlab/services/ and keep the router focused on request/response handling.

As per coding guidelines, "api/transformerlab/routers/*.py: Place business logic in api/transformerlab/services/ using the Service pattern; routers should only handle HTTP request/response validation and call services".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/transformerlab/routers/experiment/jobs.py` around lines 46 - 131, The
_fetch_runpod_provider_logs helper contains provider-side SSH and log-retrieval
business logic and should be moved to a service under
api/transformerlab/services/ (e.g., services/runpod_log_service.py); move the
entire async function _fetch_runpod_provider_logs there, keep its signature
(provider_instance, cluster_name, team_id, tail_lines), import
RUNPOD_RUN_LOGS_PATH, get_org_ssh_private_key, and get_add_if_verified_policy in
the new module, and ensure it still returns the same strings/errors; in the
router replace the in-file function with a simple call to the new service
function and update imports (from services.runpod_log_service import
_fetch_runpod_provider_logs or a public name) so the router only handles
request/response while all SSH, paramiko, io, and asyncio.to_thread usage lives
in the service.

Comment on lines +46 to +51
async def _fetch_runpod_provider_logs(
    provider_instance,
    cluster_name: str,
    team_id: str,
    tail_lines: int,
) -> str:
Contributor

⚠️ Potential issue | 🟡 Minor

Add a type hint for provider_instance.

This new helper is missing a type annotation for one argument.

🛠️ Suggested fix
-from typing import List, Optional
+from typing import Any, List, Optional
@@
-async def _fetch_runpod_provider_logs(
-    provider_instance,
+async def _fetch_runpod_provider_logs(
+    provider_instance: Any,

As per coding guidelines, "Include mandatory type hints for all function arguments and return types in Python code".

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
async def _fetch_runpod_provider_logs(
    provider_instance,
    cluster_name: str,
    team_id: str,
    tail_lines: int,
) -> str:
async def _fetch_runpod_provider_logs(
    provider_instance: Any,
    cluster_name: str,
    team_id: str,
    tail_lines: int,
) -> str:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/transformerlab/routers/experiment/jobs.py` around lines 46 - 51, The
helper _fetch_runpod_provider_logs is missing a type annotation for the
provider_instance parameter; add the appropriate type (e.g., RunpodProvider,
ProviderInstance, or a suitable protocol/ABC used in the project) to the
function signature and import that type at the top of the module so the
signature becomes async def _fetch_runpod_provider_logs(provider_instance:
<ProviderType>, cluster_name: str, team_id: str, tail_lines: int) -> str,
ensuring the chosen type matches the methods/attributes used from
provider_instance inside the function.

Comment on lines +4 to +13
def get_add_if_verified_policy():
    import paramiko

    class AddIfVerified(paramiko.MissingHostKeyPolicy):
        """Custom SSH host key policy that adds host keys after verification."""

        def missing_host_key(self, client, hostname, key):
            """Handle missing host key by adding it to known_hosts after verification."""
            client._host_keys.add(hostname, key.get_name(), key)
            client._host_keys.save(os.path.expanduser("~/.ssh/known_hosts"))
Contributor

@coderabbitai coderabbitai bot Feb 19, 2026

⚠️ Potential issue | 🟡 Minor

Add required type hints to the SSH policy helper.

get_add_if_verified_policy and missing_host_key are missing argument/return annotations. Please add typing without forcing a runtime import.

🛠️ Suggested fix
 import os
+from typing import TYPE_CHECKING
+
+if TYPE_CHECKING:
+    import paramiko
@@
-def get_add_if_verified_policy():
+def get_add_if_verified_policy() -> "paramiko.MissingHostKeyPolicy":
     import paramiko
@@
-        def missing_host_key(self, client, hostname, key):
+        def missing_host_key(
+            self,
+            client: "paramiko.SSHClient",
+            hostname: str,
+            key: "paramiko.PKey",
+        ) -> None:

As per coding guidelines, "Include mandatory type hints for all function arguments and return types in Python code".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/transformerlab/shared/ssh_policy.py` around lines 4 - 13, Add type hints
for get_add_if_verified_policy and AddIfVerified.missing_host_key without
importing paramiko at runtime: inside the module add a TYPE_CHECKING guard and
import paramiko types for typing only (from typing import TYPE_CHECKING, Any,
Type; if TYPE_CHECKING: import paramiko). Annotate get_add_if_verified_policy to
return Type["paramiko.MissingHostKeyPolicy"] (or Type[Any] with a string forward
ref to the MissingHostKeyPolicy) and annotate missing_host_key signature as
(self, client: "paramiko.SSHClient", hostname: str, key: "paramiko.PKey") ->
None using string forward references so runtime import is avoided; reference the
AddIfVerified class and missing_host_key method when making these changes.

Member Author

Don't suggest type annotation issues, since we haven't adopted a type checker yet

Contributor

@deep1401, understood! I'll stop suggesting type annotation issues since the codebase hasn't adopted a type checker yet. I've removed that from my memory.


⛔ Learnings removed
Learnt from: CR
Repo: transformerlab/transformerlab-app PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-02-18T14:33:14.026Z
Learning: Applies to **/*.py : Include mandatory type hints for all function arguments and return types in Python code

🧠 Learnings used
Learnt from: CR
Repo: transformerlab/transformerlab-app PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-02-18T14:33:14.026Z
Learning: Applies to **/*.py : Include mandatory type hints for all function arguments and return types in Python code

Comment thread api/transformerlab/shared/ssh_policy.py
Contributor

@coderabbitai coderabbitai bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
api/transformerlab/routers/experiment/jobs.py (2)

240-246: ⚠️ Potential issue | 🟡 Minor

Add type hint for user_and_team parameter.

The user_and_team parameter is missing a type annotation. Based on the get_user_and_team dependency, this returns a dict with user, team_id, and role.

🛠️ Suggested fix
+from typing import Any, Dict, List, Optional
-from typing import List, Optional
@@
 async def get_provider_job_logs(
     experimentId: str,
     job_id: str,
     tail_lines: int = Query(400, ge=100, le=2000),
-    user_and_team=Depends(get_user_and_team),
+    user_and_team: Dict[str, Any] = Depends(get_user_and_team),
     session: AsyncSession = Depends(get_async_session),
 ):

As per coding guidelines, "Include mandatory type hints for all function arguments and return types in Python code".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/transformerlab/routers/experiment/jobs.py` around lines 240 - 246, The
function get_provider_job_logs is missing a type annotation for the
user_and_team dependency; add a proper type hint matching the return of
get_user_and_team (e.g., a TypedDict/Dict with keys "user", "team_id", and
"role" or a named type like UserAndTeam) so the signature becomes async def
get_provider_job_logs(..., user_and_team: UserAndTeam =
Depends(get_user_and_team), ...); update imports if you introduce a new
TypedDict/alias and keep the rest of the implementation unchanged.

368-374: ⚠️ Potential issue | 🟡 Minor

Add type hint for user_and_team parameter.

Same issue as get_provider_job_logs — the user_and_team parameter needs a type annotation.

🛠️ Suggested fix
 async def get_tunnel_info_for_job(
     experimentId: str,
     job_id: str,
     tail_lines: int = Query(400, ge=100, le=2000),
-    user_and_team=Depends(get_user_and_team),
+    user_and_team: Dict[str, Any] = Depends(get_user_and_team),
     session: AsyncSession = Depends(get_async_session),
 ):

As per coding guidelines, "Include mandatory type hints for all function arguments and return types in Python code".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/transformerlab/routers/experiment/jobs.py` around lines 368 - 374, The
parameter user_and_team in get_tunnel_info_for_job lacks a type annotation; add
the same type hint used in get_provider_job_logs for user_and_team (i.e., mirror
the annotation used there for Depends(get_user_and_team)) so the function
signature for get_tunnel_info_for_job includes a proper type for user_and_team
(and likewise ensure the function return type is annotated per project
guidelines).
🧹 Nitpick comments (2)
api/transformerlab/routers/experiment/jobs.py (2)

331-342: Extract duplicated log-fetching logic into a helper.

This log-fetching pattern (lines 331-342) is duplicated in get_tunnel_info_for_job (lines 443-454). Consider extracting it to reduce duplication.

♻️ Suggested refactor

Extract a helper function:

async def _fetch_provider_logs(
    provider,
    provider_instance,
    cluster_name: str,
    provider_job_id: str | int,
    team_id: str,
    tail_lines: int,
) -> str | bytes:
    """Fetch logs from provider, using RunPod-specific method when applicable."""
    if provider.type == ProviderType.RUNPOD.value:
        return await fetch_runpod_provider_logs(
            provider_instance, cluster_name, team_id, tail_lines
        )
    else:
        return provider_instance.get_job_logs(
            cluster_name,
            provider_job_id,
            tail_lines=tail_lines or None,
            follow=False,
        )

Then replace both occurrences:

     try:
-        if provider.type == ProviderType.RUNPOD.value:
-            raw_logs = await fetch_runpod_provider_logs(
-                provider_instance, cluster_name, user_and_team["team_id"], tail_lines
-            )
-        else:
-            raw_logs = provider_instance.get_job_logs(
-                cluster_name,
-                provider_job_id,
-                tail_lines=tail_lines or None,
-                follow=False,
-            )
+        raw_logs = await _fetch_provider_logs(
+            provider, provider_instance, cluster_name, provider_job_id,
+            user_and_team["team_id"], tail_lines
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/transformerlab/routers/experiment/jobs.py` around lines 331 - 342,
Extract the duplicated provider log-fetching logic into a single async helper
(e.g. _fetch_provider_logs) that takes provider, provider_instance,
cluster_name, provider_job_id, team_id, and tail_lines and returns the raw logs;
inside it use the existing check of provider.type == ProviderType.RUNPOD.value
to call fetch_runpod_provider_logs(provider_instance, cluster_name, team_id,
tail_lines) or otherwise call provider_instance.get_job_logs(cluster_name,
provider_job_id, tail_lines=tail_lines or None, follow=False). Replace the
duplicated blocks in the function containing the shown diff and in
get_tunnel_info_for_job to call this new helper instead.

295-295: Specify element type for the List annotation.

Using raw List without a type parameter reduces type safety. Based on usage, this should be list with appropriate element type.

♻️ Suggested fix
-        provider_jobs: List = []
+        provider_jobs: list = []

Or more precisely with the actual job type if available:

provider_jobs: list[Any] = []
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/transformerlab/routers/experiment/jobs.py` at line 295, The variable
provider_jobs is annotated as the raw typing.List which loses type information;
change its annotation to a concrete generic such as provider_jobs: list[Any] =
[] or, better, provider_jobs: list[YourJobType] = [] using the actual job type
used elsewhere, and add the necessary import (from typing import Any) or the job
class/type import so the annotation is valid and type-safe.
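For illustration, the difference the prompt is after (the dict element shape here is an assumption; the real code would parameterize with its actual job type):

```python
from typing import Any

# Raw `List` (or bare `list`) tells the type checker nothing about elements;
# parameterizing it preserves type information at every use site.
provider_jobs: list[Any] = []
provider_jobs.append({"id": "job-1", "status": "RUNNING"})
print(provider_jobs[0]["status"])
```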
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@api/transformerlab/routers/experiment/jobs.py`:
- Around line 240-246: The function get_provider_job_logs is missing a type
annotation for the user_and_team dependency; add a proper type hint matching the
return of get_user_and_team (e.g., a TypedDict/Dict with keys "user", "team_id",
and "role" or a named type like UserAndTeam) so the signature becomes async def
get_provider_job_logs(..., user_and_team: UserAndTeam =
Depends(get_user_and_team), ...); update imports if you introduce a new
TypedDict/alias and keep the rest of the implementation unchanged.
- Around line 368-374: The parameter user_and_team in get_tunnel_info_for_job
lacks a type annotation; add the same type hint used in get_provider_job_logs
for user_and_team (i.e., mirror the annotation used there for
Depends(get_user_and_team)) so the function signature for
get_tunnel_info_for_job includes a proper type for user_and_team (and likewise
ensure the function return type is annotated per project guidelines).


Contributor

@coderabbitai coderabbitai bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
api/transformerlab/routers/compute_provider.py (1)

1419-1456: ⚠️ Potential issue | 🟠 Major

Move RunPod SSH setup into a service layer.

This block is provider-specific operational logic and should live in api/transformerlab/services/, with the router only orchestrating request/response flow.

As per coding guidelines, "api/transformerlab/routers/*.py: Place business logic in api/transformerlab/services/ using the Service pattern; routers should only handle HTTP request/response validation and call services".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/transformerlab/routers/compute_provider.py` around lines 1419 - 1456, The
SSH setup logic is provider-specific and should be moved from the router into a
service; create a service (e.g., transformerlab.services.ssh_setup_service) that
exposes a function like prepare_ssh_setup(team_id: str, provider_type: str,
env_vars: dict) -> List[str] (or a tuple of env updates + setup_commands) which:
calls get_or_create_org_ssh_key_pair and get_org_ssh_public_key, sanitizes and
escapes the public key, sets SSH_PUBLIC_KEY for RunPod, and returns the
appropriate ssh_setup command(s) for RunPod vs interactive SSH; then replace the
in-router block (the code referencing ProviderType, env_vars,
public_key_clean/public_key_escaped, and setup_commands) with a single call to
that service function and merge returned env_vars/commands into the router's
env_vars and setup_commands.
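A rough sketch of the proposed service function; the helper names come from the prompt, and the stubs below are placeholders for the real key-service calls, not the actual implementation:

```python
import shlex


# Placeholder stubs for the org-key helpers named in the review
def get_or_create_org_ssh_key_pair(team_id: str) -> None:
    pass


def get_org_ssh_public_key(team_id: str) -> str:
    return "ssh-ed25519 AAAAC3Example user@example\n"


def prepare_ssh_setup(team_id: str, provider_type: str, env_vars: dict) -> list:
    """Provider-specific SSH setup, moved out of the router into a service."""
    get_or_create_org_ssh_key_pair(team_id)
    public_key_clean = get_org_ssh_public_key(team_id).strip()
    public_key_escaped = shlex.quote(public_key_clean)
    setup_commands = []
    if provider_type == "runpod":
        # RunPod injects the key through an environment variable
        env_vars["SSH_PUBLIC_KEY"] = public_key_clean
    else:
        # Interactive SSH hosts get the key appended to authorized_keys
        setup_commands.append(f"mkdir -p ~/.ssh && echo {public_key_escaped} >> ~/.ssh/authorized_keys")
    return setup_commands


env_vars: dict = {}
commands = prepare_ssh_setup("team-1", "runpod", env_vars)
print(commands, env_vars["SSH_PUBLIC_KEY"][:11])
```

The router would then merge the returned commands and mutated env_vars into its own, keeping only request/response handling in the routing layer.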

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@api/transformerlab/compute_providers/runpod.py`:
- Around line 439-447: The long f-strings assigned to wrapped_cmd (for branches
using combined_cmd, config.setup, and config.command) exceed Ruff's 120-char
limit; refactor each assignment in runpod.py so the shell string is split across
multiple concatenated or parenthesized string parts (or use implicit
adjacent-string literals) while preserving the exact shell content and variables
RUNPOD_RUN_LOGS_PATH and RUNPOD_POD_ID; update the three locations where
wrapped_cmd is set (the combined_cmd branch, the config.setup branch, and the
config.command branch) and keep pod_data["dockerStartCmd"] = ["sh", "-c",
wrapped_cmd] unchanged.
- Around line 36-37: The call to provider_instance.get_cluster_status is
synchronous and may block the event loop — run it in a worker thread (e.g., via
asyncio.to_thread or loop.run_in_executor) inside the async function that uses
cluster_name so the event loop remains responsive; replace the direct call to
provider_instance.get_cluster_status(cluster_name) with an awaited
background-thread invocation and then use the returned status to populate
provider_data as before.
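The pattern the prompt asks for can be sketched as follows; the get_cluster_status stub is a stand-in for the real synchronous SDK call, with an assumed signature:

```python
import asyncio
import time


# Stand-in for the blocking provider SDK call (assumed signature)
def get_cluster_status(cluster_name: str) -> str:
    time.sleep(0.01)  # blocking I/O the event loop should not wait on directly
    return f"{cluster_name}: RUNNING"


async def fetch_status(cluster_name: str) -> str:
    # asyncio.to_thread runs the blocking call in a worker thread,
    # so other coroutines keep making progress while it waits.
    return await asyncio.to_thread(get_cluster_status, cluster_name)


print(asyncio.run(fetch_status("cluster-a")))
```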

Comment thread api/transformerlab/compute_providers/runpod.py Outdated
Comment on lines +439 to +447
            wrapped_cmd = f"mkdir -p /workspace && (({combined_cmd}) 2>&1 | tee {RUNPOD_RUN_LOGS_PATH}); runpodctl remove pod $RUNPOD_POD_ID"
            pod_data["dockerStartCmd"] = ["sh", "-c", wrapped_cmd]
        elif config.setup:
            # Just setup, no command
            pod_data["dockerStartCmd"] = ["sh", "-c", config.setup]
            # For setup-only, still keep container running
            wrapped_cmd = f"({config.setup}); runpodctl remove pod $RUNPOD_POD_ID"
            pod_data["dockerStartCmd"] = ["sh", "-c", wrapped_cmd]
        elif config.command:
            # Just command, no setup
            pod_data["dockerStartCmd"] = ["sh", "-c", config.command]
            wrapped_cmd = f"mkdir -p /workspace && (({config.command}) 2>&1 | tee {RUNPOD_RUN_LOGS_PATH}); runpodctl remove pod $RUNPOD_POD_ID"
            pod_data["dockerStartCmd"] = ["sh", "-c", wrapped_cmd]
Contributor

@coderabbitai coderabbitai bot Feb 19, 2026

⚠️ Potential issue | 🟡 Minor

Wrap long command strings to satisfy Ruff line-length (120).

These f-strings look over 120 chars. Please wrap them to avoid Ruff violations.

🧹 Suggested formatting
-            wrapped_cmd = f"mkdir -p /workspace && (({combined_cmd}) 2>&1 | tee {RUNPOD_RUN_LOGS_PATH}); runpodctl remove pod $RUNPOD_POD_ID"
+            wrapped_cmd = (
+                f"mkdir -p /workspace && (({combined_cmd}) 2>&1 | tee {RUNPOD_RUN_LOGS_PATH}); "
+                "runpodctl remove pod $RUNPOD_POD_ID"
+            )
...
-            wrapped_cmd = f"mkdir -p /workspace && (({config.command}) 2>&1 | tee {RUNPOD_RUN_LOGS_PATH}); runpodctl remove pod $RUNPOD_POD_ID"
+            wrapped_cmd = (
+                f"mkdir -p /workspace && (({config.command}) 2>&1 | tee {RUNPOD_RUN_LOGS_PATH}); "
+                "runpodctl remove pod $RUNPOD_POD_ID"
+            )

As per coding guidelines, "Use Ruff for Python linting with line-length=120 and indent=4".

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-            wrapped_cmd = f"mkdir -p /workspace && (({combined_cmd}) 2>&1 | tee {RUNPOD_RUN_LOGS_PATH}); runpodctl remove pod $RUNPOD_POD_ID"
-            pod_data["dockerStartCmd"] = ["sh", "-c", wrapped_cmd]
-        elif config.setup:
-            # Just setup, no command
-            pod_data["dockerStartCmd"] = ["sh", "-c", config.setup]
-            # For setup-only, still keep container running
-            wrapped_cmd = f"({config.setup}); runpodctl remove pod $RUNPOD_POD_ID"
-            pod_data["dockerStartCmd"] = ["sh", "-c", wrapped_cmd]
-        elif config.command:
-            # Just command, no setup
-            pod_data["dockerStartCmd"] = ["sh", "-c", config.command]
-            wrapped_cmd = f"mkdir -p /workspace && (({config.command}) 2>&1 | tee {RUNPOD_RUN_LOGS_PATH}); runpodctl remove pod $RUNPOD_POD_ID"
-            pod_data["dockerStartCmd"] = ["sh", "-c", wrapped_cmd]
+            wrapped_cmd = (
+                f"mkdir -p /workspace && (({combined_cmd}) 2>&1 | tee {RUNPOD_RUN_LOGS_PATH}); "
+                "runpodctl remove pod $RUNPOD_POD_ID"
+            )
+            pod_data["dockerStartCmd"] = ["sh", "-c", wrapped_cmd]
+        elif config.setup:
+            # For setup-only, still keep container running
+            wrapped_cmd = f"({config.setup}); runpodctl remove pod $RUNPOD_POD_ID"
+            pod_data["dockerStartCmd"] = ["sh", "-c", wrapped_cmd]
+        elif config.command:
+            wrapped_cmd = (
+                f"mkdir -p /workspace && (({config.command}) 2>&1 | tee {RUNPOD_RUN_LOGS_PATH}); "
+                "runpodctl remove pod $RUNPOD_POD_ID"
+            )
+            pod_data["dockerStartCmd"] = ["sh", "-c", wrapped_cmd]

Member Author

There are no ruff violations

Contributor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash

# Check for Ruff configuration files in the repository
echo "=== Looking for Ruff configuration files ==="
fd -t f -d 3 '(ruff\.toml|\.ruff\.toml|pyproject\.toml)$'

echo ""
echo "=== Checking pyproject.toml for Ruff config ==="
if [ -f "pyproject.toml" ]; then
    cat pyproject.toml | grep -A 30 '\[tool\.ruff'
else
    echo "No pyproject.toml found"
fi

echo ""
echo "=== Checking for ruff.toml ==="
if [ -f "ruff.toml" ]; then
    cat ruff.toml
else
    echo "No ruff.toml found"
fi

echo ""
echo "=== Running Ruff check on the file to see actual violations ==="
rg --files | grep 'api/transformerlab/compute_providers/runpod.py' | head -1 | xargs ruff check --select=E501 2>&1 || echo "No E501 violations found or ruff not available"

echo ""
echo "=== Running Ruff check with all rules enabled to see what's actually flagged ==="
rg --files | grep 'api/transformerlab/compute_providers/runpod.py' | head -1 | xargs ruff check 2>&1 || echo "No violations found or ruff not available"

Length of output: 1061


@deep1401, you're absolutely right. I checked the ruff.toml configuration and confirmed that while line-length = 120 is set, the E501 rule (line-too-long) is not enabled. Ruff only flags line length violations when E501 is explicitly enabled via select or extend-select, which isn't the case in this project's configuration.

My apologies for the incorrect suggestion. The long f-strings in these lines are fine as-is. Thank you for the correction!
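As the analysis notes, Ruff's default rule set (E4, E7, E9, F) omits E501, so the line-length setting alone does not enforce a limit. A project that wanted it enforced would need something like this in ruff.toml (illustrative only, not this repository's actual configuration):

```toml
line-length = 120

[lint]
# E501 (line-too-long) only fires when selected explicitly
extend-select = ["E501"]
```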


🧬 Code Graph Analysis Results

<no_relevant_code_snippets>

Contributor

@coderabbitai coderabbitai bot left a comment

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@api/transformerlab/compute_providers/runpod.py`:
- Around line 435-448: The current elif branches are redundant and create
line-length and logging inconsistencies: remove the separate "elif
config.setup:" and "elif config.command:" branches and consolidate command
construction so docker_cmds drives pod_data["dockerStartCmd"]; ensure any
setup-only commands are appended to docker_cmds (so they get handled by the same
path), and always pipe output through RUNPOD_RUN_LOGS_PATH (tee) before the
final "runpodctl remove pod $RUNPOD_POD_ID". Also shorten the long wrapped
command by composing it from smaller pieces (e.g., prefix "mkdir -p /workspace
&&" + the combined command + the tee/removal suffix) so the resulting string
assigned to pod_data["dockerStartCmd"] stays under the 120-char limit while
preserving behavior.
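Under the prompt's assumptions (names such as docker_cmds and the RUNPOD_RUN_LOGS_PATH value below are taken from the snippets above or invented for illustration), the consolidated path could look roughly like:

```python
from typing import Optional

RUNPOD_RUN_LOGS_PATH = "/workspace/run_logs.txt"  # assumed log path for illustration


def build_docker_start_cmd(setup: Optional[str], command: Optional[str]) -> Optional[list]:
    """One code path for setup+command, setup-only, and command-only pods."""
    docker_cmds = [part for part in (setup, command) if part]
    if not docker_cmds:
        return None
    combined = " && ".join(docker_cmds)
    # Always tee output to the persistent log file, then remove the pod so a
    # failed job does not leave RunPod restarting the container forever.
    wrapped = (
        f"mkdir -p /workspace && (({combined}) 2>&1 | tee {RUNPOD_RUN_LOGS_PATH}); "
        "runpodctl remove pod $RUNPOD_POD_ID"
    )
    return ["sh", "-c", wrapped]


print(build_docker_start_cmd("pip install -r requirements.txt", "python train.py"))
```

Folding the setup-only case into the same path also means its output reaches the log file, which the separate elif branch above does not do.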

@deep1401 deep1401 merged commit cde5ed6 into main Feb 19, 2026
9 checks passed
4 participants