
Fix provider logs for runpod provider #1296

Merged
deep1401 merged 35 commits into main from fix/logs-runpod-provider
Feb 19, 2026

Conversation

@deep1401
Member

@deep1401 deep1401 commented Feb 4, 2026

  • Also fixes the case where, if a job fails, RunPod keeps restarting the Docker container

Summary by CodeRabbit

  • New Features

    • SSH-based log access for RunPod pods with persistent container log files
    • Configurable workspace AWS credentials location propagated to remote hosts
    • Persistent setup-home and centralized PATH for installed tools
  • Bug Fixes

    • More tolerant, non-root-friendly remote setup across galleries
    • Safer SSH host-key handling and hardened key/credentials file operations
    • Improved installer resilience and fallbacks for tool installation
  • Chores

    • Removed legacy per-user virtualenv guidance and consolidated SSH policy handling

@paragon-review

paragon-review bot commented Feb 4, 2026

Paragon Review Unavailable

Hi @deep1401! To enable Paragon reviews on this repository, please register at https://home.polarity.cc

Once registered, connect your GitHub account and Paragon will automatically review your pull requests.

Comment thread api/transformerlab/routers/experiment/jobs.py Fixed
@sentry

sentry bot commented Feb 4, 2026

@aliasaria
Member

@aliasaria to review.

Instructions: set up a RunPod provider with a RunPod API key, then launch any task and check whether you can see provider logs. Provider logs are necessary for interactive tasks.

pod_data["dockerStartCmd"] = ["sh", "-c", combined_cmd]
# mkdir -p /workspace works with or without a network volume; tee writes run logs there
# Run command, tee output to log file, then always sleep to keep container alive
wrapped_cmd = f"mkdir -p /workspace && (({combined_cmd}) 2>&1 | tee {RUNPOD_RUN_LOGS_PATH}); sleep infinity"
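The wrapping above can be sketched as a small helper. This is a sketch only: the helper name `wrap_run_command` is hypothetical, and the log path `/workspace/run_logs.txt` is taken from the log-reading code later in this thread.

```python
# Assumed value of the RUNPOD_RUN_LOGS_PATH constant, per the SSH log reader
# shown later in this PR, which cats /workspace/run_logs.txt.
RUNPOD_RUN_LOGS_PATH = "/workspace/run_logs.txt"


def wrap_run_command(combined_cmd: str) -> str:
    # mkdir -p /workspace works with or without a network volume; the subshell
    # plus `2>&1 | tee` captures stdout and stderr into the log file, and
    # `sleep infinity` keeps the container alive so it is not restarted.
    return (
        f"mkdir -p /workspace && (({combined_cmd}) 2>&1 "
        f"| tee {RUNPOD_RUN_LOGS_PATH}); sleep infinity"
    )
```

The wrapper deliberately swallows the inner command's exit status (the trailing `; sleep infinity` runs regardless), which is what stops RunPod's restart loop on non-zero exits.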
Member

Super dumb question so I know how this works: I assume you are doing sleep infinity so the pod is still alive when you are getting logs? I can't figure out how we make sure it gets stopped?

Member Author

The sleep infinity is because RunPod has a quirk where, if you launch like this with their REST API, it keeps repeating the same command again and again, killing the Docker container and restarting it once it exits with a non-zero exit code

Member

Yeah the part I don't understand is how do we make sure the pod is terminated at the end?

Member Author

Okay, so for that part I am going to modify check-status. Right now, if your code running on RunPod does not use lab-sdk, check-status can stop the pod once it's done, but using it with lab-sdk doesn't allow that. I can fix it separately. SkyPilot is the only provider where we can set down=True so it kills the pod once things are done; that isn't available for our other providers

Member

Ok, all this makes me uncomfortable! I like how RunPod works like Docker: it will just execute your command and then shut down the machine. But I understand that then we lose logs. Can't we solve that with a trap or some other solution? Doesn't RunPod show logs itself?

Member Author

No, this is the major issue. RunPod doesn't provide a direct way to access logs. You either access them through the GUI, or you follow something like SkyPilot, which starts a gRPC service to keep fetching logs. This was the simplest solution to fetch logs.
Also, after execution it won't shut down the machine; it keeps restarting the container.
The current method is easier.
I have an approach that I was discussing with @dadmobile where I just do the pod shutdown through check-status in a separate PR

Member Author

Just clarifying that the sleep infinity part is not to preserve provider logs but to prevent the Docker container from restarting again and again. Since we launch with the REST API, the pod gets into a weird state: once the job finishes, it keeps restarting the container, since that might be managed by something on RunPod's own GUI. So we use sleep infinity to work around that

Member Author

Also opened a PR here to stop a pod on its own once it's in a terminal state, so the issue of the RunPod pod not being killed would be solved by this: #1364

Member Author

As per discussion, we added runpodctl remove pod $RUNPOD_POD_ID at the end instead of sleep infinity, so the pod is killed immediately after execution.
We found that the issue is with how RunPod creates pods; multiple people complained about the same thing on the RunPod Discord
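The final approach described above can be sketched as follows. This is a minimal sketch assuming the same tee-based logging as the earlier diff; the helper name `wrap_run_command_with_teardown` is hypothetical, while `runpodctl remove pod $RUNPOD_POD_ID` is the teardown command named in the comment (RunPod sets `RUNPOD_POD_ID` in the pod's environment).

```python
def wrap_run_command_with_teardown(combined_cmd: str) -> str:
    # Same tee-based capture as before, but instead of `sleep infinity`
    # the pod removes itself once the command finishes, so no separate
    # check-status teardown is needed for the restart-loop problem.
    return (
        f"mkdir -p /workspace && (({combined_cmd}) 2>&1 "
        "| tee /workspace/run_logs.txt); "
        "runpodctl remove pod $RUNPOD_POD_ID"
    )
```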

@coderabbitai
Contributor

coderabbitai bot commented Feb 19, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough


Replaces per-user venv setup with a system-Python + SETUP_HOME model, centralizes SSH host-key policy, adds RunPod SSH-accessible logging with container keep‑alive, and moves AWS/SSH credential paths to a workspace-aware layout while updating routers and gallery scripts.

Changes

Cohort / File(s): Summary

  • Remote setup & gallery (api/scripts/remote_setup.sh, api/transformerlab/galleries/interactive-gallery.json): Remove venv-based flow; introduce SETUP_HOME for persistent setup-time paths; centralize PATH exports (SETUP_HOME/.local/bin, SETUP_HOME/.cargo/bin); prefer system Python/uv pip with fallbacks; move AWS/SSH/dotfiles/copy-file-mounts under SETUP_HOME; add sudo-aware wrappers and silence non-fatal installer errors.
  • RunPod provider & logging (api/transformerlab/compute_providers/runpod.py): Add RUNPOD_RUN_LOGS_PATH; expose SSH port (22/tcp); wrap launched commands to tee stdout/stderr to fixed log path and keep container alive; add fetch_runpod_provider_logs async SSH reader that tails/cats the log file using org SSH key + host-key policy.
  • AWS creds & router plumbing (api/transformerlab/routers/compute_provider.py): Add RUNPOD_AWS_CREDENTIALS_DIR; extend _generate_aws_credentials_setup to accept aws_credentials_dir; write/trim credentials into provided dir, set permissions, propagate AWS_SHARED_CREDENTIALS_FILE for RunPod flows; adjust SSH public-key handling for RunPod.
  • Job routing & log retrieval (api/transformerlab/routers/experiment/jobs.py): Enhance provider_job_id discovery with multiple fallbacks and job listing; route RunPod log/tunnel retrieval to fetch_runpod_provider_logs; add guards for unsupported list_jobs and fallback behaviors.
  • SSH policy consolidation (api/transformerlab/compute_providers/slurm.py, api/transformerlab/shared/ssh_policy.py): Extract local AddIfVerified missing-host-key policy into shared/ssh_policy.py as get_add_if_verified_policy; replace SLURM local implementation with shared import to unify host-key handling.
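The credentials plumbing described for compute_provider.py can be sketched as below. This is an illustrative sketch only: the walkthrough names _generate_aws_credentials_setup and AWS_SHARED_CREDENTIALS_FILE, but the exact script contents and parameter names here are assumptions.

```python
def generate_aws_credentials_setup(
    aws_access_key_id: str,
    aws_secret_access_key: str,
    aws_credentials_dir: str,
) -> str:
    """Hypothetical sketch: emit a bash snippet that writes an INI-style
    credentials file into the given directory, locks down permissions,
    and points AWS_SHARED_CREDENTIALS_FILE at it."""
    creds_file = f"{aws_credentials_dir}/credentials"
    return "\n".join(
        [
            f"mkdir -p {aws_credentials_dir}",
            f"cat > {creds_file} <<'EOF'",
            "[default]",
            f"aws_access_key_id = {aws_access_key_id}",
            f"aws_secret_access_key = {aws_secret_access_key}",
            "EOF",
            f"chmod 600 {creds_file}",
            f"export AWS_SHARED_CREDENTIALS_FILE={creds_file}",
        ]
    )
```

Passing aws_credentials_dir lets RunPod flows place the file under the workspace mount rather than the ephemeral home directory.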

Sequence Diagram(s)

sequenceDiagram
    participant Client as API Client
    participant Router as Experiment Router
    participant JobDB as Job Store
    participant SSHKey as Org SSH Key Store
    participant Pod as RunPod Pod
    participant Logs as RUNPOD_RUN_LOGS_PATH

    Client->>Router: request provider logs / tunnel info
    Router->>JobDB: resolve provider_job_id / list jobs
    JobDB-->>Router: return job candidate (host / cluster_name)
    Router->>SSHKey: load org SSH private key (from RUNPOD_AWS_CREDENTIALS_DIR)
    SSHKey-->>Router: return private key
    Router->>Pod: SSH connect (host:22) using AddIfVerified policy
    Router->>Pod: run tail/cat on Logs path
    Pod-->>Router: return log content
    Router-->>Client: return logs payload
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 From burrows of venv I hopped away,

I planted SETUP_HOME where credentials stay,
RunPods hum and logs I tend,
SSH welcomes keys I send,
A small rabbit cheers—hops on display! 🥕

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check (✅ Passed): Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title check (✅ Passed): The title 'Fix provider logs for runpod provider' accurately reflects the primary objective of the PR, which focuses on fixing RunPod provider logs handling and preventing container restarts on job failures.
  • Docstring Coverage (✅ Passed): Docstring coverage is 80.00%, which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix/logs-runpod-provider

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
api/transformerlab/routers/compute_provider.py (1)

787-832: 🛠️ Refactor suggestion | 🟠 Major

Move credential setup script generation into the service layer.

This router now embeds provider-specific credential logic. To keep routers thin, please move _generate_aws_credentials_setup and related setup logic into api/transformerlab/services/ and call it from the router.

As per coding guidelines, "api/transformerlab/routers/*.py: Place business logic in api/transformerlab/services/ using the Service pattern; routers should only handle HTTP request/response validation and call services".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/transformerlab/routers/compute_provider.py` around lines 787 - 832,
Extract the _generate_aws_credentials_setup function from the router and move it
into a new or existing Service class under api/transformerlab/services (e.g.,
AwsCredentialsService.generate_setup_script or aws_credentials.py with a
generate_aws_credentials_setup function); keep the same signature and behavior
but remove any router-specific imports, add unit tests for the service, and
export it for reuse. In the router (compute_provider.py) replace the internal
function with an import of the service and call the service method to get the
bash script (pass aws_access_key_id, aws_secret_access_key, aws_profile,
aws_credentials_dir) so the router only handles request/response and delegates
credential creation to the service.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@api/transformerlab/routers/experiment/jobs.py`:
- Around line 46-131: The _fetch_runpod_provider_logs helper contains
provider-side SSH and log-retrieval business logic and should be moved to a
service under api/transformerlab/services/ (e.g.,
services/runpod_log_service.py); move the entire async function
_fetch_runpod_provider_logs there, keep its signature (provider_instance,
cluster_name, team_id, tail_lines), import RUNPOD_RUN_LOGS_PATH,
get_org_ssh_private_key, and get_add_if_verified_policy in the new module, and
ensure it still returns the same strings/errors; in the router replace the
in-file function with a simple call to the new service function and update
imports (from services.runpod_log_service import _fetch_runpod_provider_logs or
a public name) so the router only handles request/response while all SSH,
paramiko, io, and asyncio.to_thread usage lives in the service.
- Around line 46-51: The helper _fetch_runpod_provider_logs is missing a type
annotation for the provider_instance parameter; add the appropriate type (e.g.,
RunpodProvider, ProviderInstance, or a suitable protocol/ABC used in the
project) to the function signature and import that type at the top of the module
so the signature becomes async def
_fetch_runpod_provider_logs(provider_instance: <ProviderType>, cluster_name:
str, team_id: str, tail_lines: int) -> str, ensuring the chosen type matches the
methods/attributes used from provider_instance inside the function.

In `@api/transformerlab/shared/ssh_policy.py`:
- Around line 4-13: Add type hints for get_add_if_verified_policy and
AddIfVerified.missing_host_key without importing paramiko at runtime: inside the
module add a TYPE_CHECKING guard and import paramiko types for typing only (from
typing import TYPE_CHECKING, Any, Type; if TYPE_CHECKING: import paramiko).
Annotate get_add_if_verified_policy to return
Type["paramiko.MissingHostKeyPolicy"] (or Type[Any] with a string forward ref to
the MissingHostKeyPolicy) and annotate missing_host_key signature as (self,
client: "paramiko.SSHClient", hostname: str, key: "paramiko.PKey") -> None using
string forward references so runtime import is avoided; reference the
AddIfVerified class and missing_host_key method when making these changes.
- Around line 4-5: The import of paramiko inside get_add_if_verified_policy() is
not declared as a dependency, causing runtime failures; update the project's
dependency manifest by adding "paramiko" to the dependencies array in
api/pyproject.toml so the package is installed at runtime and the import in
get_add_if_verified_policy() succeeds.

---

Outside diff comments:
In `@api/transformerlab/routers/compute_provider.py`:
- Around line 787-832: Extract the _generate_aws_credentials_setup function from
the router and move it into a new or existing Service class under
api/transformerlab/services (e.g., AwsCredentialsService.generate_setup_script
or aws_credentials.py with a generate_aws_credentials_setup function); keep the
same signature and behavior but remove any router-specific imports, add unit
tests for the service, and export it for reuse. In the router
(compute_provider.py) replace the internal function with an import of the
service and call the service method to get the bash script (pass
aws_access_key_id, aws_secret_access_key, aws_profile, aws_credentials_dir) so
the router only handles request/response and delegates credential creation to
the service.

Comment on lines +46 to +131
async def _fetch_runpod_provider_logs(
    provider_instance,
    cluster_name: str,
    team_id: str,
    tail_lines: int,
) -> str:
    """
    For RunPod: get pod public IP, SSH with org key, read /workspace/run_logs.txt and return.
    """
    from transformerlab.compute_providers.runpod import RUNPOD_RUN_LOGS_PATH
    from transformerlab.services.ssh_key_service import get_org_ssh_private_key

    status = provider_instance.get_cluster_status(cluster_name)
    provider_data = getattr(status, "provider_data", None) or {}
    # RunPod API returns publicIp directly on the pod object, not nested under runtime
    # Handle empty string case - pod might still be starting
    host = provider_data.get("publicIp") or provider_data.get("public_ip")
    if not host or host == "":
        runtime = provider_data.get("runtime") or {}
        host = runtime.get("publicIp") or runtime.get("public_ip")
    # If still no host, check if pod is still starting (desiredStatus might indicate this)
    if not host or host == "":
        desired_status = provider_data.get("desiredStatus", "")
        if desired_status == "RUNNING":
            return "Pod is starting up - public IP not yet assigned. Please wait a moment and try again."
        return "Pod has no public IP yet (still starting or not available)."

    # Get SSH port from portMappings - RunPod maps internal port 22 to an external port
    port_mappings = provider_data.get("portMappings") or {}
    ssh_port = port_mappings.get("22") or port_mappings.get(22)  # Try both string and int key
    if not ssh_port:
        # Fallback to port 22 if no mapping found (shouldn't happen if port 22 is exposed)
        ssh_port = 22

    try:
        key_bytes = await get_org_ssh_private_key(team_id)
    except Exception as e:
        return f"Cannot load SSH key for provider logs: {e}"

    def _ssh_cat_log() -> str:
        import paramiko

        try:
            pkey = None
            # Decode key bytes to string - paramiko's from_private_key expects string content
            key_str = key_bytes.decode("utf-8")
            key_file = io.StringIO(key_str)

            # Try Ed25519 first (modern key type), then RSA
            try:
                pkey = paramiko.Ed25519Key.from_private_key(key_file)
            except (paramiko.ssh_exception.SSHException, ValueError, TypeError, AttributeError):
                # Reset file pointer and try RSA
                key_file.seek(0)
                try:
                    pkey = paramiko.RSAKey.from_private_key(key_file)
                except (paramiko.ssh_exception.SSHException, ValueError, TypeError, AttributeError) as e:
                    return f"Failed to load SSH key as Ed25519 or RSA: {e}"
        except Exception as e:
            return f"Failed to load SSH key: {e}"

        ssh = paramiko.SSHClient()
        ssh.set_missing_host_key_policy(get_add_if_verified_policy())
        try:
            ssh.connect(
                hostname=host,
                port=ssh_port,
                username="root",
                pkey=pkey,
                timeout=15,
                banner_timeout=15,
            )
            # Prefer tail to limit size; fallback to cat if tail fails
            cmd = f"tail -n {tail_lines} {RUNPOD_RUN_LOGS_PATH} 2>/dev/null || cat {RUNPOD_RUN_LOGS_PATH} 2>/dev/null || echo 'No log file found.'"
            stdin, stdout, stderr = ssh.exec_command(cmd, timeout=10)
            out = stdout.read().decode("utf-8", errors="replace")
            err = stderr.read().decode("utf-8", errors="replace")
            if err and "No such file" not in err:
                out = out or err
            return out.strip() or "No log output yet."
        except Exception as e:
            return f"SSH failed: {e}"
        finally:
            ssh.close()

    return await asyncio.to_thread(_ssh_cat_log)
Contributor

🛠️ Refactor suggestion | 🟠 Major

Move RunPod SSH log retrieval into the service layer.

This helper performs provider-side SSH and log retrieval, which is business logic. Please move it to api/transformerlab/services/ and keep the router focused on request/response handling.

As per coding guidelines, "api/transformerlab/routers/*.py: Place business logic in api/transformerlab/services/ using the Service pattern; routers should only handle HTTP request/response validation and call services".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/transformerlab/routers/experiment/jobs.py` around lines 46 - 131, The
_fetch_runpod_provider_logs helper contains provider-side SSH and log-retrieval
business logic and should be moved to a service under
api/transformerlab/services/ (e.g., services/runpod_log_service.py); move the
entire async function _fetch_runpod_provider_logs there, keep its signature
(provider_instance, cluster_name, team_id, tail_lines), import
RUNPOD_RUN_LOGS_PATH, get_org_ssh_private_key, and get_add_if_verified_policy in
the new module, and ensure it still returns the same strings/errors; in the
router replace the in-file function with a simple call to the new service
function and update imports (from services.runpod_log_service import
_fetch_runpod_provider_logs or a public name) so the router only handles
request/response while all SSH, paramiko, io, and asyncio.to_thread usage lives
in the service.

Comment on lines +46 to +51
async def _fetch_runpod_provider_logs(
    provider_instance,
    cluster_name: str,
    team_id: str,
    tail_lines: int,
) -> str:
Contributor

⚠️ Potential issue | 🟡 Minor

Add a type hint for provider_instance.

This new helper is missing a type annotation for one argument.

🛠️ Suggested fix
-from typing import List, Optional
+from typing import Any, List, Optional
@@
-async def _fetch_runpod_provider_logs(
-    provider_instance,
+async def _fetch_runpod_provider_logs(
+    provider_instance: Any,

As per coding guidelines, "Include mandatory type hints for all function arguments and return types in Python code".

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
async def _fetch_runpod_provider_logs(
    provider_instance,
    cluster_name: str,
    team_id: str,
    tail_lines: int,
) -> str:
async def _fetch_runpod_provider_logs(
    provider_instance: Any,
    cluster_name: str,
    team_id: str,
    tail_lines: int,
) -> str:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/transformerlab/routers/experiment/jobs.py` around lines 46 - 51, The
helper _fetch_runpod_provider_logs is missing a type annotation for the
provider_instance parameter; add the appropriate type (e.g., RunpodProvider,
ProviderInstance, or a suitable protocol/ABC used in the project) to the
function signature and import that type at the top of the module so the
signature becomes async def _fetch_runpod_provider_logs(provider_instance:
<ProviderType>, cluster_name: str, team_id: str, tail_lines: int) -> str,
ensuring the chosen type matches the methods/attributes used from
provider_instance inside the function.

Comment on lines +4 to +13
def get_add_if_verified_policy():
    import paramiko

    class AddIfVerified(paramiko.MissingHostKeyPolicy):
        """Custom SSH host key policy that adds host keys after verification."""

        def missing_host_key(self, client, hostname, key):
            """Handle missing host key by adding it to known_hosts after verification."""
            client._host_keys.add(hostname, key.get_name(), key)
            client._host_keys.save(os.path.expanduser("~/.ssh/known_hosts"))
Contributor

@coderabbitai coderabbitai bot Feb 19, 2026

⚠️ Potential issue | 🟡 Minor

Add required type hints to the SSH policy helper.

get_add_if_verified_policy and missing_host_key are missing argument/return annotations. Please add typing without forcing a runtime import.

🛠️ Suggested fix
 import os
+from typing import TYPE_CHECKING
+
+if TYPE_CHECKING:
+    import paramiko
@@
-def get_add_if_verified_policy():
+def get_add_if_verified_policy() -> "paramiko.MissingHostKeyPolicy":
     import paramiko
@@
-        def missing_host_key(self, client, hostname, key):
+        def missing_host_key(
+            self,
+            client: "paramiko.SSHClient",
+            hostname: str,
+            key: "paramiko.PKey",
+        ) -> None:

As per coding guidelines, "Include mandatory type hints for all function arguments and return types in Python code".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/transformerlab/shared/ssh_policy.py` around lines 4 - 13, Add type hints
for get_add_if_verified_policy and AddIfVerified.missing_host_key without
importing paramiko at runtime: inside the module add a TYPE_CHECKING guard and
import paramiko types for typing only (from typing import TYPE_CHECKING, Any,
Type; if TYPE_CHECKING: import paramiko). Annotate get_add_if_verified_policy to
return Type["paramiko.MissingHostKeyPolicy"] (or Type[Any] with a string forward
ref to the MissingHostKeyPolicy) and annotate missing_host_key signature as
(self, client: "paramiko.SSHClient", hostname: str, key: "paramiko.PKey") ->
None using string forward references so runtime import is avoided; reference the
AddIfVerified class and missing_host_key method when making these changes.

Member Author

Don't suggest type annotation issues, since we haven't adopted a type checker yet

Contributor

@deep1401, understood! I'll stop suggesting type annotation issues since the codebase hasn't adopted a type checker yet. I've removed that from my memory.


⛔ Learnings removed
Learnt from: CR
Repo: transformerlab/transformerlab-app PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-02-18T14:33:14.026Z
Learning: Applies to **/*.py : Include mandatory type hints for all function arguments and return types in Python code

🧠 Learnings used
Learnt from: CR
Repo: transformerlab/transformerlab-app PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-02-18T14:33:14.026Z
Learning: Applies to **/*.py : Include mandatory type hints for all function arguments and return types in Python code

Comment thread api/transformerlab/shared/ssh_policy.py
Contributor

@coderabbitai coderabbitai bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
api/transformerlab/routers/experiment/jobs.py (2)

240-246: ⚠️ Potential issue | 🟡 Minor

Add type hint for user_and_team parameter.

The user_and_team parameter is missing a type annotation. Based on the get_user_and_team dependency, this returns a dict with user, team_id, and role.

🛠️ Suggested fix
+from typing import Any, Dict, List, Optional
-from typing import List, Optional
@@
 async def get_provider_job_logs(
     experimentId: str,
     job_id: str,
     tail_lines: int = Query(400, ge=100, le=2000),
-    user_and_team=Depends(get_user_and_team),
+    user_and_team: Dict[str, Any] = Depends(get_user_and_team),
     session: AsyncSession = Depends(get_async_session),
 ):

As per coding guidelines, "Include mandatory type hints for all function arguments and return types in Python code".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/transformerlab/routers/experiment/jobs.py` around lines 240 - 246, The
function get_provider_job_logs is missing a type annotation for the
user_and_team dependency; add a proper type hint matching the return of
get_user_and_team (e.g., a TypedDict/Dict with keys "user", "team_id", and
"role" or a named type like UserAndTeam) so the signature becomes async def
get_provider_job_logs(..., user_and_team: UserAndTeam =
Depends(get_user_and_team), ...); update imports if you introduce a new
TypedDict/alias and keep the rest of the implementation unchanged.

368-374: ⚠️ Potential issue | 🟡 Minor

Add type hint for user_and_team parameter.

Same issue as get_provider_job_logs — the user_and_team parameter needs a type annotation.

🛠️ Suggested fix
 async def get_tunnel_info_for_job(
     experimentId: str,
     job_id: str,
     tail_lines: int = Query(400, ge=100, le=2000),
-    user_and_team=Depends(get_user_and_team),
+    user_and_team: Dict[str, Any] = Depends(get_user_and_team),
     session: AsyncSession = Depends(get_async_session),
 ):

As per coding guidelines, "Include mandatory type hints for all function arguments and return types in Python code".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/transformerlab/routers/experiment/jobs.py` around lines 368 - 374, The
parameter user_and_team in get_tunnel_info_for_job lacks a type annotation; add
the same type hint used in get_provider_job_logs for user_and_team (i.e., mirror
the annotation used there for Depends(get_user_and_team)) so the function
signature for get_tunnel_info_for_job includes a proper type for user_and_team
(and likewise ensure the function return type is annotated per project
guidelines).
🧹 Nitpick comments (2)
api/transformerlab/routers/experiment/jobs.py (2)

331-342: Extract duplicated log-fetching logic into a helper.

This log-fetching pattern (lines 331-342) is duplicated in get_tunnel_info_for_job (lines 443-454). Consider extracting it to reduce duplication.

♻️ Suggested refactor

Extract a helper function:

async def _fetch_provider_logs(
    provider,
    provider_instance,
    cluster_name: str,
    provider_job_id: str | int,
    team_id: str,
    tail_lines: int,
) -> str | bytes:
    """Fetch logs from provider, using RunPod-specific method when applicable."""
    if provider.type == ProviderType.RUNPOD.value:
        return await fetch_runpod_provider_logs(
            provider_instance, cluster_name, team_id, tail_lines
        )
    else:
        return provider_instance.get_job_logs(
            cluster_name,
            provider_job_id,
            tail_lines=tail_lines or None,
            follow=False,
        )

Then replace both occurrences:

     try:
-        if provider.type == ProviderType.RUNPOD.value:
-            raw_logs = await fetch_runpod_provider_logs(
-                provider_instance, cluster_name, user_and_team["team_id"], tail_lines
-            )
-        else:
-            raw_logs = provider_instance.get_job_logs(
-                cluster_name,
-                provider_job_id,
-                tail_lines=tail_lines or None,
-                follow=False,
-            )
+        raw_logs = await _fetch_provider_logs(
+            provider, provider_instance, cluster_name, provider_job_id,
+            user_and_team["team_id"], tail_lines
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/transformerlab/routers/experiment/jobs.py` around lines 331 - 342,
Extract the duplicated provider log-fetching logic into a single async helper
(e.g. _fetch_provider_logs) that takes provider, provider_instance,
cluster_name, provider_job_id, team_id, and tail_lines and returns the raw logs;
inside it use the existing check of provider.type == ProviderType.RUNPOD.value
to call fetch_runpod_provider_logs(provider_instance, cluster_name, team_id,
tail_lines) or otherwise call provider_instance.get_job_logs(cluster_name,
provider_job_id, tail_lines=tail_lines or None, follow=False). Replace the
duplicated blocks in the function containing the shown diff and in
get_tunnel_info_for_job to call this new helper instead.

295-295: Specify element type for the List annotation.

Using raw List without a type parameter reduces type safety. Based on usage, this should be list with appropriate element type.

♻️ Suggested fix
-        provider_jobs: List = []
+        provider_jobs: list = []

Or more precisely with the actual job type if available:

provider_jobs: list[Any] = []
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/transformerlab/routers/experiment/jobs.py` at line 295, The variable
provider_jobs is annotated as the raw typing.List which loses type information;
change its annotation to a concrete generic such as provider_jobs: list[Any] =
[] or, better, provider_jobs: list[YourJobType] = [] using the actual job type
used elsewhere, and add the necessary import (from typing import Any) or the job
class/type import so the annotation is valid and type-safe.
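For illustration, the difference the prompt is after (the dict element shape here is an assumption; the real code would parameterize with its actual job type):

```python
from typing import Any

# Raw `List` (or bare `list`) tells the type checker nothing about elements;
# parameterizing it preserves type information at every use site.
provider_jobs: list[Any] = []
provider_jobs.append({"id": "job-1", "status": "RUNNING"})
print(provider_jobs[0]["status"])
```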
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@api/transformerlab/routers/experiment/jobs.py`:
- Around line 240-246: The function get_provider_job_logs is missing a type
annotation for the user_and_team dependency; add a proper type hint matching the
return of get_user_and_team (e.g., a TypedDict/Dict with keys "user", "team_id",
and "role" or a named type like UserAndTeam) so the signature becomes async def
get_provider_job_logs(..., user_and_team: UserAndTeam =
Depends(get_user_and_team), ...); update imports if you introduce a new
TypedDict/alias and keep the rest of the implementation unchanged.
- Around line 368-374: The parameter user_and_team in get_tunnel_info_for_job
lacks a type annotation; add the same type hint used in get_provider_job_logs
for user_and_team (i.e., mirror the annotation used there for
Depends(get_user_and_team)) so the function signature for
get_tunnel_info_for_job includes a proper type for user_and_team (and likewise
ensure the function return type is annotated per project guidelines).


Contributor

@coderabbitai coderabbitai bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
api/transformerlab/routers/compute_provider.py (1)

1419-1456: ⚠️ Potential issue | 🟠 Major

Move RunPod SSH setup into a service layer.

This block is provider-specific operational logic and should live in api/transformerlab/services/, with the router only orchestrating request/response flow.

As per coding guidelines, "api/transformerlab/routers/*.py: Place business logic in api/transformerlab/services/ using the Service pattern; routers should only handle HTTP request/response validation and call services".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/transformerlab/routers/compute_provider.py` around lines 1419 - 1456, The
SSH setup logic is provider-specific and should be moved from the router into a
service; create a service (e.g., transformerlab.services.ssh_setup_service) that
exposes a function like prepare_ssh_setup(team_id: str, provider_type: str,
env_vars: dict) -> List[str] (or a tuple of env updates + setup_commands) which:
calls get_or_create_org_ssh_key_pair and get_org_ssh_public_key, sanitizes and
escapes the public key, sets SSH_PUBLIC_KEY for RunPod, and returns the
appropriate ssh_setup command(s) for RunPod vs interactive SSH; then replace the
in-router block (the code referencing ProviderType, env_vars,
public_key_clean/public_key_escaped, and setup_commands) with a single call to
that service function and merge returned env_vars/commands into the router's
env_vars and setup_commands.
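A rough sketch of the proposed service function; the helper names come from the prompt, and the stubs below are placeholders for the real key-service calls, not the actual implementation:

```python
import shlex


# Placeholder stubs for the org-key helpers named in the review
def get_or_create_org_ssh_key_pair(team_id: str) -> None:
    pass


def get_org_ssh_public_key(team_id: str) -> str:
    return "ssh-ed25519 AAAAC3Example user@example\n"


def prepare_ssh_setup(team_id: str, provider_type: str, env_vars: dict) -> list:
    """Provider-specific SSH setup, moved out of the router into a service."""
    get_or_create_org_ssh_key_pair(team_id)
    public_key_clean = get_org_ssh_public_key(team_id).strip()
    public_key_escaped = shlex.quote(public_key_clean)
    setup_commands = []
    if provider_type == "runpod":
        # RunPod injects the key through an environment variable
        env_vars["SSH_PUBLIC_KEY"] = public_key_clean
    else:
        # Interactive SSH hosts get the key appended to authorized_keys
        setup_commands.append(f"mkdir -p ~/.ssh && echo {public_key_escaped} >> ~/.ssh/authorized_keys")
    return setup_commands


env_vars: dict = {}
commands = prepare_ssh_setup("team-1", "runpod", env_vars)
print(commands, env_vars["SSH_PUBLIC_KEY"][:11])
```

The router would then merge the returned commands and mutated env_vars into its own, keeping only request/response handling in the routing layer.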

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@api/transformerlab/compute_providers/runpod.py`:
- Around line 439-447: The long f-strings assigned to wrapped_cmd (for branches
using combined_cmd, config.setup, and config.command) exceed Ruff's 120-char
limit; refactor each assignment in runpod.py so the shell string is split across
multiple concatenated or parenthesized string parts (or use implicit
adjacent-string literals) while preserving the exact shell content and variables
RUNPOD_RUN_LOGS_PATH and RUNPOD_POD_ID; update the three locations where
wrapped_cmd is set (the combined_cmd branch, the config.setup branch, and the
config.command branch) and keep pod_data["dockerStartCmd"] = ["sh", "-c",
wrapped_cmd] unchanged.
- Around line 36-37: The call to provider_instance.get_cluster_status is
synchronous and may block the event loop — run it in a worker thread (e.g., via
asyncio.to_thread or loop.run_in_executor) inside the async function that uses
cluster_name so the event loop remains responsive; replace the direct call to
provider_instance.get_cluster_status(cluster_name) with an awaited
background-thread invocation and then use the returned status to populate
provider_data as before.
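The pattern the prompt asks for can be sketched as follows; the get_cluster_status stub is a stand-in for the real synchronous SDK call, with an assumed signature:

```python
import asyncio
import time


# Stand-in for the blocking provider SDK call (assumed signature)
def get_cluster_status(cluster_name: str) -> str:
    time.sleep(0.01)  # blocking I/O the event loop should not wait on directly
    return f"{cluster_name}: RUNNING"


async def fetch_status(cluster_name: str) -> str:
    # asyncio.to_thread runs the blocking call in a worker thread,
    # so other coroutines keep making progress while it waits.
    return await asyncio.to_thread(get_cluster_status, cluster_name)


print(asyncio.run(fetch_status("cluster-a")))
```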

Comment thread api/transformerlab/compute_providers/runpod.py Outdated
Comment on lines +439 to +447
            wrapped_cmd = f"mkdir -p /workspace && (({combined_cmd}) 2>&1 | tee {RUNPOD_RUN_LOGS_PATH}); runpodctl remove pod $RUNPOD_POD_ID"
            pod_data["dockerStartCmd"] = ["sh", "-c", wrapped_cmd]
        elif config.setup:
            # Just setup, no command
            pod_data["dockerStartCmd"] = ["sh", "-c", config.setup]
            # For setup-only, still keep container running
            wrapped_cmd = f"({config.setup}); runpodctl remove pod $RUNPOD_POD_ID"
            pod_data["dockerStartCmd"] = ["sh", "-c", wrapped_cmd]
        elif config.command:
            # Just command, no setup
            pod_data["dockerStartCmd"] = ["sh", "-c", config.command]
            wrapped_cmd = f"mkdir -p /workspace && (({config.command}) 2>&1 | tee {RUNPOD_RUN_LOGS_PATH}); runpodctl remove pod $RUNPOD_POD_ID"
            pod_data["dockerStartCmd"] = ["sh", "-c", wrapped_cmd]
Contributor

@coderabbitai coderabbitai bot Feb 19, 2026

⚠️ Potential issue | 🟡 Minor

Wrap long command strings to satisfy Ruff line-length (120).

These f-strings look over 120 chars. Please wrap them to avoid Ruff violations.

🧹 Suggested formatting
-            wrapped_cmd = f"mkdir -p /workspace && (({combined_cmd}) 2>&1 | tee {RUNPOD_RUN_LOGS_PATH}); runpodctl remove pod $RUNPOD_POD_ID"
+            wrapped_cmd = (
+                f"mkdir -p /workspace && (({combined_cmd}) 2>&1 | tee {RUNPOD_RUN_LOGS_PATH}); "
+                "runpodctl remove pod $RUNPOD_POD_ID"
+            )
...
-            wrapped_cmd = f"mkdir -p /workspace && (({config.command}) 2>&1 | tee {RUNPOD_RUN_LOGS_PATH}); runpodctl remove pod $RUNPOD_POD_ID"
+            wrapped_cmd = (
+                f"mkdir -p /workspace && (({config.command}) 2>&1 | tee {RUNPOD_RUN_LOGS_PATH}); "
+                "runpodctl remove pod $RUNPOD_POD_ID"
+            )

As per coding guidelines, "Use Ruff for Python linting with line-length=120 and indent=4".

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-            wrapped_cmd = f"mkdir -p /workspace && (({combined_cmd}) 2>&1 | tee {RUNPOD_RUN_LOGS_PATH}); runpodctl remove pod $RUNPOD_POD_ID"
-            pod_data["dockerStartCmd"] = ["sh", "-c", wrapped_cmd]
-        elif config.setup:
-            # Just setup, no command
-            pod_data["dockerStartCmd"] = ["sh", "-c", config.setup]
-            # For setup-only, still keep container running
-            wrapped_cmd = f"({config.setup}); runpodctl remove pod $RUNPOD_POD_ID"
-            pod_data["dockerStartCmd"] = ["sh", "-c", wrapped_cmd]
-        elif config.command:
-            # Just command, no setup
-            pod_data["dockerStartCmd"] = ["sh", "-c", config.command]
-            wrapped_cmd = f"mkdir -p /workspace && (({config.command}) 2>&1 | tee {RUNPOD_RUN_LOGS_PATH}); runpodctl remove pod $RUNPOD_POD_ID"
-            pod_data["dockerStartCmd"] = ["sh", "-c", wrapped_cmd]
+            wrapped_cmd = (
+                f"mkdir -p /workspace && (({combined_cmd}) 2>&1 | tee {RUNPOD_RUN_LOGS_PATH}); "
+                "runpodctl remove pod $RUNPOD_POD_ID"
+            )
+            pod_data["dockerStartCmd"] = ["sh", "-c", wrapped_cmd]
+        elif config.setup:
+            # For setup-only, still keep container running
+            wrapped_cmd = f"({config.setup}); runpodctl remove pod $RUNPOD_POD_ID"
+            pod_data["dockerStartCmd"] = ["sh", "-c", wrapped_cmd]
+        elif config.command:
+            wrapped_cmd = (
+                f"mkdir -p /workspace && (({config.command}) 2>&1 | tee {RUNPOD_RUN_LOGS_PATH}); "
+                "runpodctl remove pod $RUNPOD_POD_ID"
+            )
+            pod_data["dockerStartCmd"] = ["sh", "-c", wrapped_cmd]

Member Author

There are no ruff violations

Contributor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash

# Check for Ruff configuration files in the repository
echo "=== Looking for Ruff configuration files ==="
fd -t f -d 3 '(ruff\.toml|\.ruff\.toml|pyproject\.toml)$'

echo ""
echo "=== Checking pyproject.toml for Ruff config ==="
if [ -f "pyproject.toml" ]; then
    cat pyproject.toml | grep -A 30 '\[tool\.ruff'
else
    echo "No pyproject.toml found"
fi

echo ""
echo "=== Checking for ruff.toml ==="
if [ -f "ruff.toml" ]; then
    cat ruff.toml
else
    echo "No ruff.toml found"
fi

echo ""
echo "=== Running Ruff check on the file to see actual violations ==="
rg --files | grep 'api/transformerlab/compute_providers/runpod.py' | head -1 | xargs ruff check --select=E501 2>&1 || echo "No E501 violations found or ruff not available"

echo ""
echo "=== Running Ruff check with all rules enabled to see what's actually flagged ==="
rg --files | grep 'api/transformerlab/compute_providers/runpod.py' | head -1 | xargs ruff check 2>&1 || echo "No violations found or ruff not available"

Length of output: 1061


@deep1401, you're absolutely right. I checked the ruff.toml configuration and confirmed that while line-length = 120 is set, the E501 rule (line-too-long) is not enabled. Ruff only flags line length violations when E501 is explicitly enabled via select or extend-select, which isn't the case in this project's configuration.

My apologies for the incorrect suggestion. The long f-strings in these lines are fine as-is. Thank you for the correction!
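As the analysis notes, Ruff's default rule set (E4, E7, E9, F) omits E501, so the line-length setting alone does not enforce a limit. A project that wanted it enforced would need something like this in ruff.toml (illustrative only, not this repository's actual configuration):

```toml
line-length = 120

[lint]
# E501 (line-too-long) only fires when selected explicitly
extend-select = ["E501"]
```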


🧬 Code Graph Analysis Results

<no_relevant_code_snippets>

Contributor

@coderabbitai coderabbitai bot left a comment

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@api/transformerlab/compute_providers/runpod.py`:
- Around line 435-448: The current elif branches are redundant and create
line-length and logging inconsistencies: remove the separate "elif
config.setup:" and "elif config.command:" branches and consolidate command
construction so docker_cmds drives pod_data["dockerStartCmd"]; ensure any
setup-only commands are appended to docker_cmds (so they get handled by the same
path), and always pipe output through RUNPOD_RUN_LOGS_PATH (tee) before the
final "runpodctl remove pod $RUNPOD_POD_ID". Also shorten the long wrapped
command by composing it from smaller pieces (e.g., prefix "mkdir -p /workspace
&&" + the combined command + the tee/removal suffix) so the resulting string
assigned to pod_data["dockerStartCmd"] stays under the 120-char limit while
preserving behavior.
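Under the prompt's assumptions (names such as docker_cmds and the RUNPOD_RUN_LOGS_PATH value below are taken from the snippets above or invented for illustration), the consolidated path could look roughly like:

```python
from typing import Optional

RUNPOD_RUN_LOGS_PATH = "/workspace/run_logs.txt"  # assumed log path for illustration


def build_docker_start_cmd(setup: Optional[str], command: Optional[str]) -> Optional[list]:
    """One code path for setup+command, setup-only, and command-only pods."""
    docker_cmds = [part for part in (setup, command) if part]
    if not docker_cmds:
        return None
    combined = " && ".join(docker_cmds)
    # Always tee output to the persistent log file, then remove the pod so a
    # failed job does not leave RunPod restarting the container forever.
    wrapped = (
        f"mkdir -p /workspace && (({combined}) 2>&1 | tee {RUNPOD_RUN_LOGS_PATH}); "
        "runpodctl remove pod $RUNPOD_POD_ID"
    )
    return ["sh", "-c", wrapped]


print(build_docker_start_cmd("pip install -r requirements.txt", "python train.py"))
```

Folding the setup-only case into the same path also means its output reaches the log file, which the separate elif branch above does not do.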

@deep1401 deep1401 merged commit cde5ed6 into main Feb 19, 2026
9 checks passed
4 participants