Fix provider logs for runpod provider #1296
Conversation
@aliasaria to review. Instructions: set up a RunPod provider with a RunPod API key, then launch any task and check that you can see provider logs. Provider logs are necessary for interactive tasks.
```python
pod_data["dockerStartCmd"] = ["sh", "-c", combined_cmd]
# mkdir -p /workspace works with or without a network volume; tee writes run logs there
# Run command, tee output to log file, then always sleep to keep container alive
wrapped_cmd = f"mkdir -p /workspace && (({combined_cmd}) 2>&1 | tee {RUNPOD_RUN_LOGS_PATH}); sleep infinity"
```
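As a standalone illustration (not the actual provider code), the logging part of this wrapper relies on the `(cmd) 2>&1 | tee logfile` idiom: stderr is merged into stdout, and `tee` both streams the output and writes it to the log file that the API later reads over SSH. A minimal sketch with a toy command, where `RUNPOD_RUN_LOGS_PATH` stands in for the real constant:

```python
import os
import subprocess
import tempfile

# Stand-in for the real constant from the provider module
RUNPOD_RUN_LOGS_PATH = os.path.join(tempfile.mkdtemp(), "run_logs.txt")

# Toy job that writes to both stdout and stderr
combined_cmd = "echo to-stdout; echo to-stderr 1>&2"

# Same shape as the wrapper: ensure the log dir exists, merge streams, tee to the log
wrapped_cmd = (
    f"mkdir -p {os.path.dirname(RUNPOD_RUN_LOGS_PATH)} && "
    f"(({combined_cmd}) 2>&1 | tee {RUNPOD_RUN_LOGS_PATH})"
)

result = subprocess.run(["sh", "-c", wrapped_cmd], capture_output=True, text=True)

# tee streams the merged output AND persists it to the log file
print(result.stdout)
with open(RUNPOD_RUN_LOGS_PATH) as f:
    print(f.read())
```

Both `print` calls show the same two lines, which is why the container's live output and the log file fetched later agree.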
Super dumb question so I know how this works: I assume you are doing sleep infinity so the pod is still alive when you are getting logs? I can't figure out how we make sure it gets stopped?
The sleep infinity is there because RunPod has a quirk: if you launch like this with their REST API, it keeps repeating the same command, killing the docker container and restarting it once it exits with a non-zero exit code.
Yeah the part I don't understand is how do we make sure the pod is terminated at the end?
Okay, so for that part I am going to modify check-status. Right now, if your code running on RunPod does not use lab-sdk, check-status can stop the pod once it's done, but using lab-sdk doesn't allow that. I can fix that separately. SkyPilot is the only provider where we can set down=True so it kills the pod once things are done, but that isn't available for our other providers.
Ok, all this makes me uncomfortable! I like how RunPod works like docker: it will just execute your command and then shut down the machine. But I understand that then we lose logs. Can't we solve that with a trap or some other solution? Doesn't RunPod show logs itself?
No, this is the major issue. RunPod doesn't provide a direct way to access logs. You either access them through the GUI, or you follow something like SkyPilot, where they start a gRPC service to keep fetching logs. This was the simplest solution to fetch logs.
Also, after execution it won't shut down the machine; it keeps restarting the container.
The current method is easier.
I have a way that I was discussing with @dadmobile where I just do the pod shutdown through check-status in a separate PR.
Just clarifying that the sleep infinity part is not to preserve provider logs but to prevent the docker container from restarting again and again. Since we launch with the REST API, it gets into a weird state: once the job finishes, it keeps restarting the container, since that might be managed by something in RunPod's own GUI. So we use sleep infinity to work around that.
Also did a PR to stop a pod on its own once it's in a terminal state, so the issue of the RunPod pod not being killed would be solved by this: #1364
As per discussion, we added runpodctl remove pod $RUNPOD_POD_ID at the end instead of sleep infinity, so the pod is killed immediately after execution.
We found the issue is with how RunPod creates pods, as multiple people complained about the same thing on the RunPod Discord.
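The reason `runpodctl remove pod $RUNPOD_POD_ID` is joined with `;` rather than `&&` is that `;` runs the cleanup regardless of the job's exit status, so a failed run still tears the pod down. A small sketch (with `echo cleanup` standing in for the `runpodctl` call):

```python
import subprocess

# A job that fails with a non-zero exit code
failing_job = "exit 3"

# ";" sequencing: cleanup runs no matter how the job exited
with_semicolon = subprocess.run(
    ["sh", "-c", f"({failing_job}); echo cleanup"], capture_output=True, text=True
)

# "&&" sequencing: cleanup only runs if the job succeeded
with_and = subprocess.run(
    ["sh", "-c", f"({failing_job}) && echo cleanup"], capture_output=True, text=True
)

print(repr(with_semicolon.stdout))  # cleanup still runs after the failure
print(repr(with_and.stdout))        # cleanup is skipped
```

This is why the pod is removed even when the wrapped task exits with an error.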
📝 Walkthrough: Replaces per-user venv setup with a system-Python + SETUP_HOME model, centralizes SSH host-key policy, adds RunPod SSH-accessible logging with container keep-alive, and moves AWS/SSH credential paths to a workspace-aware layout while updating routers and gallery scripts.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client as API Client
    participant Router as Experiment Router
    participant JobDB as Job Store
    participant SSHKey as Org SSH Key Store
    participant Pod as RunPod Pod
    participant Logs as RUNPOD_RUN_LOGS_PATH
    Client->>Router: request provider logs / tunnel info
    Router->>JobDB: resolve provider_job_id / list jobs
    JobDB-->>Router: return job candidate (host / cluster_name)
    Router->>SSHKey: load org SSH private key (from RUNPOD_AWS_CREDENTIALS_DIR)
    SSHKey-->>Router: return private key
    Router->>Pod: SSH connect (host:22) using AddIfVerified policy
    Router->>Pod: run tail/cat on Logs path
    Pod-->>Router: return log content
    Router-->>Client: return logs payload
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
api/transformerlab/routers/compute_provider.py (1)
787-832: 🛠️ Refactor suggestion | 🟠 Major: Move credential setup script generation into the service layer.
This router now embeds provider-specific credential logic. To keep routers thin, please move `_generate_aws_credentials_setup` and related setup logic into `api/transformerlab/services/` and call it from the router.
As per coding guidelines, "api/transformerlab/routers/*.py: Place business logic in api/transformerlab/services/ using the Service pattern; routers should only handle HTTP request/response validation and call services".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/transformerlab/routers/compute_provider.py` around lines 787 - 832, Extract the _generate_aws_credentials_setup function from the router and move it into a new or existing Service class under api/transformerlab/services (e.g., AwsCredentialsService.generate_setup_script or aws_credentials.py with a generate_aws_credentials_setup function); keep the same signature and behavior but remove any router-specific imports, add unit tests for the service, and export it for reuse. In the router (compute_provider.py) replace the internal function with an import of the service and call the service method to get the bash script (pass aws_access_key_id, aws_secret_access_key, aws_profile, aws_credentials_dir) so the router only handles request/response and delegates credential creation to the service.
```python
async def _fetch_runpod_provider_logs(
    provider_instance,
    cluster_name: str,
    team_id: str,
    tail_lines: int,
) -> str:
    """
    For RunPod: get pod public IP, SSH with org key, read /workspace/run_logs.txt and return.
    """
    from transformerlab.compute_providers.runpod import RUNPOD_RUN_LOGS_PATH
    from transformerlab.services.ssh_key_service import get_org_ssh_private_key

    status = provider_instance.get_cluster_status(cluster_name)
    provider_data = getattr(status, "provider_data", None) or {}
    # RunPod API returns publicIp directly on the pod object, not nested under runtime
    # Handle empty string case - pod might still be starting
    host = provider_data.get("publicIp") or provider_data.get("public_ip")
    if not host or host == "":
        runtime = provider_data.get("runtime") or {}
        host = runtime.get("publicIp") or runtime.get("public_ip")
        # If still no host, check if pod is still starting (desiredStatus might indicate this)
        if not host or host == "":
            desired_status = provider_data.get("desiredStatus", "")
            if desired_status == "RUNNING":
                return "Pod is starting up - public IP not yet assigned. Please wait a moment and try again."
            return "Pod has no public IP yet (still starting or not available)."

    # Get SSH port from portMappings - RunPod maps internal port 22 to an external port
    port_mappings = provider_data.get("portMappings") or {}
    ssh_port = port_mappings.get("22") or port_mappings.get(22)  # Try both string and int key
    if not ssh_port:
        # Fallback to port 22 if no mapping found (shouldn't happen if port 22 is exposed)
        ssh_port = 22

    try:
        key_bytes = await get_org_ssh_private_key(team_id)
    except Exception as e:
        return f"Cannot load SSH key for provider logs: {e}"

    def _ssh_cat_log() -> str:
        import paramiko

        try:
            pkey = None
            # Decode key bytes to string - paramiko's from_private_key expects string content
            key_str = key_bytes.decode("utf-8")
            key_file = io.StringIO(key_str)

            # Try Ed25519 first (modern key type), then RSA
            try:
                pkey = paramiko.Ed25519Key.from_private_key(key_file)
            except (paramiko.ssh_exception.SSHException, ValueError, TypeError, AttributeError):
                # Reset file pointer and try RSA
                key_file.seek(0)
                try:
                    pkey = paramiko.RSAKey.from_private_key(key_file)
                except (paramiko.ssh_exception.SSHException, ValueError, TypeError, AttributeError) as e:
                    return f"Failed to load SSH key as Ed25519 or RSA: {e}"
        except Exception as e:
            return f"Failed to load SSH key: {e}"

        ssh = paramiko.SSHClient()
        ssh.set_missing_host_key_policy(get_add_if_verified_policy())
        try:
            ssh.connect(
                hostname=host,
                port=ssh_port,
                username="root",
                pkey=pkey,
                timeout=15,
                banner_timeout=15,
            )
            # Prefer tail to limit size; fallback to cat if tail fails
            cmd = f"tail -n {tail_lines} {RUNPOD_RUN_LOGS_PATH} 2>/dev/null || cat {RUNPOD_RUN_LOGS_PATH} 2>/dev/null || echo 'No log file found.'"
            stdin, stdout, stderr = ssh.exec_command(cmd, timeout=10)
            out = stdout.read().decode("utf-8", errors="replace")
            err = stderr.read().decode("utf-8", errors="replace")
            if err and "No such file" not in err:
                out = out or err
            return out.strip() or "No log output yet."
        except Exception as e:
            return f"SSH failed: {e}"
        finally:
            ssh.close()

    return await asyncio.to_thread(_ssh_cat_log)
```
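The remote command the helper runs is a three-stage fallback chain: `tail` to bound the size, `cat` if `tail` fails, and a fixed message if the log file does not exist yet. That chain can be exercised locally, without SSH, as a minimal sketch (the `fallback_read` helper name is illustrative, not part of the codebase):

```python
import os
import subprocess
import tempfile

def fallback_read(path: str, tail_lines: int = 5) -> str:
    # Same fallback chain the helper runs over SSH, executed locally for illustration
    cmd = (
        f"tail -n {tail_lines} {path} 2>/dev/null"
        f" || cat {path} 2>/dev/null"
        f" || echo 'No log file found.'"
    )
    return subprocess.run(["sh", "-c", cmd], capture_output=True, text=True).stdout

log_path = os.path.join(tempfile.mkdtemp(), "run_logs.txt")

# Before the job writes anything, the chain falls through to the echo branch
print(fallback_read(log_path).strip())

# Once the log exists, tail returns only the last N lines
with open(log_path, "w") as f:
    f.write("\n".join(f"line {i}" for i in range(10)) + "\n")
print(fallback_read(log_path, tail_lines=2))
```

Because every stage sends its errors to `/dev/null`, the caller always receives either log content or the sentinel message, never a shell error.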
🛠️ Refactor suggestion | 🟠 Major
Move RunPod SSH log retrieval into the service layer.
This helper performs provider-side SSH and log retrieval, which is business logic. Please move it to api/transformerlab/services/ and keep the router focused on request/response handling.
As per coding guidelines, "api/transformerlab/routers/*.py: Place business logic in api/transformerlab/services/ using the Service pattern; routers should only handle HTTP request/response validation and call services".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@api/transformerlab/routers/experiment/jobs.py` around lines 46 - 131, The
_fetch_runpod_provider_logs helper contains provider-side SSH and log-retrieval
business logic and should be moved to a service under
api/transformerlab/services/ (e.g., services/runpod_log_service.py); move the
entire async function _fetch_runpod_provider_logs there, keep its signature
(provider_instance, cluster_name, team_id, tail_lines), import
RUNPOD_RUN_LOGS_PATH, get_org_ssh_private_key, and get_add_if_verified_policy in
the new module, and ensure it still returns the same strings/errors; in the
router replace the in-file function with a simple call to the new service
function and update imports (from services.runpod_log_service import
_fetch_runpod_provider_logs or a public name) so the router only handles
request/response while all SSH, paramiko, io, and asyncio.to_thread usage lives
in the service.
```python
async def _fetch_runpod_provider_logs(
    provider_instance,
    cluster_name: str,
    team_id: str,
    tail_lines: int,
) -> str:
```
Add a type hint for provider_instance.
This new helper is missing a type annotation for one argument.
🛠️ Suggested fix

```diff
-from typing import List, Optional
+from typing import Any, List, Optional
@@
-async def _fetch_runpod_provider_logs(
-    provider_instance,
+async def _fetch_runpod_provider_logs(
+    provider_instance: Any,
```

As per coding guidelines, "Include mandatory type hints for all function arguments and return types in Python code".
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
async def _fetch_runpod_provider_logs(
    provider_instance: Any,
    cluster_name: str,
    team_id: str,
    tail_lines: int,
) -> str:
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@api/transformerlab/routers/experiment/jobs.py` around lines 46 - 51, The
helper _fetch_runpod_provider_logs is missing a type annotation for the
provider_instance parameter; add the appropriate type (e.g., RunpodProvider,
ProviderInstance, or a suitable protocol/ABC used in the project) to the
function signature and import that type at the top of the module so the
signature becomes async def _fetch_runpod_provider_logs(provider_instance:
<ProviderType>, cluster_name: str, team_id: str, tail_lines: int) -> str,
ensuring the chosen type matches the methods/attributes used from
provider_instance inside the function.
```python
def get_add_if_verified_policy():
    import paramiko

    class AddIfVerified(paramiko.MissingHostKeyPolicy):
        """Custom SSH host key policy that adds host keys after verification."""

        def missing_host_key(self, client, hostname, key):
            """Handle missing host key by adding it to known_hosts after verification."""
            client._host_keys.add(hostname, key.get_name(), key)
            client._host_keys.save(os.path.expanduser("~/.ssh/known_hosts"))
```
Add required type hints to the SSH policy helper.
get_add_if_verified_policy and missing_host_key are missing argument/return annotations. Please add typing without forcing a runtime import.
🛠️ Suggested fix

```diff
 import os
+from typing import TYPE_CHECKING
+
+if TYPE_CHECKING:
+    import paramiko
@@
-def get_add_if_verified_policy():
+def get_add_if_verified_policy() -> "paramiko.MissingHostKeyPolicy":
     import paramiko
@@
-    def missing_host_key(self, client, hostname, key):
+    def missing_host_key(
+        self,
+        client: "paramiko.SSHClient",
+        hostname: str,
+        key: "paramiko.PKey",
+    ) -> None:
```

As per coding guidelines, "Include mandatory type hints for all function arguments and return types in Python code".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@api/transformerlab/shared/ssh_policy.py` around lines 4 - 13, Add type hints
for get_add_if_verified_policy and AddIfVerified.missing_host_key without
importing paramiko at runtime: inside the module add a TYPE_CHECKING guard and
import paramiko types for typing only (from typing import TYPE_CHECKING, Any,
Type; if TYPE_CHECKING: import paramiko). Annotate get_add_if_verified_policy to
return Type["paramiko.MissingHostKeyPolicy"] (or Type[Any] with a string forward
ref to the MissingHostKeyPolicy) and annotate missing_host_key signature as
(self, client: "paramiko.SSHClient", hostname: str, key: "paramiko.PKey") ->
None using string forward references so runtime import is avoided; reference the
AddIfVerified class and missing_host_key method when making these changes.
Don't suggest type annotation issues, since we've not adopted a type checker yet.
@deep1401, understood! I'll stop suggesting type annotation issues since the codebase hasn't adopted a type checker yet. I've removed that from my memory.
⛔ Learnings removed
Learnt from: CR
Repo: transformerlab/transformerlab-app PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-02-18T14:33:14.026Z
Learning: Applies to **/*.py : Include mandatory type hints for all function arguments and return types in Python code
🧠 Learnings used
Learnt from: CR
Repo: transformerlab/transformerlab-app PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-02-18T14:33:14.026Z
Learning: Applies to **/*.py : Include mandatory type hints for all function arguments and return types in Python code
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
api/transformerlab/routers/experiment/jobs.py (2)
240-246: ⚠️ Potential issue | 🟡 Minor: Add type hint for `user_and_team` parameter.

The `user_and_team` parameter is missing a type annotation. Based on the `get_user_and_team` dependency, this returns a dict with user, team_id, and role.

🛠️ Suggested fix

```diff
-from typing import List, Optional
+from typing import Any, Dict, List, Optional
@@
 async def get_provider_job_logs(
     experimentId: str,
     job_id: str,
     tail_lines: int = Query(400, ge=100, le=2000),
-    user_and_team=Depends(get_user_and_team),
+    user_and_team: Dict[str, Any] = Depends(get_user_and_team),
     session: AsyncSession = Depends(get_async_session),
 ):
```

As per coding guidelines, "Include mandatory type hints for all function arguments and return types in Python code".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/transformerlab/routers/experiment/jobs.py` around lines 240 - 246, The function get_provider_job_logs is missing a type annotation for the user_and_team dependency; add a proper type hint matching the return of get_user_and_team (e.g., a TypedDict/Dict with keys "user", "team_id", and "role" or a named type like UserAndTeam) so the signature becomes async def get_provider_job_logs(..., user_and_team: UserAndTeam = Depends(get_user_and_team), ...); update imports if you introduce a new TypedDict/alias and keep the rest of the implementation unchanged.
368-374: ⚠️ Potential issue | 🟡 Minor: Add type hint for `user_and_team` parameter.

Same issue as `get_provider_job_logs`: the `user_and_team` parameter needs a type annotation.

🛠️ Suggested fix

```diff
 async def get_tunnel_info_for_job(
     experimentId: str,
     job_id: str,
     tail_lines: int = Query(400, ge=100, le=2000),
-    user_and_team=Depends(get_user_and_team),
+    user_and_team: Dict[str, Any] = Depends(get_user_and_team),
     session: AsyncSession = Depends(get_async_session),
 ):
```

As per coding guidelines, "Include mandatory type hints for all function arguments and return types in Python code".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/transformerlab/routers/experiment/jobs.py` around lines 368 - 374, The parameter user_and_team in get_tunnel_info_for_job lacks a type annotation; add the same type hint used in get_provider_job_logs for user_and_team (i.e., mirror the annotation used there for Depends(get_user_and_team)) so the function signature for get_tunnel_info_for_job includes a proper type for user_and_team (and likewise ensure the function return type is annotated per project guidelines).
🧹 Nitpick comments (2)
api/transformerlab/routers/experiment/jobs.py (2)
331-342: Extract duplicated log-fetching logic into a helper. This log-fetching pattern (lines 331-342) is duplicated in `get_tunnel_info_for_job` (lines 443-454). Consider extracting it to reduce duplication.

♻️ Suggested refactor

Extract a helper function:

```python
async def _fetch_provider_logs(
    provider,
    provider_instance,
    cluster_name: str,
    provider_job_id: str | int,
    team_id: str,
    tail_lines: int,
) -> str | bytes:
    """Fetch logs from provider, using RunPod-specific method when applicable."""
    if provider.type == ProviderType.RUNPOD.value:
        return await fetch_runpod_provider_logs(
            provider_instance, cluster_name, team_id, tail_lines
        )
    else:
        return provider_instance.get_job_logs(
            cluster_name,
            provider_job_id,
            tail_lines=tail_lines or None,
            follow=False,
        )
```

Then replace both occurrences:

```diff
     try:
-        if provider.type == ProviderType.RUNPOD.value:
-            raw_logs = await fetch_runpod_provider_logs(
-                provider_instance, cluster_name, user_and_team["team_id"], tail_lines
-            )
-        else:
-            raw_logs = provider_instance.get_job_logs(
-                cluster_name,
-                provider_job_id,
-                tail_lines=tail_lines or None,
-                follow=False,
-            )
+        raw_logs = await _fetch_provider_logs(
+            provider, provider_instance, cluster_name, provider_job_id,
+            user_and_team["team_id"], tail_lines
+        )
```
Verify each finding against the current code and only fix it if needed. In `@api/transformerlab/routers/experiment/jobs.py` around lines 331 - 342, Extract the duplicated provider log-fetching logic into a single async helper (e.g. _fetch_provider_logs) that takes provider, provider_instance, cluster_name, provider_job_id, team_id, and tail_lines and returns the raw logs; inside it use the existing check of provider.type == ProviderType.RUNPOD.value to call fetch_runpod_provider_logs(provider_instance, cluster_name, team_id, tail_lines) or otherwise call provider_instance.get_job_logs(cluster_name, provider_job_id, tail_lines=tail_lines or None, follow=False). Replace the duplicated blocks in the function containing the shown diff and in get_tunnel_info_for_job to call this new helper instead.
295: Specify element type for the `List` annotation. Using raw `List` without a type parameter reduces type safety. Based on usage, this should be `list` with an appropriate element type.

♻️ Suggested fix

```diff
-    provider_jobs: List = []
+    provider_jobs: list = []
```

Or more precisely with the actual job type if available:

```python
provider_jobs: list[Any] = []
```
Verify each finding against the current code and only fix it if needed. In `@api/transformerlab/routers/experiment/jobs.py` at line 295, The variable provider_jobs is annotated as the raw typing.List which loses type information; change its annotation to a concrete generic such as provider_jobs: list[Any] = [] or, better, provider_jobs: list[YourJobType] = [] using the actual job type used elsewhere, and add the necessary import (from typing import Any) or the job class/type import so the annotation is valid and type-safe.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
api/transformerlab/routers/compute_provider.py (1)
1419-1456: ⚠️ Potential issue | 🟠 Major: Move RunPod SSH setup into a service layer.

This block is provider-specific operational logic and should live in `api/transformerlab/services/`, with the router only orchestrating request/response flow.

As per coding guidelines, "api/transformerlab/routers/*.py: Place business logic in api/transformerlab/services/ using the Service pattern; routers should only handle HTTP request/response validation and call services".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/transformerlab/routers/compute_provider.py` around lines 1419 - 1456, The SSH setup logic is provider-specific and should be moved from the router into a service; create a service (e.g., transformerlab.services.ssh_setup_service) that exposes a function like prepare_ssh_setup(team_id: str, provider_type: str, env_vars: dict) -> List[str] (or a tuple of env updates + setup_commands) which: calls get_or_create_org_ssh_key_pair and get_org_ssh_public_key, sanitizes and escapes the public key, sets SSH_PUBLIC_KEY for RunPod, and returns the appropriate ssh_setup command(s) for RunPod vs interactive SSH; then replace the in-router block (the code referencing ProviderType, env_vars, public_key_clean/public_key_escaped, and setup_commands) with a single call to that service function and merge returned env_vars/commands into the router's env_vars and setup_commands.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@api/transformerlab/compute_providers/runpod.py`:
- Around line 439-447: The long f-strings assigned to wrapped_cmd (for branches
using combined_cmd, config.setup, and config.command) exceed Ruff's 120-char
limit; refactor each assignment in runpod.py so the shell string is split across
multiple concatenated or parenthesized string parts (or use implicit
adjacent-string literals) while preserving the exact shell content and variables
RUNPOD_RUN_LOGS_PATH and RUNPOD_POD_ID; update the three locations where
wrapped_cmd is set (the combined_cmd branch, the config.setup branch, and the
config.command branch) and keep pod_data["dockerStartCmd"] = ["sh", "-c",
wrapped_cmd] unchanged.
- Around line 36-37: The call to provider_instance.get_cluster_status is
synchronous and may block the event loop — run it in a worker thread (e.g., via
asyncio.to_thread or loop.run_in_executor) inside the async function that uses
cluster_name so the event loop remains responsive; replace the direct call to
provider_instance.get_cluster_status(cluster_name) with an awaited
background-thread invocation and then use the returned status to populate
provider_data as before.
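The offloading pattern the comment asks for is standard `asyncio.to_thread` usage. A minimal sketch, with `FakeProvider` standing in for the real provider class so it runs on its own:

```python
import asyncio
import time


class FakeProvider:
    """Stand-in for the real provider; get_cluster_status blocks like an HTTP call."""

    def get_cluster_status(self, cluster_name: str) -> dict:
        time.sleep(0.05)  # simulates the blocking network request
        return {"name": cluster_name, "state": "RUNNING"}


async def fetch_status(provider_instance, cluster_name: str) -> dict:
    # Run the synchronous call in a worker thread so the event loop
    # stays responsive while the request is in flight
    return await asyncio.to_thread(provider_instance.get_cluster_status, cluster_name)


status = asyncio.run(fetch_status(FakeProvider(), "demo-cluster"))
print(status)  # → {'name': 'demo-cluster', 'state': 'RUNNING'}
```

`loop.run_in_executor(None, ...)` is equivalent on older Pythons; `asyncio.to_thread` requires 3.9+.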
```diff
         wrapped_cmd = f"mkdir -p /workspace && (({combined_cmd}) 2>&1 | tee {RUNPOD_RUN_LOGS_PATH}); runpodctl remove pod $RUNPOD_POD_ID"
         pod_data["dockerStartCmd"] = ["sh", "-c", wrapped_cmd]
     elif config.setup:
-        # Just setup, no command
-        pod_data["dockerStartCmd"] = ["sh", "-c", config.setup]
+        # For setup-only, still keep container running
+        wrapped_cmd = f"({config.setup}); runpodctl remove pod $RUNPOD_POD_ID"
+        pod_data["dockerStartCmd"] = ["sh", "-c", wrapped_cmd]
     elif config.command:
-        # Just command, no setup
-        pod_data["dockerStartCmd"] = ["sh", "-c", config.command]
+        wrapped_cmd = f"mkdir -p /workspace && (({config.command}) 2>&1 | tee {RUNPOD_RUN_LOGS_PATH}); runpodctl remove pod $RUNPOD_POD_ID"
+        pod_data["dockerStartCmd"] = ["sh", "-c", wrapped_cmd]
```
Wrap long command strings to satisfy Ruff line-length (120).
These f-strings look over 120 chars. Please wrap them to avoid Ruff violations.
🧹 Suggested formatting

```diff
-wrapped_cmd = f"mkdir -p /workspace && (({combined_cmd}) 2>&1 | tee {RUNPOD_RUN_LOGS_PATH}); runpodctl remove pod $RUNPOD_POD_ID"
+wrapped_cmd = (
+    f"mkdir -p /workspace && (({combined_cmd}) 2>&1 | tee {RUNPOD_RUN_LOGS_PATH}); "
+    "runpodctl remove pod $RUNPOD_POD_ID"
+)
 ...
-wrapped_cmd = f"mkdir -p /workspace && (({config.command}) 2>&1 | tee {RUNPOD_RUN_LOGS_PATH}); runpodctl remove pod $RUNPOD_POD_ID"
+wrapped_cmd = (
+    f"mkdir -p /workspace && (({config.command}) 2>&1 | tee {RUNPOD_RUN_LOGS_PATH}); "
+    "runpodctl remove pod $RUNPOD_POD_ID"
+)
```

As per coding guidelines, "Use Ruff for Python linting with line-length=120 and indent=4".
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-        wrapped_cmd = f"mkdir -p /workspace && (({combined_cmd}) 2>&1 | tee {RUNPOD_RUN_LOGS_PATH}); runpodctl remove pod $RUNPOD_POD_ID"
-        pod_data["dockerStartCmd"] = ["sh", "-c", wrapped_cmd]
-    elif config.setup:
-        # Just setup, no command
-        pod_data["dockerStartCmd"] = ["sh", "-c", config.setup]
-        # For setup-only, still keep container running
-        wrapped_cmd = f"({config.setup}); runpodctl remove pod $RUNPOD_POD_ID"
-        pod_data["dockerStartCmd"] = ["sh", "-c", wrapped_cmd]
-    elif config.command:
-        # Just command, no setup
-        pod_data["dockerStartCmd"] = ["sh", "-c", config.command]
-        wrapped_cmd = f"mkdir -p /workspace && (({config.command}) 2>&1 | tee {RUNPOD_RUN_LOGS_PATH}); runpodctl remove pod $RUNPOD_POD_ID"
-        pod_data["dockerStartCmd"] = ["sh", "-c", wrapped_cmd]
+        wrapped_cmd = (
+            f"mkdir -p /workspace && (({combined_cmd}) 2>&1 | tee {RUNPOD_RUN_LOGS_PATH}); "
+            "runpodctl remove pod $RUNPOD_POD_ID"
+        )
+        pod_data["dockerStartCmd"] = ["sh", "-c", wrapped_cmd]
+    elif config.setup:
+        # For setup-only, still keep container running
+        wrapped_cmd = f"({config.setup}); runpodctl remove pod $RUNPOD_POD_ID"
+        pod_data["dockerStartCmd"] = ["sh", "-c", wrapped_cmd]
+    elif config.command:
+        wrapped_cmd = (
+            f"mkdir -p /workspace && (({config.command}) 2>&1 | tee {RUNPOD_RUN_LOGS_PATH}); "
+            "runpodctl remove pod $RUNPOD_POD_ID"
+        )
+        pod_data["dockerStartCmd"] = ["sh", "-c", wrapped_cmd]
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@api/transformerlab/compute_providers/runpod.py` around lines 439 - 447, The
long f-strings assigned to wrapped_cmd (for branches using combined_cmd,
config.setup, and config.command) exceed Ruff's 120-char limit; refactor each
assignment in runpod.py so the shell string is split across multiple
concatenated or parenthesized string parts (or use implicit adjacent-string
literals) while preserving the exact shell content and variables
RUNPOD_RUN_LOGS_PATH and RUNPOD_POD_ID; update the three locations where
wrapped_cmd is set (the combined_cmd branch, the config.setup branch, and the
config.command branch) and keep pod_data["dockerStartCmd"] = ["sh", "-c",
wrapped_cmd] unchanged.
There are no ruff violations
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# Check for Ruff configuration files in the repository
echo "=== Looking for Ruff configuration files ==="
fd -t f -d 3 '(ruff\.toml|\.ruff\.toml|pyproject\.toml)$'
echo ""
echo "=== Checking pyproject.toml for Ruff config ==="
if [ -f "pyproject.toml" ]; then
  cat pyproject.toml | grep -A 30 '\[tool\.ruff'
else
  echo "No pyproject.toml found"
fi
echo ""
echo "=== Checking for ruff.toml ==="
if [ -f "ruff.toml" ]; then
  cat ruff.toml
else
  echo "No ruff.toml found"
fi
echo ""
echo "=== Running Ruff check on the file to see actual violations ==="
rg --files | grep 'api/transformerlab/compute_providers/runpod.py' | head -1 | xargs ruff check --select=E501 2>&1 || echo "No E501 violations found or ruff not available"
echo ""
echo "=== Running Ruff check with all rules enabled to see what's actually flagged ==="
rg --files | grep 'api/transformerlab/compute_providers/runpod.py' | head -1 | xargs ruff check 2>&1 || echo "No violations found or ruff not available"
```
Length of output: 1061
@deep1401, you're absolutely right. I checked the ruff.toml configuration and confirmed that while line-length = 120 is set, the E501 rule (line-too-long) is not enabled. Ruff only flags line length violations when E501 is explicitly enabled via select or extend-select, which isn't the case in this project's configuration.
My apologies for the incorrect suggestion. The long f-strings in these lines are fine as-is. Thank you for the correction!
🧬 Code Graph Analysis Results
<no_relevant_code_snippets>
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@api/transformerlab/compute_providers/runpod.py`:
- Around line 435-448: The current elif branches are redundant and create
line-length and logging inconsistencies: remove the separate "elif
config.setup:" and "elif config.command:" branches and consolidate command
construction so docker_cmds drives pod_data["dockerStartCmd"]; ensure any
setup-only commands are appended to docker_cmds (so they get handled by the same
path), and always pipe output through RUNPOD_RUN_LOGS_PATH (tee) before the
final "runpodctl remove pod $RUNPOD_POD_ID". Also shorten the long wrapped
command by composing it from smaller pieces (e.g., prefix "mkdir -p /workspace
&&" + the combined command + the tee/removal suffix) so the resulting string
assigned to pod_data["dockerStartCmd"] stays under the 120-char limit while
preserving behavior.
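The consolidation this comment describes could look like the following minimal sketch. `build_docker_start_cmd` is a hypothetical helper, and the `RUNPOD_RUN_LOGS_PATH` value is an assumption for illustration; the point is that setup-only, command-only, and combined cases all flow through one wrapper that tees logs and then removes the pod:

```python
from typing import List, Optional

# Assumed value for this sketch; the real constant lives in runpod.py
RUNPOD_RUN_LOGS_PATH = "/workspace/run_logs.txt"


def build_docker_start_cmd(setup: Optional[str], command: Optional[str]) -> Optional[List[str]]:
    """Build one dockerStartCmd for any combination of setup/command."""
    parts = [p for p in (setup, command) if p]
    if not parts:
        return None  # nothing to run; leave dockerStartCmd unset
    inner = " && ".join(parts)
    # Every branch gets the same wrapper: tee output to the log file,
    # then tear the pod down so it is not restarted or left running
    wrapped = (
        f"mkdir -p /workspace && (({inner}) 2>&1 | tee {RUNPOD_RUN_LOGS_PATH}); "
        "runpodctl remove pod $RUNPOD_POD_ID"
    )
    return ["sh", "-c", wrapped]
```

With this shape, `pod_data["dockerStartCmd"] = build_docker_start_cmd(config.setup, config.command)` replaces the three `elif` branches, and each generated line stays comfortably under 120 characters.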
Summary by CodeRabbit
New Features
Bug Fixes
Chores