Conversation
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
|
How do I test this? |
There was a problem hiding this comment.
How do I test this?
@aliasaria there is a line in compute_providers.py which appends pip install -q transformerlab to setup commands. You'll have to switch that to: git clone https://github.com/transformerlab/transformerlab-app; cd transformerlab-app/lab-sdk; git checkout fix/make-provider-logs-permanent; pip install -e .; cd ~
Once you do that it'll install this version of lab-sdk and you can check the provider logs by launching any task.
|
Caution Review failedThe pull request is closed. ℹ️ Recent review infoConfiguration used: defaults Review profile: CHILL Plan: Pro 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughAdds live vs cached provider-log support across CLI SDK, backend API, frontend client, and UI; also bumps package versions. Changes
Sequence DiagramsequenceDiagram
participant User as User
participant UI as Frontend UI
participant Client as API Client
participant API as Backend API
participant Storage as Job Storage
participant Provider as Compute Provider
User->>UI: Toggle "View live provider logs"
UI->>Client: GetProviderLogs(live=true/false)
Client->>API: GET /provider_logs?tail_lines=...&live={flag}
alt live = false
API->>Storage: Read provider_logs.txt
alt exists
Storage-->>API: cached logs
else
API->>Provider: Fetch logs from provider
Provider-->>API: live logs
end
else live = true
API->>Provider: Fetch logs from provider
Provider-->>API: live logs
end
API-->>Client: Logs response
Client-->>UI: Display logs
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
api/transformerlab/routers/experiment/jobs.py (2)
239-250:⚠️ Potential issue | 🟡 MinorCI: Ruff format check failed.
Please run
ruff formaton the affected file(s) to fix the formatting failure.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/transformerlab/routers/experiment/jobs.py` around lines 239 - 250, The file fails Ruff formatting around the get_provider_job_logs route definition; run `ruff format` (or your editor integration) on api/transformerlab/routers/experiment/jobs.py to automatically fix whitespace, line breaks, imports and line-length issues around the get_provider_job_logs function signature and nearby code; after formatting, re-run the linter/CI to confirm the ruff format check passes.
239-261:⚠️ Potential issue | 🟡 MinorAdd return type annotation and Pydantic response model for
get_provider_job_logs.The function currently returns an untyped dict without a
response_modelon the route. Add an explicit return type and create a Pydantic schema inschemas/(e.g.,ProviderJobLogsResponse), then setresponse_modelon the@router.get()decorator to ensure type consistency and proper serialization.The return structure contains:
cluster_name,provider_id,provider_job_id,provider_name,tail_lines,logs(str), andjob_candidates(list of job metadata dicts).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/transformerlab/routers/experiment/jobs.py` around lines 239 - 261, Add a Pydantic response model (e.g., schemas.ProviderJobLogsResponse) and a nested JobCandidate schema to represent job metadata, then annotate get_provider_job_logs with that return type and set response_model=ProviderJobLogsResponse on the `@router.get` decorator; ensure the model includes cluster_name, provider_id, provider_job_id, provider_name, tail_lines (int), logs (str) and job_candidates (List[JobCandidate]), import the schema into the router module, and return an instance (or dict() of the instance) matching the schema so FastAPI can validate/serialize the response.
🧹 Nitpick comments (2)
api/transformerlab/routers/experiment/jobs.py (1)
289-296: Consider streaming the tail to avoid loading the full log into memory.
provider_logs.txtcan grow large; reading it fully just to take the last N lines can be expensive. A streaming tail that keeps only the last N lines in memory would scale better.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/transformerlab/routers/experiment/jobs.py` around lines 289 - 296, The current code reads the entire provider_logs.txt via storage.open and then slices lines to tail_lines; replace this with a streaming tail that reads the file in chunks and keeps only the last N lines in memory (e.g., use an async chunked read or readlines iterator and a fixed-size deque) so provider_logs_path handling does not load the whole file; update the branch that checks tail_lines in the block using storage.exists/storage.open to read in streaming fashion and produce the same logs_text (last N lines) when tail_lines is set, otherwise keep the full read behavior.src/renderer/components/Experiment/Tasks/ViewOutputModalStreaming.tsx (1)
5-5: Add a small component test for the live‑logs toggle.A focused test that flips the checkbox and verifies the live flag is passed (and resets on job change) would lock in the new behavior. As per coding guidelines: "Test frontend components in isolation where possible".
Also applies to: 219-243
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/renderer/components/Experiment/Tasks/ViewOutputModalStreaming.tsx` at line 5, Add a unit test for the ViewOutputModalStreaming component that mounts the component, finds the Checkbox, simulates clicking it to flip the live-logs toggle and asserts the component calls the provided toggle handler (e.g., onToggleLive or similar prop) with the expected true/false values; then update the job identifier prop (simulate job change) and assert the Checkbox resets to its default/off state and the handler is called appropriately. Use React Testing Library to render ViewOutputModalStreaming, fireEvent or userEvent to toggle the Checkbox, mock the onToggle handler to verify calls, and re-render with a different jobId to confirm the reset behavior. Ensure the test covers both the toggle flip and the reset-on-job-change behaviors referenced around lines 219-243.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@api/transformerlab/routers/experiment/jobs.py`:
- Around line 281-308: The file-based provider log retrieval (the block that
imports lab.dirs.get_job_dir, reads provider_logs.txt, truncates tail_lines and
returns the logs object) is business logic and should be moved from the router
into a service; extract this behavior into a new service function (e.g.,
services.jobs.get_provider_logs_from_job_dir or
services.logs.retrieve_provider_logs) that accepts job_id, tail_lines and
storage (and internally calls lab.dirs.get_job_dir/ storage.exists/
storage.open), returns the same dict shape, and handles exceptions/fallbacks; in
api/transformerlab/routers/experiment/jobs.py replace that inlined block with a
call to the new service function and let the router only deal with HTTP concerns
(validation, response formatting, and catching service errors).
In `@lab-sdk/src/lab/remote_trap.py`:
- Around line 124-159: The current capture→echo→persist flow (subprocess.run
called with command_str, producing completed; echoing
completed.stdout/completed.stderr; combining into combined_logs and calling
_write_provider_logs) is intentional for post-execution persistence and live log
access via the API’s live=true; do not refactor to incremental streaming or
Popen merging stdout/stderr now — instead add a short clarifying comment near
the subprocess.run/completed handling (and reference completed, combined_logs,
and _write_provider_logs) stating the design rationale (capture for persistence,
echo for real-time terminal output, live=true for live access) so future
contributors understand why incremental writes/merging were avoided.
---
Outside diff comments:
In `@api/transformerlab/routers/experiment/jobs.py`:
- Around line 239-250: The file fails Ruff formatting around the
get_provider_job_logs route definition; run `ruff format` (or your editor
integration) on api/transformerlab/routers/experiment/jobs.py to automatically
fix whitespace, line breaks, imports and line-length issues around the
get_provider_job_logs function signature and nearby code; after formatting,
re-run the linter/CI to confirm the ruff format check passes.
- Around line 239-261: Add a Pydantic response model (e.g.,
schemas.ProviderJobLogsResponse) and a nested JobCandidate schema to represent
job metadata, then annotate get_provider_job_logs with that return type and set
response_model=ProviderJobLogsResponse on the `@router.get` decorator; ensure the
model includes cluster_name, provider_id, provider_job_id, provider_name,
tail_lines (int), logs (str) and job_candidates (List[JobCandidate]), import the
schema into the router module, and return an instance (or dict() of the
instance) matching the schema so FastAPI can validate/serialize the response.
---
Nitpick comments:
In `@api/transformerlab/routers/experiment/jobs.py`:
- Around line 289-296: The current code reads the entire provider_logs.txt via
storage.open and then slices lines to tail_lines; replace this with a streaming
tail that reads the file in chunks and keeps only the last N lines in memory
(e.g., use an async chunked read or readlines iterator and a fixed-size deque)
so provider_logs_path handling does not load the whole file; update the branch
that checks tail_lines in the block using storage.exists/storage.open to read in
streaming fashion and produce the same logs_text (last N lines) when tail_lines
is set, otherwise keep the full read behavior.
In `@src/renderer/components/Experiment/Tasks/ViewOutputModalStreaming.tsx`:
- Line 5: Add a unit test for the ViewOutputModalStreaming component that mounts
the component, finds the Checkbox, simulates clicking it to flip the live-logs
toggle and asserts the component calls the provided toggle handler (e.g.,
onToggleLive or similar prop) with the expected true/false values; then update
the job identifier prop (simulate job change) and assert the Checkbox resets to
its default/off state and the handler is called appropriately. Use React Testing
Library to render ViewOutputModalStreaming, fireEvent or userEvent to toggle the
Checkbox, mock the onToggle handler to verify calls, and re-render with a
different jobId to confirm the reset behavior. Ensure the test covers both the
toggle flip and the reset-on-job-change behaviors referenced around lines
219-243.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
api/pyproject.tomlapi/transformerlab/routers/experiment/jobs.pylab-sdk/pyproject.tomllab-sdk/src/lab/remote_trap.pysrc/renderer/components/Experiment/Tasks/ViewOutputModalStreaming.tsxsrc/renderer/lib/api-client/endpoints.ts
| # 1) If live=False (default), first try to read provider logs from the job directory | ||
| # via the SDK's job_dir helper. This file is written by tfl-remote-trap inside | ||
| # the remote environment. | ||
| if not live: | ||
| try: | ||
| from lab.dirs import get_job_dir | ||
|
|
||
| job_dir = await get_job_dir(job_id) | ||
| provider_logs_path = storage.join(job_dir, "provider_logs.txt") | ||
| if await storage.exists(provider_logs_path): | ||
| async with await storage.open(provider_logs_path, "r", encoding="utf-8") as f: | ||
| logs_text = await f.read() | ||
| if tail_lines is not None: | ||
| lines = logs_text.splitlines() | ||
| logs_text = "\n".join(lines[-tail_lines:]) | ||
|
|
||
| return { | ||
| "cluster_name": cluster_name, | ||
| "provider_id": provider_id, | ||
| "provider_job_id": None, | ||
| "provider_name": job_data.get("provider_name"), | ||
| "tail_lines": tail_lines, | ||
| "logs": logs_text, | ||
| "job_candidates": [], | ||
| } | ||
| except Exception: | ||
| # If anything goes wrong with the file-based path, fall back to provider-native logs. | ||
| pass |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major
Move provider‑log retrieval logic into the service layer.
The new file-based retrieval and fallback logic is business behavior that belongs in api/transformerlab/services/, with the router focused on HTTP concerns only. As per coding guidelines: "api/transformerlab/{routers,services}/**/*.py: Implement business logic in api/transformerlab/services/ using the Service Pattern; routers should only handle HTTP validation and call services".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@api/transformerlab/routers/experiment/jobs.py` around lines 281 - 308, The
file-based provider log retrieval (the block that imports lab.dirs.get_job_dir,
reads provider_logs.txt, truncates tail_lines and returns the logs object) is
business logic and should be moved from the router into a service; extract this
behavior into a new service function (e.g.,
services.jobs.get_provider_logs_from_job_dir or
services.logs.retrieve_provider_logs) that accepts job_id, tail_lines and
storage (and internally calls lab.dirs.get_job_dir/ storage.exists/
storage.open), returns the same dict shape, and handles exceptions/fallbacks; in
api/transformerlab/routers/experiment/jobs.py replace that inlined block with a
call to the new service function and let the router only deal with HTTP concerns
(validation, response formatting, and catching service errors).
| # Run the original command in the shell so it behaves exactly as submitted. | ||
| completed = subprocess.run(command_str, shell=True) | ||
| # Capture stdout/stderr so we can save a copy to provider_logs.txt while still | ||
| # echoing output to the current process streams. | ||
| completed = subprocess.run( | ||
| command_str, | ||
| shell=True, | ||
| capture_output=True, | ||
| text=True, | ||
| ) | ||
|
|
||
| # Echo captured output back to the current stdout/stderr so provider-native logs | ||
| # (e.g., SkyPilot, SLURM, RunPod) still see the same content. | ||
| if completed.stdout: | ||
| try: | ||
| sys.stdout.write(completed.stdout) | ||
| sys.stdout.flush() | ||
| except Exception: | ||
| pass | ||
| if completed.stderr: | ||
| try: | ||
| sys.stderr.write(completed.stderr) | ||
| sys.stderr.flush() | ||
| except Exception: | ||
| pass | ||
|
|
||
| # Combine stdout + stderr into a single text blob and store it alongside the job. | ||
| combined_logs_parts: List[str] = [] | ||
| if completed.stdout: | ||
| combined_logs_parts.append(completed.stdout) | ||
| if completed.stderr: | ||
| combined_logs_parts.append(completed.stderr) | ||
| combined_logs = "\n".join(part.rstrip("\n") for part in combined_logs_parts) | ||
|
|
||
| _write_provider_logs(combined_logs) | ||
|
|
||
| exit_code = completed.returncode |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, find and examine the file
git ls-files | grep -E "remote_trap\.py|provider.*log" | head -20Repository: transformerlab/transformerlab-app
Length of output: 107
🏁 Script executed:
# Look for the specific file
fd "remote_trap.py"Repository: transformerlab/transformerlab-app
Length of output: 107
🏁 Script executed:
# If found, check the file size and content around the specified lines
if fd "remote_trap.py" -x wc -l; then
fd "remote_trap.py" -x head -200 > /tmp/file_preview.txt && head -50 /tmp/file_preview.txt
fiRepository: transformerlab/transformerlab-app
Length of output: 1699
🏁 Script executed:
# Check how subprocess.run with capture_output works
python3 << 'EOF'
import subprocess
import sys
# Document the behavior of capture_output
help_text = """
subprocess.run with capture_output=True:
- Returns a CompletedProcess with stdout/stderr as strings
- Output IS buffered until process completes
- Process output is NOT streamed in real-time to terminal/logs
- Large outputs can accumulate in memory
subprocess.Popen approach:
- Process output can be streamed line-by-line in real-time
- Does not accumulate all output in memory at once
- Requires manual stdout/stderr handling
"""
print(help_text)
EOFRepository: transformerlab/transformerlab-app
Length of output: 506
🏁 Script executed:
# Search for all usages of _write_provider_logs to understand how logs are used
rg "_write_provider_logs" -B 5 -A 5Repository: transformerlab/transformerlab-app
Length of output: 3234
🏁 Script executed:
# Look at the broader context of the remote_trap.py file
fd "remote_trap.py" -x cat -n | head -250Repository: transformerlab/transformerlab-app
Length of output: 6817
🏁 Script executed:
# Search for references to live logs or streaming behavior
rg "live|stream|realtime|real-time" lab-sdk/src/lab/ -i -B 2 -A 2Repository: transformerlab/transformerlab-app
Length of output: 5786
🏁 Script executed:
# Check if there's any documentation or issues about log streaming requirements
rg "provider.?log|live.*log|live.*status" lab-sdk/src/lab/ -i -B 2 -A 2Repository: transformerlab/transformerlab-app
Length of output: 6099
🏁 Script executed:
# Check if stderr being separate from stdout matters anywhere else in the codebase
rg "stderr|stdout" lab-sdk/src/lab/ -lRepository: transformerlab/transformerlab-app
Length of output: 131
🏁 Script executed:
# Check for tests or documentation about remote_trap behavior
fd "test.*remote.*trap|remote.*trap.*test" -iRepository: transformerlab/transformerlab-app
Length of output: 59
🏁 Script executed:
# Look for any GitHub issues or comments about live log expectations
rg "live.*log|provider.*log" --type=md --type=txt --type=py -B 2 -A 2Repository: transformerlab/transformerlab-app
Length of output: 20310
🏁 Script executed:
# Check if there are any consumers of provider_logs.txt to understand how critical real-time updates are
rg "provider_logs\.txt" -B 3 -A 3Repository: transformerlab/transformerlab-app
Length of output: 3488
🏁 Script executed:
# Examine the actual memory impact concern - check typical subprocess patterns in the codebase
rg "subprocess\.run|subprocess\.Popen" -B 2 -A 5Repository: transformerlab/transformerlab-app
Length of output: 36052
🏁 Script executed:
# Check if there are any long-running jobs or large output patterns that would motivate the streaming concern
rg "long.?run|stream|live.*log|real.*time" lab-sdk/ -i --type=py -B 1 -A 1 | head -50Repository: transformerlab/transformerlab-app
Length of output: 1566
🏁 Script executed:
# Look at what "live" parameter actually means in the logs endpoint
cat -n api/transformerlab/routers/experiment/jobs.py | sed -n '1,100p'Repository: transformerlab/transformerlab-app
Length of output: 4198
Consider the design intent before refactoring: the current approach intentionally captures logs for post-execution persistence.
The current flow (capture output → echo to stdout/stderr → write provider_logs.txt after completion) is by design. Real-time output already streams to the terminal via lines 136-147. For live log access during execution, the API endpoint has a live=true parameter to fetch logs directly from the compute provider instead of the cached file.
While streaming output line-by-line during execution and writing incrementally to provider_logs.txt could be useful for very long-running jobs with massive outputs, this would:
- Change the error handling model (partial captures if job crashes mid-write)
- Require merging stderr into stdout (the proposed Popen solution uses
stderr=subprocess.STDOUT) - Still buffer all lines in memory (combined_logs_parts), so the memory spike concern isn't fully resolved
If incremental log persistence is desired, clarify whether it should be streamed only during execution, whether stderr should remain separate, and whether partial writes are acceptable on process failure.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@lab-sdk/src/lab/remote_trap.py` around lines 124 - 159, The current
capture→echo→persist flow (subprocess.run called with command_str, producing
completed; echoing completed.stdout/completed.stderr; combining into
combined_logs and calling _write_provider_logs) is intentional for
post-execution persistence and live log access via the API’s live=true; do not
refactor to incremental streaming or Popen merging stdout/stderr now — instead
add a short clarifying comment near the subprocess.run/completed handling (and
reference completed, combined_logs, and _write_provider_logs) stating the design
rationale (capture for persistence, echo for real-time terminal output,
live=true for live access) so future contributors understand why incremental
writes/merging were avoided.
provider_logs.txtSummary by CodeRabbit
New Features
Chores