
Make provider logs permanent #1376

Merged
deep1401 merged 6 commits into main from fix/make-provider-logs-permanent
Feb 24, 2026
Conversation

Member

@deep1401 deep1401 commented Feb 23, 2026

  • Saves provider logs in the job data as provider_logs.txt
  • Logs remain available after the job is stopped or finished
  • Caveat: only run logs are captured, not the setup logs

Summary by CodeRabbit

  • New Features

    • Added a "View live provider logs" toggle in the experiment output view. By default the app shows persisted (cached) provider logs saved with the job for quicker access; enabling the toggle fetches live logs directly from the compute provider and shows a warning.
  • Chores

    • Updated transformerlab and lab-sdk dependencies to v0.0.82.


sentry bot commented Feb 23, 2026

Codecov Report

❌ Patch coverage is 0% with 14 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
api/transformerlab/routers/experiment/jobs.py | 0.00% | 14 Missing ⚠️


@aliasaria
Member

How do I test this?

Member Author

@deep1401 deep1401 left a comment


How do I test this?

@aliasaria there is a line in compute_providers.py that appends pip install -q transformerlab to the setup commands. You'll have to switch that to: git clone https://github.com/transformerlab/transformerlab-app; cd transformerlab-app/lab-sdk; git checkout fix/make-provider-logs-permanent; pip install -e .; cd ~

Once you do that, it'll install this branch's version of lab-sdk, and you can check the provider logs by launching any task.

Contributor

coderabbitai bot commented Feb 23, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7261072 and 763c5f6.

📒 Files selected for processing (1)
  • api/transformerlab/routers/experiment/jobs.py

📝 Walkthrough

Adds live vs cached provider-log support across CLI SDK, backend API, frontend client, and UI; also bumps package versions.

Changes

Cohort / File(s) | Summary

Dependency Updates
api/pyproject.toml, lab-sdk/pyproject.toml
Bumped the transformerlab / project version from 0.0.81 to 0.0.82.

API endpoints
api/transformerlab/routers/experiment/jobs.py
Added a live: bool query param to the provider log endpoints; when live=false, the endpoint first attempts to read provider_logs.txt from job storage and falls back to provider-native logs; live=true forces a provider-native fetch.

SDK (remote wrapper)
lab-sdk/src/lab/remote_trap.py
Added a storage import and the helpers _write_provider_logs_async and _write_provider_logs to persist combined stdout/stderr into provider_logs.txt; command execution now captures stdout/stderr and writes the logs.

Frontend client
src/renderer/lib/api-client/endpoints.ts
Extended the GetProviderLogs signature with an optional live parameter and appends &live=${live} to the provider_logs URL.

Frontend UI
src/renderer/components/Experiment/Tasks/ViewOutputModalStreaming.tsx
Added a "View live provider logs" checkbox and viewLiveProviderLogs state; toggling controls whether the client requests live or cached logs; adjusted layout/styling of the provider logs panel.
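The client change above (appending &live=${live}) is small enough to mirror in a few lines. This sketch is illustrative only: the base URL and route shape are assumptions, not the app's actual endpoint layout.

```python
import urllib.parse


def provider_logs_url(
    base: str,
    experiment_id: int,
    job_id: int,
    tail_lines: int = 100,
    live: bool = False,
) -> str:
    """Build a provider_logs URL with the new live flag (route shape is hypothetical)."""
    query = urllib.parse.urlencode(
        {"tail_lines": tail_lines, "live": str(live).lower()}
    )
    return f"{base}/experiment/{experiment_id}/jobs/{job_id}/provider_logs?{query}"
```

With live=false (the default) the backend serves the cached provider_logs.txt when present; live=true forces a provider-native fetch.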

Sequence Diagram

sequenceDiagram
    participant User as User
    participant UI as Frontend UI
    participant Client as API Client
    participant API as Backend API
    participant Storage as Job Storage
    participant Provider as Compute Provider

    User->>UI: Toggle "View live provider logs"
    UI->>Client: GetProviderLogs(live=true/false)
    Client->>API: GET /provider_logs?tail_lines=...&live={flag}
    alt live = false
        API->>Storage: Read provider_logs.txt
        alt exists
            Storage-->>API: cached logs
        else
            API->>Provider: Fetch logs from provider
            Provider-->>API: live logs
        end
    else live = true
        API->>Provider: Fetch logs from provider
        Provider-->>API: live logs
    end
    API-->>Client: Logs response
    Client-->>UI: Display logs

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested labels

mode:multiuser

Suggested reviewers

  • aliasaria

Poem

🐰 I scurried to save every stdout and stderr line,
I wrote them to storage so cached logs align.
Flip my toggle for live, or let the file stay—
Hop, fetch, and display: your logs, either way. 🥕

🚥 Pre-merge checks | ✅ 3 passed

Check name | Status | Explanation
  • Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled.
  • Title check | ✅ Passed | The title 'Make provider logs permanent' accurately captures the main objective of the PR, which is to persist provider logs as a file (provider_logs.txt) in the job directory so they remain available after job completion.
  • Docstring Coverage | ✅ Passed | Docstring coverage is 85.71%, which meets the required threshold of 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
api/transformerlab/routers/experiment/jobs.py (2)

239-250: ⚠️ Potential issue | 🟡 Minor

CI: Ruff format check failed.

Please run ruff format on the affected file(s) to fix the formatting failure.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/transformerlab/routers/experiment/jobs.py` around lines 239 - 250, The
file fails Ruff formatting around the get_provider_job_logs route definition;
run `ruff format` (or your editor integration) on
api/transformerlab/routers/experiment/jobs.py to automatically fix whitespace,
line breaks, imports and line-length issues around the get_provider_job_logs
function signature and nearby code; after formatting, re-run the linter/CI to
confirm the ruff format check passes.

239-261: ⚠️ Potential issue | 🟡 Minor

Add return type annotation and Pydantic response model for get_provider_job_logs.

The function currently returns an untyped dict without a response_model on the route. Add an explicit return type and create a Pydantic schema in schemas/ (e.g., ProviderJobLogsResponse), then set response_model on the @router.get() decorator to ensure type consistency and proper serialization.

The return structure contains: cluster_name, provider_id, provider_job_id, provider_name, tail_lines, logs (str), and job_candidates (list of job metadata dicts).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/transformerlab/routers/experiment/jobs.py` around lines 239 - 261, Add a
Pydantic response model (e.g., schemas.ProviderJobLogsResponse) and a nested
JobCandidate schema to represent job metadata, then annotate
get_provider_job_logs with that return type and set
response_model=ProviderJobLogsResponse on the `@router.get` decorator; ensure the
model includes cluster_name, provider_id, provider_job_id, provider_name,
tail_lines (int), logs (str) and job_candidates (List[JobCandidate]), import the
schema into the router module, and return an instance (or dict() of the
instance) matching the schema so FastAPI can validate/serialize the response.
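A minimal sketch of the suggested schema, using the field list above; the JobCandidate fields are hypothetical, since the actual job metadata shape isn't shown here.

```python
from typing import List, Optional

from pydantic import BaseModel


class JobCandidate(BaseModel):
    # Hypothetical fields; the real job metadata dicts may differ.
    job_id: str
    status: Optional[str] = None


class ProviderJobLogsResponse(BaseModel):
    cluster_name: Optional[str] = None
    provider_id: Optional[str] = None
    provider_job_id: Optional[str] = None
    provider_name: Optional[str] = None
    tail_lines: Optional[int] = None
    logs: str
    job_candidates: List[JobCandidate] = []
```

The route would then declare response_model=ProviderJobLogsResponse on its @router.get() decorator so FastAPI validates and serializes the dict it already returns.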
🧹 Nitpick comments (2)
api/transformerlab/routers/experiment/jobs.py (1)

289-296: Consider streaming the tail to avoid loading the full log into memory.

provider_logs.txt can grow large; reading it fully just to take the last N lines can be expensive. A streaming tail that keeps only the last N lines in memory would scale better.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/transformerlab/routers/experiment/jobs.py` around lines 289 - 296, The
current code reads the entire provider_logs.txt via storage.open and then slices
lines to tail_lines; replace this with a streaming tail that reads the file in
chunks and keeps only the last N lines in memory (e.g., use an async chunked
read or readlines iterator and a fixed-size deque) so provider_logs_path
handling does not load the whole file; update the branch that checks tail_lines
in the block using storage.exists/storage.open to read in streaming fashion and
produce the same logs_text (last N lines) when tail_lines is set, otherwise keep
the full read behavior.
src/renderer/components/Experiment/Tasks/ViewOutputModalStreaming.tsx (1)

5-5: Add a small component test for the live‑logs toggle.

A focused test that flips the checkbox and verifies the live flag is passed (and resets on job change) would lock in the new behavior. As per coding guidelines: "Test frontend components in isolation where possible".

Also applies to: 219-243

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/renderer/components/Experiment/Tasks/ViewOutputModalStreaming.tsx` at
line 5, Add a unit test for the ViewOutputModalStreaming component that mounts
the component, finds the Checkbox, simulates clicking it to flip the live-logs
toggle and asserts the component calls the provided toggle handler (e.g.,
onToggleLive or similar prop) with the expected true/false values; then update
the job identifier prop (simulate job change) and assert the Checkbox resets to
its default/off state and the handler is called appropriately. Use React Testing
Library to render ViewOutputModalStreaming, fireEvent or userEvent to toggle the
Checkbox, mock the onToggle handler to verify calls, and re-render with a
different jobId to confirm the reset behavior. Ensure the test covers both the
toggle flip and the reset-on-job-change behaviors referenced around lines
219-243.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dd4d4c0 and 7261072.

📒 Files selected for processing (6)
  • api/pyproject.toml
  • api/transformerlab/routers/experiment/jobs.py
  • lab-sdk/pyproject.toml
  • lab-sdk/src/lab/remote_trap.py
  • src/renderer/components/Experiment/Tasks/ViewOutputModalStreaming.tsx
  • src/renderer/lib/api-client/endpoints.ts

Comment on lines +281 to +308
# 1) If live=False (default), first try to read provider logs from the job directory
#    via the SDK's job_dir helper. This file is written by tfl-remote-trap inside
#    the remote environment.
if not live:
    try:
        from lab.dirs import get_job_dir

        job_dir = await get_job_dir(job_id)
        provider_logs_path = storage.join(job_dir, "provider_logs.txt")
        if await storage.exists(provider_logs_path):
            async with await storage.open(provider_logs_path, "r", encoding="utf-8") as f:
                logs_text = await f.read()
            if tail_lines is not None:
                lines = logs_text.splitlines()
                logs_text = "\n".join(lines[-tail_lines:])

            return {
                "cluster_name": cluster_name,
                "provider_id": provider_id,
                "provider_job_id": None,
                "provider_name": job_data.get("provider_name"),
                "tail_lines": tail_lines,
                "logs": logs_text,
                "job_candidates": [],
            }
    except Exception:
        # If anything goes wrong with the file-based path, fall back to provider-native logs.
        pass
Contributor


🛠️ Refactor suggestion | 🟠 Major

Move provider‑log retrieval logic into the service layer.

The new file-based retrieval and fallback logic is business behavior that belongs in api/transformerlab/services/, with the router focused on HTTP concerns only. As per coding guidelines: "api/transformerlab/{routers,services}/**/*.py: Implement business logic in api/transformerlab/services/ using the Service Pattern; routers should only handle HTTP validation and call services".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/transformerlab/routers/experiment/jobs.py` around lines 281 - 308, The
file-based provider log retrieval (the block that imports lab.dirs.get_job_dir,
reads provider_logs.txt, truncates tail_lines and returns the logs object) is
business logic and should be moved from the router into a service; extract this
behavior into a new service function (e.g.,
services.jobs.get_provider_logs_from_job_dir or
services.logs.retrieve_provider_logs) that accepts job_id, tail_lines and
storage (and internally calls lab.dirs.get_job_dir/ storage.exists/
storage.open), returns the same dict shape, and handles exceptions/fallbacks; in
api/transformerlab/routers/experiment/jobs.py replace that inlined block with a
call to the new service function and let the router only deal with HTTP concerns
(validation, response formatting, and catching service errors).

Comment on lines 124 to 159
# Run the original command in the shell so it behaves exactly as submitted.
# Capture stdout/stderr so we can save a copy to provider_logs.txt while still
# echoing output to the current process streams.
completed = subprocess.run(
    command_str,
    shell=True,
    capture_output=True,
    text=True,
)

# Echo captured output back to the current stdout/stderr so provider-native logs
# (e.g., SkyPilot, SLURM, RunPod) still see the same content.
if completed.stdout:
    try:
        sys.stdout.write(completed.stdout)
        sys.stdout.flush()
    except Exception:
        pass
if completed.stderr:
    try:
        sys.stderr.write(completed.stderr)
        sys.stderr.flush()
    except Exception:
        pass

# Combine stdout + stderr into a single text blob and store it alongside the job.
combined_logs_parts: List[str] = []
if completed.stdout:
    combined_logs_parts.append(completed.stdout)
if completed.stderr:
    combined_logs_parts.append(completed.stderr)
combined_logs = "\n".join(part.rstrip("\n") for part in combined_logs_parts)

_write_provider_logs(combined_logs)

exit_code = completed.returncode
Contributor


⚠️ Potential issue | 🟡 Minor


Consider the design intent before refactoring: the current approach intentionally captures logs for post-execution persistence.

The current flow (capture output → echo to stdout/stderr → write provider_logs.txt after completion) is by design. Real-time output already streams to the terminal via lines 136-147. For live log access during execution, the API endpoint has a live=true parameter to fetch logs directly from the compute provider instead of the cached file.

While streaming output line-by-line during execution and writing incrementally to provider_logs.txt could be useful for very long-running jobs with massive outputs, this would:

  • Change the error handling model (partial captures if job crashes mid-write)
  • Require merging stderr into stdout (the proposed Popen solution uses stderr=subprocess.STDOUT)
  • Still buffer all lines in memory (combined_logs_parts), so the memory spike concern isn't fully resolved

If incremental log persistence is desired, clarify whether it should be streamed only during execution, whether stderr should remain separate, and whether partial writes are acceptable on process failure.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lab-sdk/src/lab/remote_trap.py` around lines 124 - 159, The current
capture→echo→persist flow (subprocess.run called with command_str, producing
completed; echoing completed.stdout/completed.stderr; combining into
combined_logs and calling _write_provider_logs) is intentional for
post-execution persistence and live log access via the API’s live=true; do not
refactor to incremental streaming or Popen merging stdout/stderr now — instead
add a short clarifying comment near the subprocess.run/completed handling (and
reference completed, combined_logs, and _write_provider_logs) stating the design
rationale (capture for persistence, echo for real-time terminal output,
live=true for live access) so future contributors understand why incremental
writes/merging were avoided.

@deep1401 deep1401 merged commit 0b3d32b into main Feb 24, 2026
10 of 11 checks passed


2 participants