Skip to content

Simplify installer#1628

Merged
deep1401 merged 27 commits intomainfrom
fix/simplify-installer
Mar 26, 2026
Merged

Simplify installer#1628
deep1401 merged 27 commits intomainfrom
fix/simplify-installer

Conversation

@deep1401
Copy link
Copy Markdown
Member

@deep1401 deep1401 commented Mar 24, 2026

To try:

  • Just run ./install.sh multiuser_setup
  • Ensure ~/.transformerlab also has a lab-sdk dir now
  • Everything will now run through ~/.transformerlab/envs/general-uv
  • Delete your existing local provider and try installing a new one

@sentry
Copy link
Copy Markdown

sentry bot commented Mar 24, 2026

Codecov Report

❌ Patch coverage is 37.79070% with 107 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
api/transformerlab/compute_providers/local.py 37.58% 79 Missing and 9 partials ⚠️
api/transformerlab/routers/compute_provider.py 19.04% 17 Missing ⚠️
api/transformerlab/services/provider_service.py 80.00% 2 Missing ⚠️

📢 Thoughts on this report? Let us know!

@deep1401 deep1401 marked this pull request as ready for review March 24, 2026 17:30
@aliasaria
Copy link
Copy Markdown
Member

Review Summary

This branch simplifies the installer by splitting the API's pyproject.toml into a lean base (for CI/multiuser) and a heavy localprovider_pyproject.toml (for local GPU provider), removes the transformers and fastchat dependencies from the core API, adds a dedicated conda install script for local providers, and wires up provider setup refresh in the backend and UI. The architecture is sound — separating the heavyweight ML deps from the API server is a good direction. The main concerns are a ripgrep (rg) dependency in a shell script that may not be installed on target systems, a silently swallowed exception in background setup, and file handle leaks.


Must Fix (blocks merge)

[Correctness] rg (ripgrep) used in local_provider_conda_install.sh without fallback — The script uses rg -Fq on line 38 and rg -iq on line 97, but ripgrep is not a standard utility and won't be present on many Linux servers. This will cause the conda install to fail on those machines.

api/local_provider_conda_install.sh:38 and :97:

# Current (broken on systems without rg):
if ! conda env list | awk '{print $1}' | rg -Fq "${ENV_DIR}"; then
# ...
elif [ -r /etc/dgx-release ] && rg -iq 'DGX Spark' /etc/dgx-release; then

Fix — use grep which is universally available:

if ! conda env list | awk '{print $1}' | grep -Fq "${ENV_DIR}"; then
# ...
elif [ -r /etc/dgx-release ] && grep -iq 'DGX Spark' /etc/dgx-release; then

[Correctness] File handle leak in launch_clusterstdout_log and stderr_log are opened but never closed on the happy path (api/transformerlab/compute_providers/local.py:474-475). After Popen is called, these file handles are leaked in the parent process. Since the subprocess has inherited them, the logs will still work, but the parent accumulates leaked FDs over time.

Fix:

# After proc = subprocess.Popen(...):
stdout_log.close()
stderr_log.close()

Or even better, close them in a finally block to handle setup failures too. (Note: the setup failure path at line 493 closes stderr_log but not stdout_log.)


Should Fix (important but not blocking)

[Correctness] Silently swallowed exception in background setupapi/transformerlab/services/provider_service.py:423-425: the background _run_local_setup_background catches Exception with a bare pass. If setup fails for a newly auto-created local provider, there's zero signal to anyone — no log, no notification. At minimum, log the error:

except Exception:
    import logging
    logging.getLogger(__name__).warning("Background local provider setup failed", exc_info=True)

[Correctness] CI workflow no longer installs torch — In .github/workflows/pytest.yml, the change goes from uv pip install --system .[cpu] --index=... to just uv pip install --system .. The base pyproject.toml no longer includes torch in its dependencies, so CI tests that import torch (or any code path that reaches it) will fail. Verify that all CI tests pass without torch, or add torch back as a test dependency.

[Security] trust_remote_code=True removed routes — The removed /model/chat_template and /data/preview_with_chat_template endpoints used AutoTokenizer.from_pretrained(..., trust_remote_code=True) with user-supplied model names. Good that these are gone — they were an RCE vector. Just flagging this as a positive security improvement worth noting in the PR description.


Consider Improving

  1. Hooks section in ProviderDetailsModal is missing from the rendered JSX — The component has state for hooksExpanded, preTaskHook, postTaskHook, etc., but I don't see the hooks UI section rendered in the JSX. If hooks were intentionally removed from the UI, clean up the dead state variables.

  2. local_provider_conda_install.sh doesn't clean up TMP_PROJECT_DIR on failure — If the uv pip install commands fail partway, the temp directory at line 69 is never cleaned up. Consider adding a trap for cleanup.

  3. Version drift between pyproject.toml and localprovider_pyproject.toml — Several packages appear in both files with different pinned versions (e.g., fastapi 0.115.12 vs 0.115.7, psutil 7.0.0 vs 6.1.1, sqlalchemy 2.0.40 vs 2.0.38). This could cause confusing behavior when the API server imports a different version than what the local provider installs. Consider documenting why or aligning them.


What's Working Well

  1. Clean separation of concerns — Splitting the pyproject into a lightweight API manifest and a heavyweight local-provider manifest is a solid architectural decision that dramatically speeds up CI and multiuser installs.

  2. Provider setup UX — The polling-based setup progress with log tailing in the ProviderDetailsModal gives users real visibility into what's happening during the slow conda install. Well-implemented with proper cleanup.

  3. Removal of transformers/fastchat from the API core — Eliminating these heavy transitive dependencies from the server startup path is a meaningful improvement. The endpoints that depended on them were thin wrappers.

  4. CI simplification — The workflows are much cleaner — no more conda activation dance in CI, just uv pip install --system .. The explicit webapp static files step is clearer than the implicit install.sh approach.

@deep1401
Copy link
Copy Markdown
Member Author

Review Summary

This branch simplifies the installer by splitting the API's pyproject.toml into a lean base (for CI/multiuser) and a heavy localprovider_pyproject.toml (for local GPU provider), removes the transformers and fastchat dependencies from the core API, adds a dedicated conda install script for local providers, and wires up provider setup refresh in the backend and UI. The architecture is sound — separating the heavyweight ML deps from the API server is a good direction. The main concerns are a ripgrep (rg) dependency in a shell script that may not be installed on target systems, a silently swallowed exception in background setup, and file handle leaks.

Must Fix (blocks merge)

[Correctness] rg (ripgrep) used in local_provider_conda_install.sh without fallback — The script uses rg -Fq on line 38 and rg -iq on line 97, but ripgrep is not a standard utility and won't be present on many Linux servers. This will cause the conda install to fail on those machines.

api/local_provider_conda_install.sh:38 and :97:

# Current (broken on systems without rg):
if ! conda env list | awk '{print $1}' | rg -Fq "${ENV_DIR}"; then
# ...
elif [ -r /etc/dgx-release ] && rg -iq 'DGX Spark' /etc/dgx-release; then

Fix — use grep which is universally available:

if ! conda env list | awk '{print $1}' | grep -Fq "${ENV_DIR}"; then
# ...
elif [ -r /etc/dgx-release ] && grep -iq 'DGX Spark' /etc/dgx-release; then

[Correctness] File handle leak in launch_clusterstdout_log and stderr_log are opened but never closed on the happy path (api/transformerlab/compute_providers/local.py:474-475). After Popen is called, these file handles are leaked in the parent process. Since the subprocess has inherited them, the logs will still work, but the parent accumulates leaked FDs over time.

Fix:

# After proc = subprocess.Popen(...):
stdout_log.close()
stderr_log.close()

Or even better, close them in a finally block to handle setup failures too. (Note: the setup failure path at line 493 closes stderr_log but not stdout_log.)

Should Fix (important but not blocking)

[Correctness] Silently swallowed exception in background setupapi/transformerlab/services/provider_service.py:423-425: the background _run_local_setup_background catches Exception with a bare pass. If setup fails for a newly auto-created local provider, there's zero signal to anyone — no log, no notification. At minimum, log the error:

except Exception:
    import logging
    logging.getLogger(__name__).warning("Background local provider setup failed", exc_info=True)

[Correctness] CI workflow no longer installs torch — In .github/workflows/pytest.yml, the change goes from uv pip install --system .[cpu] --index=... to just uv pip install --system .. The base pyproject.toml no longer includes torch in its dependencies, so CI tests that import torch (or any code path that reaches it) will fail. Verify that all CI tests pass without torch, or add torch back as a test dependency.

[Security] trust_remote_code=True removed routes — The removed /model/chat_template and /data/preview_with_chat_template endpoints used AutoTokenizer.from_pretrained(..., trust_remote_code=True) with user-supplied model names. Good that these are gone — they were an RCE vector. Just flagging this as a positive security improvement worth noting in the PR description.

Consider Improving

  1. Hooks section in ProviderDetailsModal is missing from the rendered JSX — The component has state for hooksExpanded, preTaskHook, postTaskHook, etc., but I don't see the hooks UI section rendered in the JSX. If hooks were intentionally removed from the UI, clean up the dead state variables.
  2. local_provider_conda_install.sh doesn't clean up TMP_PROJECT_DIR on failure — If the uv pip install commands fail partway, the temp directory at line 69 is never cleaned up. Consider adding a trap for cleanup.
  3. Version drift between pyproject.toml and localprovider_pyproject.toml — Several packages appear in both files with different pinned versions (e.g., fastapi 0.115.12 vs 0.115.7, psutil 7.0.0 vs 6.1.1, sqlalchemy 2.0.40 vs 2.0.38). This could cause confusing behavior when the API server imports a different version than what the local provider installs. Consider documenting why or aligning them.

What's Working Well

  1. Clean separation of concerns — Splitting the pyproject into a lightweight API manifest and a heavyweight local-provider manifest is a solid architectural decision that dramatically speeds up CI and multiuser installs.
  2. Provider setup UX — The polling-based setup progress with log tailing in the ProviderDetailsModal gives users real visibility into what's happening during the slow conda install. Well-implemented with proper cleanup.
  3. Removal of transformers/fastchat from the API core — Eliminating these heavy transitive dependencies from the server startup path is a meaningful improvement. The endpoints that depended on them were thin wrappers.
  4. CI simplification — The workflows are much cleaner — no more conda activation dance in CI, just uv pip install --system .. The explicit webapp static files step is clearer than the implicit install.sh approach.

fixed all of these

@deep1401 deep1401 merged commit 30076cd into main Mar 26, 2026
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants