Add python -m alpamayo_r1.healthcheck install smoke test #103

Open
lonexreb wants to merge 1 commit into NVlabs:main from lonexreb:feat/add-healthcheck-module
Conversation

Contributor

@lonexreb commented May 4, 2026

Why

Most of the support questions on the issue tracker boil down to "did I install this right?" — CUDA visible? flash-attn compiled? HuggingFace auth set up? PAI dataset reachable? transformers==4.57.1? — and today the answer only surfaces 30 seconds into a long inference run when something breaks.

This adds a fast smoke test that answers all of those without loading the model, so users can self-triage in seconds and so issue templates can ask for the output up front.

What

python -m alpamayo_r1.healthcheck

8 independent checks, ordered. Each returns a CheckResult(name, status, detail). Process exit code is the number of failures so CI can inspect it without parsing stdout. Skips (no GPU on a laptop, intentional flash-attn skip per README's SDPA fallback) do not count as failures.
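A minimal sketch of that shape (field names and statuses are from this PR; the `exit_code` helper below is a hypothetical stand-in for the logic inside `main()`):

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: results are immutable once produced
class CheckResult:
    name: str
    status: str  # "OK", "FAIL", or "SKIP"
    detail: str


def exit_code(results: list[CheckResult]) -> int:
    # The process exit code is the number of failures; SKIPs do not count,
    # so a laptop without a GPU can still exit 0.
    return sum(1 for r in results if r.status == "FAIL")
```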

| # | Check | Why it matters |
|---|-------|----------------|
| 1 | python | Repo declares `requires-python = "==3.12.*"` |
| 2 | torch | Imports + version (warn if not 2.8.x) |
| 3 | cuda | Device count, name, total memory, CUDA version |
| 4 | transformers | Pinned 4.57.1; mismatches break inference |
| 5 | flash_attn | Importable; SKIP (not FAIL) when missing — SDPA is documented |
| 6 | physical_ai_av | The PAI dataset loader dependency |
| 7 | hf_auth | `whoami()` succeeds (model + dataset are gated) |
| 8 | alpamayo_r1 | The package itself imports |
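For illustration, the flash_attn check might look roughly like this (a sketch, not the module's actual code; `CheckResult` is restated inline so the example is self-contained):

```python
import importlib.util
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    name: str
    status: str
    detail: str


def check_flash_attn() -> CheckResult:
    # Missing flash-attn is a SKIP, not a FAIL: the README documents an
    # SDPA fallback, so inference still works without it.
    if importlib.util.find_spec("flash_attn") is None:
        return CheckResult("flash_attn", "SKIP",
                           "not importable. Inference still works via SDPA.")
    return CheckResult("flash_attn", "OK", "importable")
```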

Each check is wrapped in run() so a buggy or unexpected exception in one check is reported as a fail and never aborts the rest.
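The isolation can be sketched as follows (again illustrative, with `CheckResult` restated so the block runs on its own):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    name: str
    status: str
    detail: str


def run(checks) -> list[CheckResult]:
    results = []
    for check in checks:
        try:
            results.append(check())
        except Exception as exc:
            # A buggy check becomes a FAIL entry; it never aborts the rest.
            results.append(CheckResult(check.__name__, "FAIL",
                                       f"check crashed: {exc!r}"))
    return results
```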

--quiet only prints failures (and the final summary line) so users can paste it into an issue.
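The filtering is simple; a toy version (`Result` and `visible_results` are hypothetical names for illustration, not the module's API):

```python
from collections import namedtuple

Result = namedtuple("Result", "name status detail")  # stand-in for CheckResult


def visible_results(results, quiet: bool):
    # With --quiet, only FAIL lines are echoed; the final summary line
    # is printed regardless so the paste is always self-describing.
    return [r for r in results if r.status == "FAIL"] if quiet else list(results)
```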

Live output (this laptop, no GPU / partial install)

```
[FAIL] python           got 3.10.0, but pyproject.toml requires ==3.12.*
[ OK ] torch            torch==2.11.0 (note: pyproject.toml pins torch==2.8.0)
[SKIP] cuda             torch.cuda.is_available() is False -- no GPU visible
[FAIL] transformers     import failed: ModuleNotFoundError(...)
[SKIP] flash_attn       not importable. Inference still works via SDPA.
[FAIL] physical_ai_av   import failed: ... Run `uv sync --active` ...
[FAIL] hf_auth          ... Run `hf auth login` and accept the gated terms ...
[ OK ] alpamayo_r1      package importable
```

Summary: 2 passed, 4 failed, 2 skipped (8 total).

Every detail string is copy-pasteable into an issue and includes the recommended remediation (uv sync --active, hf auth login, or the SDPA fallback for flash_attn).

Tests

src/alpamayo_r1/test_healthcheck.py — 9 pytest cases. Verified locally:

```
PASS: CheckResult is frozen
PASS: _format_line shows correct glyphs
PASS: ALL_CHECKS has 8 callables
PASS: run() catches per-check exceptions
PASS: run() preserves order
PASS: main() returns 1 with one failure; summary correct
PASS: --quiet hides PASSes
PASS: exit 0 when only passes/skips
PASS: real run() returns 8 valid CheckResults
```

The runner deliberately reads ALL_CHECKS at call time (not as a default-arg snapshot) so tests can monkey-patch the module attribute.
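The distinction matters because a default argument snapshots the object at function-definition time. A standalone illustration of the pitfall (toy check callables, not the module's real ones):

```python
ALL_CHECKS = [lambda: "real"]


def run_snapshot(checks=ALL_CHECKS):
    # BAD: `checks` was bound to the original list when the function was
    # defined, so rebinding the module attribute later has no effect here.
    return [check() for check in checks]


def run_live():
    # GOOD: looks up ALL_CHECKS on every call, so a test that
    # monkey-patches the module attribute is honored.
    return [check() for check in ALL_CHECKS]


ALL_CHECKS = [lambda: "patched"]  # what monkeypatch.setattr effectively does
```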

Migration

None — new module, new file, new entry point. No existing code paths touched.

Suggested README hook (separate PR if welcome)

Add to Troubleshooting:

Before opening an install issue, please run:

python -m alpamayo_r1.healthcheck

and paste the output.

Happy to send that as a follow-up if this lands.
