Add python -m alpamayo_r1.healthcheck install smoke test #103

Open
lonexreb wants to merge 1 commit into NVlabs:main from lonexreb:feat/add-healthcheck-module
Conversation

Contributor

@lonexreb commented May 4, 2026

Why

Most of the support questions on the issue tracker boil down to "did I install this right?" — CUDA visible? flash-attn compiled? HuggingFace auth set up? PAI dataset reachable? transformers==4.57.1? — and today the answer only surfaces 30 seconds into a long inference run when something breaks.

This adds a fast smoke test that answers all of those without loading the model, so users can self-triage in seconds and so issue templates can ask for the output up front.

What

python -m alpamayo_r1.healthcheck

8 independent checks, ordered. Each returns a CheckResult(name, status, detail). Process exit code is the number of failures so CI can inspect it without parsing stdout. Skips (no GPU on a laptop, intentional flash-attn skip per README's SDPA fallback) do not count as failures.
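A minimal sketch of that shape (field names and statuses are from this PR; the `exit_code` helper below is a hypothetical stand-in for the logic inside `main()`):

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: results are immutable once produced
class CheckResult:
    name: str
    status: str  # "OK", "FAIL", or "SKIP"
    detail: str


def exit_code(results: list[CheckResult]) -> int:
    # The process exit code is the number of failures; SKIPs do not count,
    # so a laptop without a GPU can still exit 0.
    return sum(1 for r in results if r.status == "FAIL")
```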

| # | Check | Why it matters |
|---|-------|----------------|
| 1 | python | Repo declares `requires-python = "==3.12.*"` |
| 2 | torch | Imports + version (warn if not 2.8.x) |
| 3 | cuda | Device count, name, total memory, CUDA version |
| 4 | transformers | Pinned 4.57.1; mismatches break inference |
| 5 | flash_attn | Importable; SKIP (not FAIL) when missing — SDPA is documented |
| 6 | physical_ai_av | The PAI dataset loader dependency |
| 7 | hf_auth | `whoami()` succeeds (model + dataset are gated) |
| 8 | alpamayo_r1 | The package itself imports |
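For illustration, the flash_attn check might look roughly like this (a sketch, not the module's actual code; `CheckResult` is restated inline so the example is self-contained):

```python
import importlib.util
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    name: str
    status: str
    detail: str


def check_flash_attn() -> CheckResult:
    # Missing flash-attn is a SKIP, not a FAIL: the README documents an
    # SDPA fallback, so inference still works without it.
    if importlib.util.find_spec("flash_attn") is None:
        return CheckResult("flash_attn", "SKIP",
                           "not importable. Inference still works via SDPA.")
    return CheckResult("flash_attn", "OK", "importable")
```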

Each check is wrapped in run() so a buggy or unexpected exception in one check is reported as a fail and never aborts the rest.
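The isolation can be sketched as follows (again illustrative, with `CheckResult` restated so the block runs on its own):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    name: str
    status: str
    detail: str


def run(checks) -> list[CheckResult]:
    results = []
    for check in checks:
        try:
            results.append(check())
        except Exception as exc:
            # A buggy check becomes a FAIL entry; it never aborts the rest.
            results.append(CheckResult(check.__name__, "FAIL",
                                       f"check crashed: {exc!r}"))
    return results
```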

--quiet only prints failures (and the final summary line) so users can paste it into an issue.
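The filtering is simple; a toy version (`Result` and `visible_results` are hypothetical names for illustration, not the module's API):

```python
from collections import namedtuple

Result = namedtuple("Result", "name status detail")  # stand-in for CheckResult


def visible_results(results, quiet: bool):
    # With --quiet, only FAIL lines are echoed; the final summary line
    # is printed regardless so the paste is always self-describing.
    return [r for r in results if r.status == "FAIL"] if quiet else list(results)
```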

Live output (this laptop, no GPU / partial install)

```
[FAIL] python           got 3.10.0, but pyproject.toml requires ==3.12.*
[ OK ] torch            torch==2.11.0 (note: pyproject.toml pins torch==2.8.0)
[SKIP] cuda             torch.cuda.is_available() is False -- no GPU visible
[FAIL] transformers     import failed: ModuleNotFoundError(...)
[SKIP] flash_attn       not importable. Inference still works via SDPA.
[FAIL] physical_ai_av   import failed: ... Run `uv sync --active` ...
[FAIL] hf_auth          ... Run `hf auth login` and accept the gated terms ...
[ OK ] alpamayo_r1      package importable
```

Summary: 2 passed, 4 failed, 2 skipped (8 total).

Every detail string is copy-pasteable into an issue and includes the recommended remediation (uv sync --active, hf auth login, or the SDPA fallback for flash_attn).

Tests

src/alpamayo_r1/test_healthcheck.py — 9 pytest cases. Verified locally:

```
PASS: CheckResult is frozen
PASS: _format_line shows correct glyphs
PASS: ALL_CHECKS has 8 callables
PASS: run() catches per-check exceptions
PASS: run() preserves order
PASS: main() returns 1 with one failure; summary correct
PASS: --quiet hides PASSes
PASS: exit 0 when only passes/skips
PASS: real run() returns 8 valid CheckResults
```

The runner deliberately reads ALL_CHECKS at call time (not as a default-arg snapshot) so tests can monkey-patch the module attribute.
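The distinction matters because a default argument snapshots the object at function-definition time. A standalone illustration of the pitfall (toy check callables, not the module's real ones):

```python
ALL_CHECKS = [lambda: "real"]


def run_snapshot(checks=ALL_CHECKS):
    # BAD: `checks` was bound to the original list when the function was
    # defined, so rebinding the module attribute later has no effect here.
    return [check() for check in checks]


def run_live():
    # GOOD: looks up ALL_CHECKS on every call, so a test that
    # monkey-patches the module attribute is honored.
    return [check() for check in ALL_CHECKS]


ALL_CHECKS = [lambda: "patched"]  # what monkeypatch.setattr effectively does
```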

Migration

None — new module, new file, new entry point. No existing code paths touched.

Suggested README hook (separate PR if welcome)

Add to Troubleshooting:

Before opening an install issue, please run:

python -m alpamayo_r1.healthcheck

and paste the output.

Happy to send that as a follow-up if this lands.
