Add python -m alpamayo_r1.healthcheck install smoke test #103
lonexreb wants to merge 1 commit into NVlabs:main
The most common support questions on the issue tracker boil down to
"did I install this right?" -- CUDA visible? flash-attn compiled?
HuggingFace auth set up? PAI dataset reachable? -- and today the
answer only surfaces 30 seconds into a long inference run when something
breaks. This adds a fast smoke test that answers all of those without
loading the model.
src/alpamayo_r1/healthcheck.py exposes 8 independent checks and a
`python -m alpamayo_r1.healthcheck` entry point. The exit code is the
number of failures (0 on green, non-zero count for CI). Skipped checks
(no GPU on a laptop, intentional flash-attn skip per README) do NOT
count as failures.
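The exit-code contract can be sketched like this (a minimal sketch; `exit_code` is a hypothetical helper name, not the function in this PR):

```python
def exit_code(results: list[tuple[str, str, str]]) -> int:
    # results holds (name, status, detail) triples; only "fail"
    # contributes to the exit code, so skipped checks stay green.
    return sum(1 for _name, status, _detail in results if status == "fail")

results = [
    ("cuda", "skip", "no GPU"),
    ("transformers", "fail", "import failed"),
    ("alpamayo_r1", "ok", "package importable"),
]
print(exit_code(results))  # 1
```

CI can then branch on the process exit code directly instead of parsing stdout.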
Checks (ordered):
1. python -- 3.12.x as pyproject.toml requires
2. torch -- imports + version, warn if not 2.8.x
3. cuda -- device count, name, total memory, CUDA ver
4. transformers -- 4.57.1 (pinned); mismatches break inference
5. flash_attn -- importable; SKIP (not FAIL) when missing
since SDPA is a documented fallback
6. physical_ai_av -- the PAI dataset loader dependency
7. hf_auth -- whoami() succeeds (gated model + dataset)
8. alpamayo_r1 -- the package itself imports
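For illustration, check 1 might look roughly like this (a sketch; the real function name in healthcheck.py may differ, and it is shown returning a plain (name, status, detail) triple for brevity):

```python
import sys

def check_python() -> tuple[str, str, str]:
    # FAIL unless the interpreter matches pyproject.toml's ==3.12.*
    got = ".".join(str(v) for v in sys.version_info[:3])
    if sys.version_info[:2] == (3, 12):
        return ("python", "ok", f"python {got}")
    return ("python", "fail",
            f"got {got}, but pyproject.toml requires ==3.12.*")
```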
Each check returns a CheckResult(name, status, detail) and is wrapped
in run() so a buggy/unexpected exception in one check is reported as a
fail and never aborts the rest.
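That wrapping could be sketched as follows (assumed shape; the real `run()` signature may differ):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)  # immutable result records
class CheckResult:
    name: str
    status: str  # "ok" | "fail" | "skip"
    detail: str

def run(checks: list[tuple[str, Callable[[], CheckResult]]]) -> list[CheckResult]:
    results = []
    for name, check in checks:
        try:
            results.append(check())
        except Exception as exc:
            # A buggy check becomes a FAIL result instead of
            # aborting the remaining checks.
            results.append(CheckResult(name, "fail", f"check raised {exc!r}"))
    return results
```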
Output (this laptop: no GPU, partial install):
[FAIL] python got 3.10.0, but pyproject.toml requires ==3.12.*
[ OK ] torch torch==2.11.0 (note: pyproject.toml pins torch==2.8.0)
[SKIP] cuda torch.cuda.is_available() is False -- no GPU
[FAIL] transformers import failed: ModuleNotFoundError(...)
[SKIP] flash_attn not importable. Inference still works via SDPA.
[FAIL] physical_ai_av import failed: ... Run `uv sync --active`.
[FAIL] hf_auth Run `hf auth login` and accept the gated terms.
[ OK ] alpamayo_r1 package importable
Summary: 2 passed, 4 failed, 2 skipped (8 total).
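The remediation hints in the FAIL lines above could come from a small helper along these lines (illustrative; `check_import` and its exact wording are assumptions, not the code in this PR):

```python
import importlib

def check_import(module: str, remedy: str) -> tuple[str, str, str]:
    # On failure the detail embeds the suggested fix, so the printed
    # line is copy-pasteable into an issue as-is.
    try:
        importlib.import_module(module)
        return (module, "ok", f"{module} importable")
    except ImportError as exc:
        return (module, "fail", f"import failed: {exc!r}. Run `{remedy}`.")
```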
src/alpamayo_r1/test_healthcheck.py adds 7 pytest cases for the
runner: CheckResult immutability, _format_line glyphs/content, ALL_CHECKS
shape, run() exception isolation + ordering, main() exit code semantics
(failures count, skips do not, --quiet doesn't change exit code).
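The exit-code case, for example, might be written roughly like this (self-contained sketch with a stub `main`; the real test targets healthcheck's entry point):

```python
def main(results, quiet=False) -> int:
    # Stub with the same contract as the real main(): print result
    # lines (only failures under --quiet) and return the failure count.
    failures = 0
    for name, status, detail in results:
        if status == "fail":
            failures += 1
        if not quiet or status == "fail":
            print(f"[{status.upper():>4}] {name} {detail}")
    return failures

def test_exit_code_counts_failures_not_skips():
    results = [("cuda", "skip", "no GPU"),
               ("hf_auth", "fail", "run `hf auth login`")]
    assert main(results) == 1
    # --quiet changes what is printed, never the exit code.
    assert main(results, quiet=True) == 1

test_exit_code_counts_failures_not_skips()
```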
All 8 functional checks verified locally without GPU/HF dependencies --
output above is the actual run on this Mac.
Signed-off-by: lonexreb <[email protected]>
Why
Most of the support questions on the issue tracker boil down to "did I install this right?" -- CUDA visible? `flash-attn` compiled? HuggingFace auth set up? PAI dataset reachable? `transformers==4.57.1`? -- and today the answer only surfaces 30 seconds into a long inference run when something breaks. This adds a fast smoke test that answers all of those without loading the model, so users can self-triage in seconds and so issue templates can ask for the output up front.
What
`python -m alpamayo_r1.healthcheck` runs 8 independent checks, ordered. Each returns a `CheckResult(name, status, detail)`. Process exit code is the number of failures so CI can inspect it without parsing stdout. Skips (no GPU on a laptop, intentional `flash-attn` skip per README's SDPA fallback) do not count as failures.

1. `python` -- pyproject.toml sets `requires-python = "==3.12.*"`
2. `torch` -- imports + version (warn if not 2.8.x)
3. `cuda` -- device count, name, total memory, CUDA version
4. `transformers` -- 4.57.1 (pinned); mismatches break inference
5. `flash_attn` -- importable; SKIP (not FAIL) when missing
6. `physical_ai_av` -- the PAI dataset loader dependency
7. `hf_auth` -- `whoami()` succeeds (model + dataset are gated)
8. `alpamayo_r1` -- the package itself imports

Each check is wrapped in `run()` so a buggy or unexpected exception in one check is reported as a fail and never aborts the rest. `--quiet` only prints failures (and the final summary line) so users can paste it into an issue.

Live output (this laptop, no GPU / partial install)
Every detail string is copy-pasteable into an issue and includes the recommended remediation (`uv sync --active`, `hf auth login`, or the SDPA fallback for `flash_attn`).

Tests
`src/alpamayo_r1/test_healthcheck.py` -- 7 pytest cases. Verified locally. The runner deliberately reads `ALL_CHECKS` at call time (not as a default-arg snapshot) so tests can monkey-patch the module attribute.

Migration
None — new module, new file, new entry point. No existing code paths touched.
Suggested README hook (separate PR if welcome)
Add to Troubleshooting:
Happy to send that as a follow-up if this lands.