README: point install issues to alpamayo_r1.healthcheck first #104
Open
lonexreb wants to merge 2 commits into NVlabs:main
The most common support questions on the issue tracker boil down to
"did I install this right?" -- CUDA visible? flash-attn compiled?
HuggingFace auth set up? PAI dataset reachable? -- and today the
answer only surfaces 30 seconds into a long inference run when something
breaks. This adds a fast smoke test that answers all of those without
loading the model.
src/alpamayo_r1/healthcheck.py exposes 8 independent checks and a
`python -m alpamayo_r1.healthcheck` entry point. The exit code is the
number of failures (0 on green, non-zero count for CI). Skipped checks
(no GPU on a laptop, intentional flash-attn skip per README) do NOT
count as failures.
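The exit-code rule described above can be sketched as follows. This is a minimal illustration, not the module's actual code: the status strings "ok", "fail", and "skip" and the helper name `exit_code_for` are assumptions made for the example.

```python
def exit_code_for(statuses):
    """Return the process exit code: one per failed check.

    The status strings "ok", "fail", and "skip" are assumed for this sketch.
    """
    return sum(1 for status in statuses if status == "fail")


# Statuses for a run with 2 passes, 4 failures, and 2 skips.
statuses = ["fail", "ok", "skip", "fail", "skip", "fail", "fail", "ok"]
code = exit_code_for(statuses)
```

In an entry point, a value like this would typically be handed to `sys.exit()`; skipped checks leave it unchanged, so a laptop without a GPU can still exit 0.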
Checks (ordered):
  1. python          -- 3.12.x as pyproject.toml requires
  2. torch           -- imports + version, warn if not 2.8.x
  3. cuda            -- device count, name, total memory, CUDA ver
  4. transformers    -- 4.57.1 (pinned); mismatches break inference
  5. flash_attn      -- importable; SKIP (not FAIL) when missing
                        since SDPA is a documented fallback
  6. physical_ai_av  -- the PAI dataset loader dependency
  7. hf_auth         -- whoami() succeeds (gated model + dataset)
  8. alpamayo_r1     -- the package itself imports
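Two of the checks in the list above could look roughly like this. It is a hedged sketch, not the code from healthcheck.py: the function names, the tuple return shape, and the status strings are assumptions, and a required dependency would report "fail" where this optional-module check reports "skip".

```python
import importlib
import sys


def check_python(version_info=sys.version_info):
    """Fail unless the interpreter is 3.12.x (function name is hypothetical)."""
    major, minor = version_info[0], version_info[1]
    if (major, minor) == (3, 12):
        return ("python", "ok", f"{major}.{minor}")
    return ("python", "fail", f"got {major}.{minor}, expected 3.12.x")


def check_optional_module(name):
    """Report "skip" rather than "fail" when an optional module is missing."""
    try:
        importlib.import_module(name)
    except ImportError as exc:
        return (name, "skip", f"not importable: {exc!r}")
    return (name, "ok", "importable")
```

Actually importing the module, rather than only locating it, also catches a package that is installed but fails at import time.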
Each check returns a CheckResult(name, status, detail) and is wrapped
in run(), so a bug or unexpected exception in one check is reported as a
fail and never aborts the rest.
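The isolation behaviour can be sketched like this. The field names follow the CheckResult(name, status, detail) shape given above; the frozen dataclass, the status strings, and the body of `run` are assumptions for illustration rather than the PR's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    name: str
    status: str  # "ok", "fail", or "skip" (assumed values)
    detail: str


def run(checks):
    """Call each check in order; turn any exception into a "fail" result."""
    results = []
    for check in checks:
        try:
            results.append(check())
        except Exception as exc:  # one broken check must not stop the rest
            name = getattr(check, "__name__", "unknown")
            results.append(CheckResult(name, "fail", f"check raised {exc!r}"))
    return results


def good_check():
    return CheckResult("good_check", "ok", "fine")


def broken_check():
    raise RuntimeError("boom")


results = run([broken_check, good_check])
```

Here `results` holds a "fail" entry for `broken_check` followed by the "ok" entry for `good_check`, so ordering is preserved even when a check raises.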
Output (this laptop, no GPU/HF/torch fully installed):
  [FAIL] python          got 3.10.0, but pyproject.toml requires ==3.12.*
  [ OK ] torch           torch==2.11.0 (note: pyproject.toml pins torch==2.8.0)
  [SKIP] cuda            torch.cuda.is_available() is False -- no GPU
  [FAIL] transformers    import failed: ModuleNotFoundError(...)
  [SKIP] flash_attn      not importable. Inference still works via SDPA.
  [FAIL] physical_ai_av  import failed: ... Run `uv sync --active`.
  [FAIL] hf_auth         Run `hf auth login` and accept the gated terms.
  [ OK ] alpamayo_r1     package importable
  Summary: 2 passed, 4 failed, 2 skipped (8 total).
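A report in this shape could be produced along these lines. The glyph table, the column width, and both function names are assumptions for the sketch; the module's own formatter may differ.

```python
from collections import Counter

# Assumed mapping from status string to the bracketed glyph in the report.
GLYPHS = {"ok": "[ OK ]", "fail": "[FAIL]", "skip": "[SKIP]"}


def format_line(name, status, detail, width=14):
    """Render one report line with the check name padded to a fixed width."""
    return f"{GLYPHS[status]} {name.ljust(width)} {detail}"


def summary_line(statuses):
    """Count statuses and render the closing summary line."""
    counts = Counter(statuses)  # a missing status counts as 0
    return (
        f"Summary: {counts['ok']} passed, {counts['fail']} failed, "
        f"{counts['skip']} skipped ({len(statuses)} total)."
    )


statuses = ["fail", "ok", "skip", "fail", "skip", "fail", "fail", "ok"]
line = format_line("cuda", "skip", "no GPU")
summary = summary_line(statuses)
```

With the eight statuses from the run above, `summary` reproduces the reported summary line.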
src/alpamayo_r1/test_healthcheck.py adds 7 pytest cases for the
runner: CheckResult immutability, _format_line glyphs/content, ALL_CHECKS
shape, run() exception isolation + ordering, main() exit code semantics
(failures count, skips do not, --quiet doesn't change exit code).
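The last of those cases, that --quiet changes the output but not the exit code, might be tested roughly as below. This is an illustrative sketch: the `main(argv, results)` signature, the tuple shape, and the output format are hypothetical, and the real tests may be structured differently.

```python
import argparse
import contextlib
import io


def main(argv, results):
    """Print results and return the failure count (hypothetical signature)."""
    parser = argparse.ArgumentParser(prog="healthcheck-sketch")
    parser.add_argument("--quiet", action="store_true", help="print only failures")
    args = parser.parse_args(argv)
    failures = 0
    for name, status, detail in results:
        if status == "fail":
            failures += 1
        elif args.quiet:
            continue  # --quiet hides passes and skips, never failures
        print(f"{status.upper()} {name}: {detail}")
    return failures


def test_quiet_does_not_change_exit_code():
    results = [
        ("python", "fail", "wrong version"),
        ("torch", "ok", "importable"),
        ("cuda", "skip", "no GPU"),
    ]
    loud, quiet = io.StringIO(), io.StringIO()
    with contextlib.redirect_stdout(loud):
        loud_code = main([], results)
    with contextlib.redirect_stdout(quiet):
        quiet_code = main(["--quiet"], results)
    assert loud_code == quiet_code == 1  # the skip is not counted
    assert len(loud.getvalue().splitlines()) == 3
    assert len(quiet.getvalue().splitlines()) == 1
```

Passing the results in as an argument keeps the test independent of what is installed on the machine running it.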
All 8 functional checks verified locally without GPU/HF dependencies --
output above is the actual run on this Mac.
Signed-off-by: lonexreb <[email protected]>
Adds a "First step: run the install smoke test" entry at the top of the
Troubleshooting section pointing users to
    python -m alpamayo_r1.healthcheck
before they open an issue, plus a one-paragraph description of what the
smoke test reports and the recommended workflow (paste the full output
into the issue body, add --quiet to print only failures).
Why: the healthcheck command was added in NVlabs#103 but is not yet
discoverable -- a user hitting "ModuleNotFoundError: physical_ai_av" or
a 401 on the gated dataset has no breadcrumb pointing to it. Surfacing
the command in the very first Troubleshooting entry closes that gap and
should reduce the round-trips on installation issues by giving
maintainers a structured diagnostic to ask for.
This is the follow-up README hook explicitly proposed in NVlabs#103's PR body.
Depends on NVlabs#103: this README references the alpamayo_r1.healthcheck
module added there. Merge order matters; if reviewers prefer a single
combined PR, happy to squash.
Signed-off-by: lonexreb <[email protected]>
Why

The healthcheck command added in #103 isn't discoverable. A user hitting `ModuleNotFoundError: physical_ai_av`, a 401 on the gated dataset, or a `flash-attn` build failure has no breadcrumb pointing to it. They open an issue, the maintainer asks "what does your install look like?", and round-trips ensue.

This adds a "First step: run the install smoke test" entry at the top of the Troubleshooting section so a user copy-pasting the README into their search engine of choice finds it before they open an issue. The body recommends pasting the full output into the issue body so reviewers don't have to ask follow-ups, and mentions `--quiet` for the "only show me what's broken" case.

What

A 14-line addition to the Troubleshooting section in `README.md`. No changes elsewhere.

It prints a one-line PASS / FAIL / SKIP for each of the eight common install pitfalls (Python version, torch + CUDA, `transformers` pin, `flash-attn`, `physical_ai_av`, HuggingFace auth, package import) along with the exact command needed to fix each failure. Paste the full output into your issue so reviewers don't have to ask follow-ups. Add `--quiet` to print only failures.

Flash Attention issues
...