README: point install issues to alpamayo_r1.healthcheck first #104

Open
lonexreb wants to merge 2 commits into NVlabs:main from lonexreb:docs/readme-healthcheck-troubleshooting

lonexreb commented May 5, 2026

Why

The healthcheck command added in #103 isn't discoverable. A user hitting ModuleNotFoundError: physical_ai_av, a 401 on the gated dataset, or a flash-attn build failure has no breadcrumb pointing to it. They open an issue, the maintainer asks "what does your install look like?", round-trips ensue.

This adds a "First step: run the install smoke test" entry at the top of the Troubleshooting section so a user copy-pasting the README into their search engine of choice finds it before they open an issue. The body recommends pasting the full output into the issue body so reviewers don't have to ask follow-ups, and mentions --quiet for the "only show me what's broken" case.

What

A 14-line addition to the Troubleshooting section in README.md. No changes elsewhere.

## Troubleshooting

### First step: run the install smoke test

Before opening an install issue, run:

```bash
python -m alpamayo_r1.healthcheck
```

It prints a one-line PASS / FAIL / SKIP for each of the eight common
install pitfalls (Python version, torch, CUDA, transformers pin,
flash-attn, physical_ai_av, HuggingFace auth, package import) along
with the exact command needed to fix each failure. Paste the full
output into your issue so reviewers don't have to ask follow-ups.
Add --quiet to print only failures.

### Flash Attention issues

...


### Depends on #103
This PR references the `alpamayo_r1.healthcheck` module added in #103, so merge order matters: #103 has to land first. If you prefer one combined PR I'm happy to squash. I kept them separate so the README change can be reviewed independently and so #103 can land standalone if you decide the README hook is unnecessary.

### Why "first step"?
The two next-most-common Troubleshooting entries (Flash Attention and CUDA OOM) both come *after* the kind of failures the healthcheck surfaces (wrong Python version, bad transformers pin, missing HF auth), so users hit the healthcheck-class errors first chronologically. Putting healthcheck at the top of the section reflects the actual install order.

lonexreb added 2 commits May 4, 2026 09:46
The most common support questions on the issue tracker boil down to
"did I install this right?" -- CUDA visible? flash-attn compiled?
HuggingFace auth set up? PAI dataset reachable? -- and today the
answer only surfaces 30 seconds into a long inference run when something
breaks. This adds a fast smoke test that answers all of those without
loading the model.

src/alpamayo_r1/healthcheck.py exposes 8 independent checks and a
`python -m alpamayo_r1.healthcheck` entry point. The exit code is the
number of failures (0 on green, non-zero count for CI). Skipped checks
(no GPU on a laptop, intentional flash-attn skip per README) do NOT
count as failures.
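
The exit-code rule above can be sketched as follows. This is a minimal illustration, not the code from #103: the `exit_code` helper and the status strings are assumptions chosen to mirror the `[ OK ]` / `[FAIL]` / `[SKIP]` labels in the sample output.

```python
# Hypothetical helper: the exit code is the number of FAIL results.
# SKIP results are deliberately ignored, so a laptop without a GPU
# can still exit 0.
def exit_code(statuses):
    return sum(1 for status in statuses if status == "FAIL")


# Statuses taken, in order, from the sample output in this commit message.
statuses = ["FAIL", "OK", "SKIP", "FAIL", "SKIP", "FAIL", "FAIL", "OK"]
print(exit_code(statuses))  # 4
```

Because the code is a failure count rather than a 0/1 flag, a CI job should treat any non-zero value as a failed run.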

Checks (ordered):
1. python                -- 3.12.x as pyproject.toml requires
2. torch                 -- imports + version, warn if not 2.8.x
3. cuda                  -- device count, name, total memory, CUDA ver
4. transformers          -- 4.57.1 (pinned); mismatches break inference
5. flash_attn            -- importable; SKIP (not FAIL) when missing
                            since SDPA is a documented fallback
6. physical_ai_av        -- the PAI dataset loader dependency
7. hf_auth               -- whoami() succeeds (gated model + dataset)
8. alpamayo_r1           -- the package itself imports
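
The SKIP-versus-FAIL distinction for optional dependencies (check 5) can be sketched like this. The `check_module` function and its `optional` parameter are hypothetical names for illustration, and the sketch only looks the module up with `importlib.util.find_spec` rather than importing it, which may differ from what the real check does.

```python
import importlib.util
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    name: str
    status: str
    detail: str


def check_module(name, optional=False):
    """Look up a top-level module; a missing optional module is SKIP, not FAIL."""
    # find_spec returns None when a top-level module cannot be located.
    if importlib.util.find_spec(name) is not None:
        return CheckResult(name, "OK", "module found")
    status = "SKIP" if optional else "FAIL"
    return CheckResult(name, status, "module not found")


# A stdlib module is always found; the made-up names below are not.
print(check_module("json").status)
print(check_module("no_such_module_xyz", optional=True).status)
print(check_module("no_such_module_xyz").status)
```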

Each check returns a CheckResult(name, status, detail) and is wrapped
in run() so a buggy/unexpected exception in one check is reported as a
fail and never aborts the rest.
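
A minimal sketch of that isolation pattern, assuming each check is a zero-argument callable. Only the `CheckResult(name, status, detail)` shape and the `run()` name come from this commit message; the body, the status strings, and the two demo checks are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen, so a result cannot be mutated after the fact
class CheckResult:
    name: str
    status: str
    detail: str


def run(checks):
    """Run every check in order; report an unexpected exception as a FAIL."""
    results = []
    for check in checks:
        try:
            results.append(check())
        except Exception as exc:  # broad on purpose: one bad check must not stop the rest
            results.append(CheckResult(check.__name__, "FAIL", f"check raised {exc!r}"))
    return results


def check_buggy():
    raise RuntimeError("boom")


def check_good():
    return CheckResult("good", "OK", "fine")


results = run([check_buggy, check_good])
print([(r.name, r.status) for r in results])
# [('check_buggy', 'FAIL'), ('good', 'OK')]
```

The buggy check comes first in the demo to show that later checks still run and that result order follows check order.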

Output (this laptop, no GPU/HF/torch fully installed):

    [FAIL] python        got 3.10.0, but pyproject.toml requires ==3.12.*
    [ OK ] torch         torch==2.11.0 (note: pyproject.toml pins torch==2.8.0)
    [SKIP] cuda          torch.cuda.is_available() is False -- no GPU
    [FAIL] transformers  import failed: ModuleNotFoundError(...)
    [SKIP] flash_attn    not importable. Inference still works via SDPA.
    [FAIL] physical_ai_av import failed: ... Run `uv sync --active`.
    [FAIL] hf_auth       Run `hf auth login` and accept the gated terms.
    [ OK ] alpamayo_r1   package importable

Summary: 2 passed, 4 failed, 2 skipped (8 total).
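
The line layout and summary above could be produced by something like the sketch below. The helper names `format_line` and `summary`, the `GLYPHS` table, and the 13-character name column are assumptions inferred from the sample output, not the actual `_format_line` implementation.

```python
from collections import Counter

# Assumed status-to-glyph mapping, inferred from the sample output.
GLYPHS = {"OK": "[ OK ]", "FAIL": "[FAIL]", "SKIP": "[SKIP]"}


def format_line(name, status, detail):
    """Render one result line: glyph, left-aligned name column, then detail."""
    # Names longer than 13 characters (e.g. physical_ai_av) simply overflow.
    return f"{GLYPHS[status]} {name:<13} {detail}"


def summary(statuses):
    """Render the closing tally line from a list of status strings."""
    counts = Counter(statuses)
    return (
        f"Summary: {counts['OK']} passed, {counts['FAIL']} failed, "
        f"{counts['SKIP']} skipped ({len(statuses)} total)."
    )


print(format_line("python", "FAIL", "got 3.10.0"))
print(summary(["FAIL", "OK", "SKIP", "FAIL", "SKIP", "FAIL", "FAIL", "OK"]))
```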

src/alpamayo_r1/test_healthcheck.py adds 7 pytest cases for the
runner: CheckResult immutability, _format_line glyphs/content, ALL_CHECKS
shape, run() exception isolation + ordering, main() exit code semantics
(failures count, skips do not, --quiet doesn't change exit code).

All 8 functional checks verified locally without GPU/HF dependencies --
output above is the actual run on this Mac.

Signed-off-by: lonexreb <[email protected]>

Adds a "First step: run the install smoke test" entry at the top of the
Troubleshooting section pointing users to

    python -m alpamayo_r1.healthcheck

before they open an issue, plus a one-paragraph description of what the
smoke test reports and the recommended workflow (paste the full output
into the issue body, add --quiet to print only failures).

Why: the healthcheck command was added in NVlabs#103 but is not yet
discoverable -- a user hitting "ModuleNotFoundError: physical_ai_av" or
a 401 on the gated dataset has no breadcrumb pointing to it. Surfacing
the command in the very first Troubleshooting entry closes that gap and
should reduce the round-trips on installation issues by giving
maintainers a structured diagnostic to ask for.

This is the follow-up README hook explicitly proposed in NVlabs#103's PR body.

Depends on NVlabs#103: this README references the alpamayo_r1.healthcheck
module added there. Merge order matters; if reviewers prefer a single
combined PR, happy to squash.

Signed-off-by: lonexreb <[email protected]>