README: point install issues to alpamayo_r1.healthcheck first by lonexreb · Pull Request #104 · NVlabs/alpamayo

lonexreb · 2026-05-05T18:43:28Z

Why

The healthcheck command added in #103 isn't discoverable. A user hitting ModuleNotFoundError: physical_ai_av, a 401 on the gated dataset, or a flash-attn build failure has no breadcrumb pointing to it. They open an issue, the maintainer asks "what does your install look like?", round-trips ensue.

This adds a "First step: run the install smoke test" entry at the top of the Troubleshooting section so a user copy-pasting the README into their search engine of choice finds it before they open an issue. The body recommends pasting the full output into the issue body so reviewers don't have to ask follow-ups, and mentions --quiet for the "only show me what's broken" case.

What

A 14-line addition to the Troubleshooting section in README.md. No changes elsewhere.

## Troubleshooting

### First step: run the install smoke test

Before opening an install issue, run:

```bash
python -m alpamayo_r1.healthcheck
```

It prints a one-line OK / FAIL / SKIP for each of the eight common
install pitfalls (Python version, torch, CUDA, transformers pin,
flash-attn, physical_ai_av, HuggingFace auth, package import) along
with the exact command needed to fix each failure. Paste the full
output into your issue so reviewers don't have to ask follow-ups.
Add `--quiet` to print only failures.

### Flash Attention issues

...


### Depends on #103
This PR references the `alpamayo_r1.healthcheck` module added in #103. Merge order matters — if you prefer one combined PR I'm happy to squash, but I kept them separate so the README change can be reviewed independently and so #103 can land standalone if you decide the README hook is unnecessary.

### Why "first step"?
The two next-most-common Troubleshooting entries (Flash Attention and CUDA OOM) both come *after* the kind of failures the healthcheck surfaces — wrong Python version, bad transformers pin, missing HF auth — so users hit those errors first chronologically. Putting healthcheck at the top of the section reflects the actual install order.

The most common support questions on the issue tracker boil down to "did I install this right?" -- CUDA visible? flash-attn compiled? HuggingFace auth set up? PAI dataset reachable? -- and today the answer only surfaces 30 seconds into a long inference run when something breaks. This adds a fast smoke test that answers all of those without loading the model.

src/alpamayo_r1/healthcheck.py exposes 8 independent checks and a `python -m alpamayo_r1.healthcheck` entry point. The exit code is the number of failures (0 on green, non-zero count for CI). Skipped checks (no GPU on a laptop, intentional flash-attn skip per README) do NOT count as failures.

Checks (ordered):

1. python -- 3.12.x as pyproject.toml requires
2. torch -- imports + version, warn if not 2.8.x
3. cuda -- device count, name, total memory, CUDA ver
4. transformers -- 4.57.1 (pinned); mismatches break inference
5. flash_attn -- importable; SKIP (not FAIL) when missing since SDPA is a documented fallback
6. physical_ai_av -- the PAI dataset loader dependency
7. hf_auth -- whoami() succeeds (gated model + dataset)
8. alpamayo_r1 -- the package itself imports

Each check returns a CheckResult(name, status, detail) and is wrapped in run() so a buggy/unexpected exception in one check is reported as a fail and never aborts the rest.

Output (this laptop, no GPU/HF/torch fully installed):

    [FAIL] python          got 3.10.0, but pyproject.toml requires ==3.12.*
    [ OK ] torch           torch==2.11.0 (note: pyproject.toml pins torch==2.8.0)
    [SKIP] cuda            torch.cuda.is_available() is False -- no GPU
    [FAIL] transformers    import failed: ModuleNotFoundError(...)
    [SKIP] flash_attn      not importable. Inference still works via SDPA.
    [FAIL] physical_ai_av  import failed: ... Run `uv sync --active`.
    [FAIL] hf_auth         Run `hf auth login` and accept the gated terms.
    [ OK ] alpamayo_r1     package importable

    Summary: 2 passed, 4 failed, 2 skipped (8 total).
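The runner behaviour described above (one result per check, exception isolation in run(), exit code equal to the failure count, skips ignored) could look roughly like the sketch below. This is an illustration only, not the code in src/alpamayo_r1/healthcheck.py: the frozen dataclass, the `run_all` and `exit_code` helpers, the `quiet` parameter, the status strings, and the demo checks are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class CheckResult:
    name: str
    status: str  # assumed values: "ok", "fail", "skip"
    detail: str


# A check is a (name, zero-argument callable) pair; this shape is an assumption.
Check = Tuple[str, Callable[[], CheckResult]]


def run(name: str, check: Callable[[], CheckResult]) -> CheckResult:
    """Run one check; any unexpected exception becomes a 'fail' result."""
    try:
        return check()
    except Exception as exc:  # deliberately broad so one bad check cannot abort the rest
        return CheckResult(name, "fail", f"unexpected error: {exc!r}")


def run_all(checks: List[Check]) -> List[CheckResult]:
    """Run every check in order, preserving the input ordering."""
    return [run(name, check) for name, check in checks]


def exit_code(results: List[CheckResult]) -> int:
    """Count failures only; 'skip' and 'ok' do not contribute."""
    return sum(1 for r in results if r.status == "fail")


def main(checks: List[Check], quiet: bool = False) -> int:
    """Print one line per result (failures only when quiet) and return the exit code."""
    results = run_all(checks)
    for r in results:
        if quiet and r.status != "fail":
            continue
        print(f"[{r.status.upper():^4}] {r.name:<16}{r.detail}")
    return exit_code(results)


# Hypothetical demo checks, standing in for the real ones.
def _python_ok() -> CheckResult:
    return CheckResult("python", "ok", "3.12.1")


def _cuda_skip() -> CheckResult:
    return CheckResult("cuda", "skip", "no GPU")


def _broken() -> CheckResult:
    raise RuntimeError("boom")


DEMO_CHECKS: List[Check] = [
    ("python", _python_ok),
    ("cuda", _cuda_skip),
    ("broken", _broken),
]
```

Calling `main(DEMO_CHECKS)` prints three lines and returns 1: the check that raised is reported as a failure, the skipped check is not counted, and passing `quiet=True` changes what is printed but not the return value.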
src/alpamayo_r1/test_healthcheck.py adds 7 pytest cases for the runner: CheckResult immutability, _format_line glyphs/content, ALL_CHECKS shape, run() exception isolation + ordering, main() exit code semantics (failures count, skips do not, --quiet doesn't change exit code).

All 8 functional checks verified locally without GPU/HF dependencies -- output above is the actual run on this Mac.

Signed-off-by: lonexreb <[email protected]>
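Two of the test cases listed in the commit message above, CheckResult immutability and the line formatter's glyphs, might be written along these lines. This is a sketch under assumptions rather than the contents of test_healthcheck.py: it assumes CheckResult is a frozen dataclass, and `format_line` and the glyph table are hypothetical stand-ins for the private `_format_line` helper, whose real signature is not shown in the PR.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class CheckResult:
    name: str
    status: str  # assumed values: "ok", "fail", "skip"
    detail: str


# Glyphs copied from the sample output in the commit message.
_GLYPHS = {"ok": "[ OK ]", "fail": "[FAIL]", "skip": "[SKIP]"}


def format_line(result: CheckResult) -> str:
    """Hypothetical formatter: glyph, padded check name, then the detail text."""
    return f"{_GLYPHS[result.status]} {result.name:<16}{result.detail}"


def test_checkresult_is_immutable() -> None:
    result = CheckResult("python", "ok", "3.12.1")
    try:
        result.status = "fail"  # frozen dataclasses reject attribute assignment
    except FrozenInstanceError:
        return
    raise AssertionError("expected FrozenInstanceError")


def test_format_line_glyph_and_content() -> None:
    line = format_line(CheckResult("cuda", "skip", "no GPU"))
    assert line.startswith("[SKIP]")
    assert "cuda" in line
    assert line.endswith("no GPU")
```

Both functions follow the pytest naming convention, so pytest would collect them as written, and they can also be called directly because they rely only on the standard library.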

Adds a "First step: run the install smoke test" entry at the top of the Troubleshooting section pointing users to python -m alpamayo_r1.healthcheck before they open an issue, plus a one-paragraph description of what the smoke test reports and the recommended workflow (paste the full output into the issue body, add --quiet to print only failures).

Why: the healthcheck command was added in NVlabs#103 but is not yet discoverable -- a user hitting "ModuleNotFoundError: physical_ai_av" or a 401 on the gated dataset has no breadcrumb pointing to it. Surfacing the command in the very first Troubleshooting entry closes that gap and should reduce the round-trips on installation issues by giving maintainers a structured diagnostic to ask for. This is the follow-up README hook explicitly proposed in NVlabs#103's PR body.

Depends on NVlabs#103: this README references the alpamayo_r1.healthcheck module added there. Merge order matters; if reviewers prefer a single combined PR, happy to squash.

Signed-off-by: lonexreb <[email protected]>

lonexreb added 2 commits May 4, 2026 09:46

lonexreb wants to merge 2 commits into NVlabs:main from lonexreb:docs/readme-healthcheck-troubleshooting
