
Expand Troubleshooting with HF auth, FA2 cu_seqlens, and smoke tests#79

Open
lonexreb wants to merge 1 commit into NVlabs:main from lonexreb:docs/expand-troubleshooting-section

Conversation

Contributor

@lonexreb lonexreb commented May 3, 2026

Summary

Doc-only expansion of the existing Troubleshooting section in README.md. No code changes. Addresses several recurring issues by giving users a fast path to a fix without filing a new ticket.

What's added

1. IndexError: list index out of range when loading the dataset

Reproduces the exact traceback fragment users see when HF auth or gated-dataset access is missing, and points to the existing §3 Authenticate with HuggingFace step. Pairs with PR #74, which surfaces the same diagnostic at runtime.

→ Addresses #59, #61.
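Before re-running the dataset load, it helps to check that a token is configured at all. A minimal sketch (heuristic only, not the repository's code: it checks the `HF_TOKEN` environment variable and the default token file path used by `huggingface_hub`):

```python
import os
from pathlib import Path


def hf_token_present() -> bool:
    """Heuristic: return True if a Hugging Face token appears to be configured.

    Checks the HF_TOKEN environment variable, then the default token file
    written by `huggingface-cli login` (~/.cache/huggingface/token).
    """
    if os.environ.get("HF_TOKEN"):
        return True
    return (Path.home() / ".cache" / "huggingface" / "token").exists()


if __name__ == "__main__":
    if not hf_token_present():
        print("No HF token found -- run `huggingface-cli login` first.")
```

Note that a present token is necessary but not sufficient: the account must also have accepted the gated dataset's terms, which `huggingface-cli whoami` alone cannot verify.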

2. RuntimeError: cu_seqlens_q must have shape (batch_size + 1) with FlashAttention-2

One-paragraph explanation: the diffusion expert uses a 4D additive attention mask to handle variable-length VLM rollouts, which FA2 cannot consume. The shipped configuration keeps the VLM on FA2 and the expert on SDPA, so the error only appears when users force FA2 globally. Mentions the `expert_cfg.attn_implementation = "sdpa"` escape hatch.

→ Addresses #52. Pairs with PR #75, which makes this default explicit in code.
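The VLM-FA2 / expert-SDPA split can be sketched as a small validation step. This is a hypothetical stand-in, not the repository's actual config class; only the `expert_cfg.attn_implementation = "sdpa"` name comes from the PR text:

```python
from dataclasses import dataclass


@dataclass
class AttnConfig:
    """Hypothetical sketch of the shipped attention-implementation split."""
    vlm_attn_implementation: str = "flash_attention_2"  # VLM path: FA2 is fine
    expert_attn_implementation: str = "sdpa"            # expert path: needs a 4D additive mask


def validate(cfg: AttnConfig) -> None:
    # FA2 consumes packed variable-length sequences via cu_seqlens and
    # cannot take a 4D additive mask, so forcing it on the diffusion
    # expert produces the cu_seqlens_q shape error described above.
    if cfg.expert_attn_implementation == "flash_attention_2":
        raise ValueError(
            "The diffusion expert uses a 4D additive attention mask; "
            "set expert_cfg.attn_implementation = 'sdpa' instead."
        )


validate(AttnConfig())  # shipped defaults pass
```

The point of the sketch is only that the error is configuration-induced: the defaults never hit it, and a global FA2 override is the one path that does.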

3. Verifying your local installation

Three quick import-level smoke tests (`torch.cuda.is_available()`, `from alpamayo_r1.models.alpamayo_r1 import AlpamayoR1`, `huggingface-cli whoami`) plus the end-to-end `test_inference.py` command, so users can bisect a broken environment in a few seconds.

→ Addresses #33.
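The three checks above can be wrapped so a single run reports them all. A minimal sketch (this wrapper is illustrative, not part of the repo; it degrades gracefully when torch is absent, and only checks that the `huggingface-cli` binary exists rather than running `whoami`):

```python
import importlib.util
import shutil


def smoke_test() -> dict:
    """Report the three import-level checks from the Troubleshooting entry."""
    results = {}

    # 1. Is torch installed, and does it see a CUDA device?
    try:
        import torch
        results["cuda"] = torch.cuda.is_available()
    except ImportError:
        results["cuda"] = None  # torch itself is missing

    # 2. Is the alpamayo_r1 package importable at all?
    results["alpamayo_r1"] = importlib.util.find_spec("alpamayo_r1") is not None

    # 3. Is the HF CLI on PATH? (Run `huggingface-cli whoami` manually
    #    afterwards to confirm you are actually logged in.)
    results["hf_cli"] = shutil.which("huggingface-cli") is not None

    return results


if __name__ == "__main__":
    for check, ok in smoke_test().items():
        print(f"{check}: {ok}")
```

Each `False`/`None` narrows the failure to one layer (driver/torch, package install, or HF auth) before reaching for the full `test_inference.py` run.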

Why this is worth doing as docs

Looking at the open-issue distribution: the same handful of root causes (HF auth, FA2 incompatibility with the expert, missing GPU/imports) explain a large fraction of bug reports. A focused Troubleshooting section is the smallest-blast-radius fix that also reduces maintainer triage load.

Test plan

  • Markdown renders cleanly.
  • The `#3-authenticate-with-huggingface` anchor resolves to the existing heading.
  • No links to non-existent files or future PR numbers.

Address recurring questions in the issue tracker by extending the
existing Troubleshooting section:

1. New entry for the IndexError that physical_ai_av raises when the
   gated PhysicalAI-Autonomous-Vehicles dataset is inaccessible —
   reproduces the exact traceback fragment users will see and links to
   the auth section. Addresses NVlabs#59, NVlabs#61.

2. New entry for the FA2 cu_seqlens_q crash in the diffusion expert
   path when FA2 is forced globally — explains the 4D-mask root cause
   in one paragraph and the shipped VLM-FA2 / expert-SDPA split.
   Addresses NVlabs#52.

3. New entry with three import-level smoke tests (torch.cuda,
   alpamayo_r1 import, huggingface-cli whoami) plus the end-to-end
   test_inference.py command, so users can rapidly bisect a broken
   environment. Addresses NVlabs#33.

No code changes.

Signed-off-by: lonexreb <[email protected]>
