
Expand Troubleshooting with HF auth, FA2 cu_seqlens, and smoke tests#79

Open
lonexreb wants to merge 1 commit into NVlabs:main from lonexreb:docs/expand-troubleshooting-section

Conversation

Contributor

@lonexreb lonexreb commented May 3, 2026

Summary

Doc-only expansion of the existing Troubleshooting section in README.md. No code changes. Addresses several recurring issues by giving users a fast path to a fix without filing a new ticket.

What's added

1. IndexError: list index out of range when loading the dataset

Reproduces the exact traceback fragment users see when HF auth or gated-dataset access is missing, and points to the existing §3 Authenticate with HuggingFace step. Pairs with PR #74, which surfaces the same diagnostic at runtime.

→ Addresses #59, #61.
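Before re-running the dataset load, it helps to check that a token is configured at all. A minimal sketch (heuristic only, not the repository's code: it checks the `HF_TOKEN` environment variable and the default token file path used by `huggingface_hub`):

```python
import os
from pathlib import Path


def hf_token_present() -> bool:
    """Heuristic: return True if a Hugging Face token appears to be configured.

    Checks the HF_TOKEN environment variable, then the default token file
    written by `huggingface-cli login` (~/.cache/huggingface/token).
    """
    if os.environ.get("HF_TOKEN"):
        return True
    return (Path.home() / ".cache" / "huggingface" / "token").exists()


if __name__ == "__main__":
    if not hf_token_present():
        print("No HF token found -- run `huggingface-cli login` first.")
```

Note that a present token is necessary but not sufficient: the account must also have accepted the gated dataset's terms, which `huggingface-cli whoami` alone cannot verify.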

2. RuntimeError: cu_seqlens_q must have shape (batch_size + 1) with FlashAttention-2

One-paragraph explanation: the diffusion expert uses a 4D additive attention mask to handle variable-length VLM rollouts, which FA2 cannot consume. The shipped configuration keeps the VLM on FA2 and the expert on SDPA, so the error only appears when users force FA2 globally. Mentions the `expert_cfg.attn_implementation = "sdpa"` escape hatch.

→ Addresses #52. Pairs with PR #75, which makes this default explicit in code.
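The VLM-FA2 / expert-SDPA split can be sketched as a small validation step. This is a hypothetical stand-in, not the repository's actual config class; only the `expert_cfg.attn_implementation = "sdpa"` name comes from the PR text:

```python
from dataclasses import dataclass


@dataclass
class AttnConfig:
    """Hypothetical sketch of the shipped attention-implementation split."""
    vlm_attn_implementation: str = "flash_attention_2"  # VLM path: FA2 is fine
    expert_attn_implementation: str = "sdpa"            # expert path: needs a 4D additive mask


def validate(cfg: AttnConfig) -> None:
    # FA2 consumes packed variable-length sequences via cu_seqlens and
    # cannot take a 4D additive mask, so forcing it on the diffusion
    # expert produces the cu_seqlens_q shape error described above.
    if cfg.expert_attn_implementation == "flash_attention_2":
        raise ValueError(
            "The diffusion expert uses a 4D additive attention mask; "
            "set expert_cfg.attn_implementation = 'sdpa' instead."
        )


validate(AttnConfig())  # shipped defaults pass
```

The point of the sketch is only that the error is configuration-induced: the defaults never hit it, and a global FA2 override is the one path that does.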

3. Verifying your local installation

Three quick import-level smoke tests (`torch.cuda.is_available()`, `from alpamayo_r1.models.alpamayo_r1 import AlpamayoR1`, `huggingface-cli whoami`) plus the end-to-end `test_inference.py` command, so users can bisect a broken environment in a few seconds.

→ Addresses #33.
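The three checks above can be wrapped so a single run reports them all. A minimal sketch (this wrapper is illustrative, not part of the repo; it degrades gracefully when torch is absent, and only checks that the `huggingface-cli` binary exists rather than running `whoami`):

```python
import importlib.util
import shutil


def smoke_test() -> dict:
    """Report the three import-level checks from the Troubleshooting entry."""
    results = {}

    # 1. Is torch installed, and does it see a CUDA device?
    try:
        import torch
        results["cuda"] = torch.cuda.is_available()
    except ImportError:
        results["cuda"] = None  # torch itself is missing

    # 2. Is the alpamayo_r1 package importable at all?
    results["alpamayo_r1"] = importlib.util.find_spec("alpamayo_r1") is not None

    # 3. Is the HF CLI on PATH? (Run `huggingface-cli whoami` manually
    #    afterwards to confirm you are actually logged in.)
    results["hf_cli"] = shutil.which("huggingface-cli") is not None

    return results


if __name__ == "__main__":
    for check, ok in smoke_test().items():
        print(f"{check}: {ok}")
```

Each `False`/`None` narrows the failure to one layer (driver/torch, package install, or HF auth) before reaching for the full `test_inference.py` run.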

Why this is worth doing as docs

Looking at the open-issue distribution: the same handful of root causes (HF auth, FA2 incompatibility with the expert, missing GPU/imports) explain a large fraction of bug reports. A focused Troubleshooting section is the smallest-blast-radius fix that also reduces maintainer triage load.

Test plan

  • Markdown renders cleanly.
  • The `#3-authenticate-with-huggingface` anchor resolves to the existing heading.
  • No links to non-existent files or future PR numbers.

Address recurring questions in the issue tracker by extending the
existing Troubleshooting section:

1. New entry for the IndexError that physical_ai_av raises when the
   gated PhysicalAI-Autonomous-Vehicles dataset is inaccessible —
   reproduces the exact traceback fragment users will see and links to
   the auth section. Addresses NVlabs#59, NVlabs#61.

2. New entry for the FA2 cu_seqlens_q crash in the diffusion expert
   path when FA2 is forced globally — explains the 4D-mask root cause
   in one paragraph and the shipped VLM-FA2 / expert-SDPA split.
   Addresses NVlabs#52.

3. New entry with three import-level smoke tests (torch.cuda,
   alpamayo_r1 import, huggingface-cli whoami) plus the end-to-end
   test_inference.py command, so users can rapidly bisect a broken
   environment. Addresses NVlabs#33.

No code changes.

Signed-off-by: lonexreb <[email protected]>
