
feat(megatron): expose checkpoint parallelism and RNG knobs#2391

Open
lonexreb wants to merge 3 commits into NVIDIA-NeMo:main from lonexreb:feat/2229-expose-checkpoint-parallelism

Conversation

@lonexreb lonexreb commented May 4, 2026

Summary

Stacks on #2390. Completes the rest of the issue scope from #2229
that was deferred from #2390.

Three more Megatron-LM checkpoint knobs are now read from
`policy.megatron_cfg` instead of being hard-coded:

| Config key | Old hard-coded value | Why expose it |
| --- | --- | --- |
| `policy.megatron_cfg.fully_parallel_save` | `True` | Some storage backends prefer the non-parallel path; disabling also simplifies debugging |
| `policy.megatron_cfg.fully_parallel_load` | `True` | Same — useful for both NFS quirks and step-by-step debugging |
| `policy.megatron_cfg.load_rng` | `False` | Enable to include/restore RNG state for bit-exact resume |

When unset, the historical hard-coded behaviour is preserved exactly,
so existing user configs are untouched.
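The fallback pattern these knobs follow can be sketched as below. The key names and defaults come from the table above, but the helper function itself is illustrative, not the actual code in `setup.py`:

```python
# Illustrative sketch only: the real reads happen inline in
# nemo_rl/models/megatron/setup.py, not through a helper like this.
def read_checkpoint_knobs(megatron_cfg: dict) -> dict:
    return {
        # Defaults mirror the historical hard-coded values.
        "fully_parallel_save": megatron_cfg.get("fully_parallel_save", True),
        "fully_parallel_load": megatron_cfg.get("fully_parallel_load", True),
        "load_rng": megatron_cfg.get("load_rng", False),
    }
```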

Wiring

Coverage

`tests/unit/models/megatron/test_megatron_setup.py`:

  • `test_parallelism_and_rng_flags_default_when_absent` — verifies the
    three keys default to their historical values when omitted.
  • `test_parallelism_and_rng_flags_overridable_via_config` — verifies
    that user values flow through the call.

The reference-model code path is exercised by the existing end-to-end
Megatron tests (no new heavy fixtures needed).

Stacking note

This branch is built on top of the `feat/2229-expose-megatron-infra-params`
branch from #2390. Until #2390 merges, the diff GitHub shows here will
include #2390's changes too. After #2390 merges, this PR's diff will
narrow to only the additions documented above (~76 lines).

Backward compatibility

All three new fields are typed `NotRequired`. Existing user configs
that don't set the keys see no behaviour change. The exemplar
`grpo_math_1B_megatron.yaml` is updated with commented-out examples
so users can copy/paste the right keys.

Test plan

  • `ruff check` — clean
  • `ruff format` — clean
  • `python3 -m py_compile` — clean
  • CI `L0` (`L0_Unit_Tests_Other.sh` runs the new test cases under
    `--mcore-only`)

Refs: #2229
Depends on: #2390

lonexreb added 2 commits May 4, 2026 02:37
…NVIDIA-NeMo#2229)

Several Megatron-LM infrastructure knobs were hard-coded inside
``nemo_rl/models/megatron/setup.py`` even though the underlying
framework already supports them as configurable. Users running on
varied storage backends (NFS) or training very large models (e.g.
Nemotron 3 Super 120B) had to patch the source file to override them.

Expose the following four knobs as ``NotRequired`` fields on
``MegatronConfig``:

* ``distributed_timeout_minutes`` -- forwarded as ``timeout`` to
  ``torch.distributed.init_process_group``. The default NCCL timeout
  (~10 minutes) is too short for initial weight conversion of very
  large models on slow storage.
* ``async_save`` -- enables non-blocking checkpoint writes via
  Megatron-Core's existing async save plumbing.
* ``check_for_nan_in_grad`` -- toggles Megatron-LM's per-step
  NaN/Inf grad check; useful for debugging but adds non-trivial
  overhead in steady-state training.
* ``logging_level`` -- forwarded to Megatron-LM's ``LoggerConfig``.

When any key is absent the historical hard-coded behaviour is
preserved, so existing user configs continue to work unchanged.

Plumbing:

* ``setup_distributed()`` now optionally accepts the policy config
  and threads ``distributed_timeout_minutes`` through to
  ``init_process_group``.
* ``_create_checkpoint_config()`` accepts the policy config and reads
  ``async_save``.
* ``_create_megatron_config()`` reads ``check_for_nan_in_grad`` and
  ``logging_level`` from the existing ``config`` it already receives.
* ``MegatronPolicyWorker`` passes ``config`` to ``setup_distributed``.

Coverage:

* New ``TestSetupDistributedTimeout`` exercises the three timeout
  branches (no config / config without key / config with key).
* ``TestCreateCheckpointConfig`` gains three tests covering the
  ``async_save`` default and override behaviour.

Out of scope (intentionally left for a follow-up to keep this PR
reviewable): ``fully_parallel_save``, ``fully_parallel_load``, and
``load_rng``. These are referenced in two places (the main and
reference checkpoint configs) and changing them could affect
convergence-test behaviour, so they deserve their own focused PR.

Refs: NVIDIA-NeMo#2229
Signed-off-by: lonexreb <[email protected]>
…eMo#2229)

Follow-up to the infrastructure-knob exposure (PR NVIDIA-NeMo#2390) that completes
the rest of the parameters called out in NVIDIA-NeMo#2229. Three more knobs are
now read from ``policy.megatron_cfg`` instead of being hard-coded:

* ``fully_parallel_save``  (default True)
* ``fully_parallel_load``  (default True)
* ``load_rng``             (default False)

These were previously hard-coded in two places — the main
``_create_checkpoint_config`` call and ``setup_reference_model_state``'s
reference-model checkpoint config. Both call sites now read the same
keys, falling back to the historical hard-coded defaults when a key is
absent. Existing user configs that don't set the keys see no behavior
change.

Why the keys matter:

* ``fully_parallel_save/load`` toggle Megatron-Core's fully-parallel
  checkpoint I/O. Disabling them simplifies debugging at the cost of
  throughput; some storage backends prefer one mode over the other.
* ``load_rng`` controls whether RNG state is included in saved
  checkpoints and restored on load. Enable for bit-exact resume, leave
  disabled for typical training.
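The shared-reader shape described above could look like this sketch; `CheckpointKnobs` and `build_knobs` are hypothetical stand-ins, not Megatron-Core's real checkpoint-config API:

```python
from dataclasses import dataclass

@dataclass
class CheckpointKnobs:
    # Hypothetical container; field names mirror the config keys above.
    fully_parallel_save: bool = True
    fully_parallel_load: bool = True
    load_rng: bool = False

def build_knobs(megatron_cfg: dict) -> CheckpointKnobs:
    # Both call sites (main and reference-model) can share one reader,
    # with absent keys falling back to the historical defaults.
    return CheckpointKnobs(
        fully_parallel_save=megatron_cfg.get("fully_parallel_save", True),
        fully_parallel_load=megatron_cfg.get("fully_parallel_load", True),
        load_rng=megatron_cfg.get("load_rng", False),
    )
```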

Coverage:

* ``TestCreateCheckpointConfig`` gains two tests covering the
  defaults-when-absent branch and the explicit-override branch for all
  three keys.
* The reference-model code path is exercised by the existing
  end-to-end Megatron tests.

Refs: NVIDIA-NeMo#2229
Signed-off-by: lonexreb <[email protected]>
@lonexreb lonexreb requested review from a team as code owners May 4, 2026 08:01

copy-pr-bot Bot commented May 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

…#2229)

Address review feedback on PR NVIDIA-NeMo#2391: the previous

    megatron_cfg = config["megatron_cfg"] if config and "megatron_cfg" in config else {}
    async_save = megatron_cfg.get("async_save", False) if megatron_cfg else False

was redundant — when ``megatron_cfg`` is the empty dict, ``.get()``
already returns the default, and the ``if megatron_cfg else default``
chain only differs from a bare ``.get()`` when ``megatron_cfg`` is
falsy, in which case it is the empty dict and ``.get()`` returns the
same default. Simplify to four bare ``.get()`` calls and switch the
``config`` truthiness check to an explicit ``config is not None``,
matching the pattern in PR NVIDIA-NeMo#2390 and aligning the two PRs' style.

Behaviour is identical: defaults match the historical hard-coded
values when keys are absent, and explicit user values flow through.
The existing tests in ``TestCreateCheckpointConfig`` continue to
exercise both branches.

Refs: NVIDIA-NeMo#2229
Signed-off-by: lonexreb <[email protected]>

lonexreb commented May 4, 2026

Simplified the `_create_checkpoint_config` guard logic (commit 7f09256b) per code review feedback.

The previous

```python
megatron_cfg = config["megatron_cfg"] if config and "megatron_cfg" in config else {}
async_save = megatron_cfg.get("async_save", False) if megatron_cfg else False
```

was redundant — when `megatron_cfg` is the empty dict, `.get()` already returns the default, and the `if megatron_cfg else default` chain only differs from a bare `.get()` when `megatron_cfg` is falsy, in which case it is the empty dict and `.get()` returns the same default. The four reads are now bare `.get()` calls and the `config` truthiness check is now an explicit `config is not None` (matching the style in the parent #2390 PR).

Behaviour is identical and the existing tests continue to pass.

@svcnvidia-nemo-ci svcnvidia-nemo-ci added the waiting-on-maintainers Waiting on maintainers to respond label May 6, 2026