
feat(megatron): expose checkpoint parallelism and RNG knobs#2391

Open
lonexreb wants to merge 3 commits into NVIDIA-NeMo:main from lonexreb:feat/2229-expose-checkpoint-parallelism

Conversation

@lonexreb lonexreb commented May 4, 2026

Summary

Stacks on #2390. Completes the rest of the issue scope from #2229
that was deferred from #2390.

Three more Megatron-LM checkpoint knobs are now read from
`policy.megatron_cfg` instead of being hard-coded:

| Config key | Old hard-coded value | Why expose it |
| --- | --- | --- |
| `policy.megatron_cfg.fully_parallel_save` | `True` | Some storage backends prefer the non-parallel path; disabling also simplifies debugging |
| `policy.megatron_cfg.fully_parallel_load` | `True` | Same — useful for both NFS quirks and step-by-step debugging |
| `policy.megatron_cfg.load_rng` | `False` | Enable to include/restore RNG state for bit-exact resume |

When unset, the historical hard-coded behaviour is preserved exactly,
so existing user configs are untouched.
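The fallback pattern these knobs follow can be sketched as below. The key names and defaults come from the table above, but the helper function itself is illustrative, not the actual code in `setup.py`:

```python
# Illustrative sketch only: the real reads happen inline in
# nemo_rl/models/megatron/setup.py, not through a helper like this.
def read_checkpoint_knobs(megatron_cfg: dict) -> dict:
    return {
        # Defaults mirror the historical hard-coded values.
        "fully_parallel_save": megatron_cfg.get("fully_parallel_save", True),
        "fully_parallel_load": megatron_cfg.get("fully_parallel_load", True),
        "load_rng": megatron_cfg.get("load_rng", False),
    }
```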

Wiring

Coverage

`tests/unit/models/megatron/test_megatron_setup.py`:

  • `test_parallelism_and_rng_flags_default_when_absent` — verifies the
    three keys default to their historical values when omitted.
  • `test_parallelism_and_rng_flags_overridable_via_config` — verifies
    that user values flow through the call.

The reference-model code path is exercised by the existing end-to-end
Megatron tests (no new heavy fixtures needed).

Stacking note

This branch is built on top of the `feat/2229-expose-megatron-infra-params`
branch from #2390. Until #2390 merges, the diff GitHub shows here will
include #2390's changes too. After #2390 merges, this PR's diff will
narrow to only the additions documented above (~76 lines).

Backward compatibility

All three new fields are typed `NotRequired`. Existing user configs
that don't set the keys see no behaviour change. The exemplar
`grpo_math_1B_megatron.yaml` is updated with commented-out examples
so users can copy/paste the right keys.

Test plan

  • `ruff check` — clean
  • `ruff format` — clean
  • `python3 -m py_compile` — clean
  • CI `L0` (`L0_Unit_Tests_Other.sh` runs the new test cases under
    `--mcore-only`)

Refs: #2229
Depends on: #2390

lonexreb added 2 commits May 4, 2026 02:37
…NVIDIA-NeMo#2229)

Several Megatron-LM infrastructure knobs were hard-coded inside
``nemo_rl/models/megatron/setup.py`` even though the underlying
framework already supports them as configurable. Users running on
varied storage backends (NFS) or training very large models (e.g.
Nemotron 3 Super 120B) had to patch the source file to override them.

Expose the following four knobs as ``NotRequired`` fields on
``MegatronConfig``:

* ``distributed_timeout_minutes`` -- forwarded as ``timeout`` to
  ``torch.distributed.init_process_group``. The default NCCL timeout
  (~10 minutes) is too short for initial weight conversion of very
  large models on slow storage.
* ``async_save`` -- enables non-blocking checkpoint writes via
  Megatron-Core's existing async save plumbing.
* ``check_for_nan_in_grad`` -- toggles Megatron-LM's per-step
  NaN/Inf grad check; useful for debugging but adds non-trivial
  overhead in steady-state training.
* ``logging_level`` -- forwarded to Megatron-LM's ``LoggerConfig``.

When any key is absent the historical hard-coded behaviour is
preserved, so existing user configs continue to work unchanged.

Plumbing:

* ``setup_distributed()`` now optionally accepts the policy config
  and threads ``distributed_timeout_minutes`` through to
  ``init_process_group``.
* ``_create_checkpoint_config()`` accepts the policy config and reads
  ``async_save``.
* ``_create_megatron_config()`` reads ``check_for_nan_in_grad`` and
  ``logging_level`` from the existing ``config`` it already receives.
* ``MegatronPolicyWorker`` passes ``config`` to ``setup_distributed``.

Coverage:

* New ``TestSetupDistributedTimeout`` exercises the three timeout
  branches (no config / config without key / config with key).
* ``TestCreateCheckpointConfig`` gains three tests covering the
  ``async_save`` default and override behaviour.

Out of scope (intentionally left for a follow-up to keep this PR
reviewable): ``fully_parallel_save``, ``fully_parallel_load``, and
``load_rng``. These are referenced in two places (the main and
reference checkpoint configs) and changing them could affect
convergence-test behaviour, so they deserve their own focused PR.

Refs: NVIDIA-NeMo#2229
Signed-off-by: lonexreb <[email protected]>
…eMo#2229)

Follow-up to the infrastructure-knob exposure (PR NVIDIA-NeMo#2390) that completes
the rest of the parameters called out in NVIDIA-NeMo#2229. Three more knobs are
now read from ``policy.megatron_cfg`` instead of being hard-coded:

* ``fully_parallel_save``  (default True)
* ``fully_parallel_load``  (default True)
* ``load_rng``             (default False)

These were previously hard-coded in two places — the main
``_create_checkpoint_config`` call and ``setup_reference_model_state``'s
reference-model checkpoint config. Both call sites now read the same
keys, falling back to the historical hard-coded defaults when a key is
absent. Existing user configs that don't set the keys see no behavior
change.

Why the keys matter:

* ``fully_parallel_save/load`` toggle Megatron-Core's fully-parallel
  checkpoint I/O. Disabling them simplifies debugging at the cost of
  throughput; some storage backends prefer one mode over the other.
* ``load_rng`` controls whether RNG state is included in saved
  checkpoints and restored on load. Enable for bit-exact resume, leave
  disabled for typical training.
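The shared-reader shape described above could look like this sketch; `CheckpointKnobs` and `build_knobs` are hypothetical stand-ins, not Megatron-Core's real checkpoint-config API:

```python
from dataclasses import dataclass

@dataclass
class CheckpointKnobs:
    # Hypothetical container; field names mirror the config keys above.
    fully_parallel_save: bool = True
    fully_parallel_load: bool = True
    load_rng: bool = False

def build_knobs(megatron_cfg: dict) -> CheckpointKnobs:
    # Both call sites (main and reference-model) can share one reader,
    # with absent keys falling back to the historical defaults.
    return CheckpointKnobs(
        fully_parallel_save=megatron_cfg.get("fully_parallel_save", True),
        fully_parallel_load=megatron_cfg.get("fully_parallel_load", True),
        load_rng=megatron_cfg.get("load_rng", False),
    )
```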

Coverage:

* ``TestCreateCheckpointConfig`` gains two tests covering the
  defaults-when-absent branch and the explicit-override branch for all
  three keys.
* The reference-model code path is exercised by the existing
  end-to-end Megatron tests.

Refs: NVIDIA-NeMo#2229
Signed-off-by: lonexreb <[email protected]>
@lonexreb lonexreb requested review from a team as code owners May 4, 2026 08:01

copy-pr-bot Bot commented May 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

…#2229)

Address review feedback on PR NVIDIA-NeMo#2391: the previous

    megatron_cfg = config["megatron_cfg"] if config and "megatron_cfg" in config else {}
    async_save = megatron_cfg.get("async_save", False) if megatron_cfg else False

was redundant — when ``megatron_cfg`` is the empty dict, ``.get()``
already returns the default, and the ``if megatron_cfg else default``
chain only differs from a bare ``.get()`` when ``megatron_cfg`` is
falsy, in which case it is the empty dict and ``.get()`` returns the
same default. Simplify to four bare ``.get()`` calls and switch the
``config`` truthiness check to an explicit ``config is not None``,
matching the pattern in PR NVIDIA-NeMo#2390 and aligning the two PRs' style.

Behaviour is identical: defaults match the historical hard-coded
values when keys are absent, and explicit user values flow through.
The existing tests in ``TestCreateCheckpointConfig`` continue to
exercise both branches.

Refs: NVIDIA-NeMo#2229
Signed-off-by: lonexreb <[email protected]>

lonexreb commented May 4, 2026

Simplified the `_create_checkpoint_config` guard logic (commit 7f09256b) per code review feedback.

The previous

```python
megatron_cfg = config["megatron_cfg"] if config and "megatron_cfg" in config else {}
async_save = megatron_cfg.get("async_save", False) if megatron_cfg else False
```

was redundant — when `megatron_cfg` is the empty dict, `.get()` already returns the default, and the `if megatron_cfg else default` chain only differs from a bare `.get()` when `megatron_cfg` is falsy, in which case it is the empty dict and `.get()` returns the same default. The four reads are now bare `.get()` calls and the `config` truthiness check is now an explicit `config is not None` (matching the style in the parent #2390 PR).

Behaviour is identical and the existing tests continue to pass.

@svcnvidia-nemo-ci svcnvidia-nemo-ci added the waiting-on-maintainers Waiting on maintainers to respond label May 6, 2026