feat(megatron): expose checkpoint parallelism and RNG knobs #2391
lonexreb wants to merge 3 commits into NVIDIA-NeMo:main
Conversation
…NVIDIA-NeMo#2229)

Several Megatron-LM infrastructure knobs were hard-coded inside ``nemo_rl/models/megatron/setup.py`` even though the underlying framework already supports them as configurable. Users running on varied storage backends (NFS) or training very large models (e.g. Nemotron 3 Super 120B) had to patch the source file to override them.

Expose the following four knobs as ``NotRequired`` fields on ``MegatronConfig``:

* ``distributed_timeout_minutes`` -- forwarded as ``timeout`` to ``torch.distributed.init_process_group``. The default NCCL timeout (~10 minutes) is too short for the initial weight conversion of very large models on slow storage.
* ``async_save`` -- enables non-blocking checkpoint writes via Megatron-Core's existing async-save plumbing.
* ``check_for_nan_in_grad`` -- toggles Megatron-LM's per-step NaN/Inf grad check; useful for debugging but adds non-trivial overhead in steady-state training.
* ``logging_level`` -- forwarded to Megatron-LM's ``LoggerConfig``.

When any key is absent the historical hard-coded behaviour is preserved, so existing user configs continue to work unchanged.

Plumbing:

* ``setup_distributed()`` now optionally accepts the policy config and threads ``distributed_timeout_minutes`` through to ``init_process_group``.
* ``_create_checkpoint_config()`` accepts the policy config and reads ``async_save``.
* ``_create_megatron_config()`` reads ``check_for_nan_in_grad`` and ``logging_level`` from the existing ``config`` it already receives.
* ``MegatronPolicyWorker`` passes ``config`` to ``setup_distributed``.

Coverage:

* New ``TestSetupDistributedTimeout`` exercises the three timeout branches (no config / config without key / config with key).
* ``TestCreateCheckpointConfig`` gains three tests covering the ``async_save`` default and override behaviour.

Out of scope (intentionally left for a follow-up to keep this PR reviewable): ``fully_parallel_save``, ``fully_parallel_load``, and ``load_rng``. These are referenced in two places (the main and reference checkpoint configs) and changing them could affect convergence-test behaviour, so they deserve their own focused PR.

Refs: NVIDIA-NeMo#2229
Signed-off-by: lonexreb <[email protected]>
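As a rough illustration of the timeout plumbing described above, the sketch below shows how a policy config could be turned into the ``timeout`` argument for ``init_process_group``. The helper name ``resolve_distributed_timeout`` and the 10-minute fallback constant are illustrative assumptions, not the actual ``setup_distributed()`` code; only the ``distributed_timeout_minutes`` key name comes from this PR.

```python
from datetime import timedelta
from typing import Any, Optional

# NCCL's historical ~10-minute default, per the commit message above.
DEFAULT_TIMEOUT_MINUTES = 10


def resolve_distributed_timeout(config: Optional[dict[str, Any]]) -> timedelta:
    """Hypothetical helper: derive the timeout to pass to init_process_group.

    Falls back to the historical default when the policy config is absent
    or does not set ``distributed_timeout_minutes``.
    """
    megatron_cfg = config.get("megatron_cfg", {}) if config is not None else {}
    minutes = megatron_cfg.get("distributed_timeout_minutes", DEFAULT_TIMEOUT_MINUTES)
    return timedelta(minutes=minutes)


# Intended call site (sketch):
# torch.distributed.init_process_group(..., timeout=resolve_distributed_timeout(cfg))
```

The three branches exercised by ``TestSetupDistributedTimeout`` (no config / config without key / config with key) map directly onto this helper's control flow.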
…eMo#2229)

Follow-up to the infrastructure-knob expose (PR NVIDIA-NeMo#2390) that completes the rest of the parameters called out in NVIDIA-NeMo#2229. Three more knobs are now read from ``policy.megatron_cfg`` instead of being hard-coded:

* ``fully_parallel_save`` (default True)
* ``fully_parallel_load`` (default True)
* ``load_rng`` (default False)

These were previously hard-coded in two places: the main ``_create_checkpoint_config`` call and ``setup_reference_model_state``'s reference-model checkpoint config. Both call sites now read the same keys, falling back to the historical hard-coded defaults when a key is absent. Existing user configs that don't set the keys see no behaviour change.

Why the keys matter:

* ``fully_parallel_save``/``fully_parallel_load`` toggle Megatron-Core's fully-parallel checkpoint I/O. Disabling them simplifies debugging at the cost of throughput; some storage backends prefer one mode over the other.
* ``load_rng`` controls whether RNG state is included in saved checkpoints and restored on load. Enable it for bit-exact resume; leave it disabled for typical training.

Coverage:

* ``TestCreateCheckpointConfig`` gains two tests covering the defaults-when-absent branch and the explicit-override branch for all three keys.
* The reference-model code path is exercised by the existing end-to-end Megatron tests.

Refs: NVIDIA-NeMo#2229
Signed-off-by: lonexreb <[email protected]>
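The fallback behaviour for the three knobs can be sketched as a small helper. This is not the actual ``_create_checkpoint_config`` implementation; the helper name ``read_checkpoint_knobs`` is hypothetical, but the key names and historical defaults (True / True / False) come from the commit message above.

```python
from typing import Any, Optional


def read_checkpoint_knobs(config: Optional[dict[str, Any]]) -> dict[str, bool]:
    """Illustrative sketch: read the three checkpoint knobs from
    policy.megatron_cfg, falling back to the historical hard-coded defaults
    when the config or a key is absent."""
    megatron_cfg = config.get("megatron_cfg", {}) if config is not None else {}
    return {
        "fully_parallel_save": megatron_cfg.get("fully_parallel_save", True),
        "fully_parallel_load": megatron_cfg.get("fully_parallel_load", True),
        "load_rng": megatron_cfg.get("load_rng", False),
    }
```

Because both call sites (the main and the reference-model checkpoint config) read through the same logic, the two configs stay in agreement without duplicating defaults.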
…#2229)

Address review feedback on PR NVIDIA-NeMo#2391. The previous

```python
megatron_cfg = config["megatron_cfg"] if config and "megatron_cfg" in config else {}
async_save = megatron_cfg.get("async_save", False) if megatron_cfg else False
```

was redundant: when ``megatron_cfg`` is the empty dict, ``.get()`` already returns the default, and the ``if megatron_cfg else default`` chain only differs from a bare ``.get()`` when ``megatron_cfg`` is falsy, in which case it is the empty dict and ``.get()`` returns the same default.

Simplify to four bare ``.get()`` calls and switch the ``config`` truthiness check to an explicit ``config is not None``, matching the pattern in PR NVIDIA-NeMo#2390 and aligning the two PRs' style. Behaviour is identical: defaults match the historical hard-coded values when keys are absent, and explicit user values flow through. The existing tests in ``TestCreateCheckpointConfig`` continue to exercise both branches.

Refs: NVIDIA-NeMo#2229
Signed-off-by: lonexreb <[email protected]>
Simplified the redundant config read. The previous

```python
megatron_cfg = config["megatron_cfg"] if config and "megatron_cfg" in config else {}
async_save = megatron_cfg.get("async_save", False) if megatron_cfg else False
```

was redundant: when `megatron_cfg` is the empty dict, `.get()` already returns the default, and the `if megatron_cfg else default` chain only differs from a bare `.get()` when `megatron_cfg` is falsy, in which case it is the empty dict and `.get()` returns the same default. The four reads are now bare `.get()` calls and the `config` truthiness check is now an explicit `config is not None` (matching the style in the parent #2390 PR). Behaviour is identical and the existing tests continue to pass.
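The equivalence argued in the review can be restated as runnable code. The function names below are illustrative (the real code lives inside `_create_checkpoint_config`); the verbose body is the pattern quoted above, and the simple body is the bare-`.get()` form the PR switched to.

```python
from typing import Any, Optional


def async_save_verbose(config: Optional[dict[str, Any]]) -> bool:
    # The pre-review pattern: redundant guard around .get().
    megatron_cfg = config["megatron_cfg"] if config and "megatron_cfg" in config else {}
    return megatron_cfg.get("async_save", False) if megatron_cfg else False


def async_save_simple(config: Optional[dict[str, Any]]) -> bool:
    # The post-review pattern: an empty dict's .get() already yields the default.
    megatron_cfg = config.get("megatron_cfg", {}) if config is not None else {}
    return megatron_cfg.get("async_save", False)
```

For any well-formed config (None, empty, key absent, or key set) the two functions return the same value, which is exactly why the verbose guard could be dropped.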
Summary
Stacks on #2390 and completes the remaining #2229 scope that was deferred from that PR.
Three more Megatron-LM checkpoint knobs are now read from
`policy.megatron_cfg` instead of being hard-coded:

- `fully_parallel_save` (default `True`)
- `fully_parallel_load` (default `True`)
- `load_rng` (default `False`)

When unset, the historical hard-coded behaviour is preserved exactly,
so existing user configs are untouched.
Wiring

- `_create_checkpoint_config()` reads the three keys from the `config`
  it now receives (added by feat(megatron): expose hardcoded infrastructure params to user config #2390).
- `setup_reference_model_state`'s reference-model checkpoint config reads
  the same `fully_parallel_save`, `fully_parallel_load`, and
  `load_rng` so the reference model honours the same checkpoint-I/O
  preferences as the main model.
Coverage

New tests in `tests/unit/models/megatron/test_megatron_setup.py`:

- `TestCreateCheckpointConfig` gains a test verifying that all
  three keys default to their historical values when omitted.
- A second test sets all three keys explicitly and verifies
  that user values flow through the call.

The reference-model code path is exercised by the existing end-to-end
Megatron tests (no new heavy fixtures needed).
Stacking note
This branch is built on top of the `feat/2229-expose-megatron-infra-params`
branch from #2390. Until #2390 merges, the diff GitHub shows here will
include #2390's changes too. After #2390 merges, this PR's diff will
narrow to only the additions documented above (~76 lines).
Backward compatibility
All three new fields are typed `NotRequired`. Existing user configs
that don't set the keys see no behaviour change. The exemplar
`grpo_math_1B_megatron.yaml` is updated with commented-out examples
so users can copy/paste the right keys.
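The backward-compatibility guarantee rests on `NotRequired` typing: an absent key is legal at the type level, and readers supply the default at the call site. A minimal sketch of that shape (the real `MegatronConfig` has many more fields; this trimmed TypedDict is illustrative only):

```python
try:
    from typing import NotRequired, TypedDict  # Python 3.11+
except ImportError:  # older interpreters
    from typing_extensions import NotRequired, TypedDict


class MegatronConfigSketch(TypedDict):
    """Illustrative subset of MegatronConfig: only the three new knobs."""

    fully_parallel_save: NotRequired[bool]  # historical default: True
    fully_parallel_load: NotRequired[bool]  # historical default: True
    load_rng: NotRequired[bool]             # historical default: False


# An empty config is still a valid MegatronConfigSketch; readers fall back
# to the historical defaults via .get(), so old configs behave identically.
cfg: MegatronConfigSketch = {}
assert cfg.get("load_rng", False) is False
```

This is why existing user configs need no migration: omitting a key type-checks and behaves exactly as the old hard-coded value did.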
Test plan

- New unit tests in `tests/unit/models/megatron/test_megatron_setup.py` pass (run with
  `--mcore-only`)
Refs: #2229
Depends on: #2390