feat(megatron): expose hardcoded infrastructure params to user config #2390
Open
lonexreb wants to merge 1 commit into NVIDIA-NeMo:main from
Conversation
…NVIDIA-NeMo#2229)

Several Megatron-LM infrastructure knobs were hard-coded inside
``nemo_rl/models/megatron/setup.py`` even though the underlying framework
already supports them as configurable. Users running on varied storage
backends (NFS) or training very large models (e.g. Nemotron 3 Super 120B)
had to patch the source file to override them.

Expose the following four knobs as ``NotRequired`` fields on ``MegatronConfig``:

* ``distributed_timeout_minutes`` -- forwarded as ``timeout`` to
  ``torch.distributed.init_process_group``. The default NCCL timeout
  (~10 minutes) is too short for initial weight conversion of very large
  models on slow storage.
* ``async_save`` -- enables non-blocking checkpoint writes via
  Megatron-Core's existing async save plumbing.
* ``check_for_nan_in_grad`` -- toggles Megatron-LM's per-step NaN/Inf grad
  check; useful for debugging but adds non-trivial overhead in steady-state
  training.
* ``logging_level`` -- forwarded to Megatron-LM's ``LoggerConfig``.

When any key is absent the historical hard-coded behaviour is preserved, so
existing user configs continue to work unchanged.

Plumbing:

* ``setup_distributed()`` now optionally accepts the policy config and
  threads ``distributed_timeout_minutes`` through to ``init_process_group``.
* ``_create_checkpoint_config()`` accepts the policy config and reads
  ``async_save``.
* ``_create_megatron_config()`` reads ``check_for_nan_in_grad`` and
  ``logging_level`` from the existing ``config`` it already receives.
* ``MegatronPolicyWorker`` passes ``config`` to ``setup_distributed``.

Coverage:

* New ``TestSetupDistributedTimeout`` exercises the three timeout branches
  (no config / config without key / config with key).
* ``TestCreateCheckpointConfig`` gains three tests covering the
  ``async_save`` default and override behaviour.

Out of scope (intentionally left for a follow-up to keep this PR reviewable):
``fully_parallel_save``, ``fully_parallel_load``, and ``load_rng``. These are
referenced in two places (the main and reference checkpoint configs) and
changing them could affect convergence-test behaviour, so they deserve their
own focused PR.

Refs: NVIDIA-NeMo#2229

Signed-off-by: lonexreb <[email protected]>
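For reference, a minimal sketch of what the four `NotRequired` fields could look like on `MegatronConfig` (the surrounding class body and the field types are assumptions; only the field names come from the commit above):

```python
# Python 3.11+; on older versions, import NotRequired from typing_extensions.
from typing import NotRequired, TypedDict


class MegatronConfig(TypedDict):
    # ... existing Megatron fields elided ...

    # New optional infrastructure knobs. When a key is absent, setup.py
    # keeps the historical hard-coded behaviour.
    distributed_timeout_minutes: NotRequired[int]  # passed as `timeout` to init_process_group
    async_save: NotRequired[bool]  # non-blocking checkpoint writes
    check_for_nan_in_grad: NotRequired[bool]  # per-step NaN/Inf grad check
    logging_level: NotRequired[int]  # forwarded to Megatron-LM's LoggerConfig
```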
lonexreb added a commit to lonexreb/RL that referenced this pull request on May 4, 2026
…#2229)

Address review feedback on PR NVIDIA-NeMo#2391: the previous

    megatron_cfg = config["megatron_cfg"] if config and "megatron_cfg" in config else {}
    async_save = megatron_cfg.get("async_save", False) if megatron_cfg else False

was redundant — when ``megatron_cfg`` is the empty dict, ``.get()`` already
returns the default, and the ``if megatron_cfg else default`` chain only
differs from a bare ``.get()`` when ``megatron_cfg`` is falsy, in which case
it is the empty dict and ``.get()`` returns the same default.

Simplify to four bare ``.get()`` calls and switch the ``config`` truthiness
check to an explicit ``config is not None``, matching the pattern in PR
NVIDIA-NeMo#2390 and aligning the two PRs' style.

Behaviour is identical: defaults match the historical hard-coded values when
keys are absent, and explicit user values flow through. The existing tests in
``TestCreateCheckpointConfig`` continue to exercise both branches.

Refs: NVIDIA-NeMo#2229

Signed-off-by: lonexreb <[email protected]>
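To make the before/after concrete, here is a self-contained sketch of the simplification (the enclosing helper is illustrative; only the two quoted lines come from the commit):

```python
from typing import Any, Optional


def _read_async_save(config: Optional[dict[str, Any]]) -> bool:
    """Illustrative helper contrasting the old and new forms."""
    # Before: redundant truthiness guard around .get().
    megatron_cfg = config["megatron_cfg"] if config and "megatron_cfg" in config else {}
    old = megatron_cfg.get("async_save", False) if megatron_cfg else False

    # After: explicit None check, then a bare .get() -- identical behaviour,
    # because .get() on the empty dict already returns the default.
    megatron_cfg = config["megatron_cfg"] if config is not None and "megatron_cfg" in config else {}
    new = megatron_cfg.get("async_save", False)

    assert new == old
    return new
```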
Summary
Closes #2229 (subset — see "Out of scope" below).
Several Megatron-LM infrastructure knobs were hard-coded inside
`nemo_rl/models/megatron/setup.py` even though the underlying
framework already supports them as configurable. Users running on
varied storage backends (NFS) or training very large models (e.g.
Nemotron 3 Super 120B) had to patch the source file to override them
— the workaround documented in the issue itself.
Expose four of those knobs as `NotRequired` fields on
`MegatronConfig`. When unset, the historical hard-coded behaviour
is preserved exactly, so existing user configs are untouched.
What's exposed
- `distributed_timeout_minutes`: forwarded as `timeout` to `torch.distributed.init_process_group`. The default NCCL timeout (~10 minutes) is too short for the initial weight conversion of very large models on slow storage.
- `async_save`: enables non-blocking checkpoint writes via Megatron-Core's existing async save plumbing.
- `check_for_nan_in_grad`: toggles Megatron-LM's per-step NaN/Inf grad check; useful for debugging but adds non-trivial overhead in steady-state training.
- `logging_level`: forwarded to Megatron-LM's `LoggerConfig`.
Wiring
- `setup_distributed()` now optionally accepts the policy config and threads `distributed_timeout_minutes` through to `init_process_group` (see the sketch after this list).
- `_create_checkpoint_config()` accepts the policy config and reads `async_save`.
- `_create_megatron_config()` reads `check_for_nan_in_grad` and `logging_level` from the `config` it already receives.
- `MegatronPolicyWorker` passes `config` to `setup_distributed`.
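A minimal sketch of the timeout threading, assuming `setup_distributed` takes the policy config as an optional argument and that the historical default is the ~10-minute NCCL timeout mentioned above (the constant name and function shape are illustrative, not the actual implementation):

```python
from datetime import timedelta
from typing import Any, Optional

import torch

# Assumed stand-in for the historical hard-coded behaviour.
_DEFAULT_TIMEOUT_MINUTES = 10


def setup_distributed(config: Optional[dict[str, Any]] = None) -> None:
    minutes = _DEFAULT_TIMEOUT_MINUTES
    megatron_cfg = config["megatron_cfg"] if config is not None else {}
    if "distributed_timeout_minutes" in megatron_cfg:
        minutes = megatron_cfg["distributed_timeout_minutes"]
    torch.distributed.init_process_group(
        backend="nccl",
        timeout=timedelta(minutes=minutes),
    )
```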
Coverage
New tests in `tests/unit/models/megatron/test_megatron_setup.py`:
- `TestSetupDistributedTimeout` exercises the three timeout branches: no config, config without the key, and config with the key.
- `TestCreateCheckpointConfig` gains three tests covering the `async_save` default and override behaviour.
All marked `@pytest.mark.mcore` to fit the project's test-tier conventions.
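A hedged sketch of how the three timeout branches might be exercised (the test class name and branch list come from the PR; the patch target assumes `setup.py` calls `torch.distributed.init_process_group` through the module attribute, and the 10-minute default is an assumption):

```python
from datetime import timedelta
from unittest.mock import patch

import pytest

from nemo_rl.models.megatron.setup import setup_distributed


@pytest.mark.mcore
class TestSetupDistributedTimeout:
    def _timeout_passed_to_init(self, config):
        with patch("torch.distributed.init_process_group") as init_pg:
            setup_distributed(config)
        return init_pg.call_args.kwargs["timeout"]

    def test_no_config_uses_default(self):
        assert self._timeout_passed_to_init(None) == timedelta(minutes=10)

    def test_config_without_key_uses_default(self):
        assert self._timeout_passed_to_init({"megatron_cfg": {}}) == timedelta(minutes=10)

    def test_config_with_key_overrides_timeout(self):
        cfg = {"megatron_cfg": {"distributed_timeout_minutes": 120}}
        assert self._timeout_passed_to_init(cfg) == timedelta(minutes=120)
```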
Backward compatibility
All four fields are typed `NotRequired` and the code uses explicit `if "key" in config["megatron_cfg"]:` checks (matching the existing pattern in this file for `force_reconvert_from_hf` and friends). When a key is absent, the value used is identical to the previous hard-coded constant. No exemplar YAML or recipe needs to change for this PR to be safe.
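As an illustration of that guard pattern for the two knobs read in `_create_megatron_config` (the attribute names on the Megatron-side config object are assumptions; only the user-facing keys come from the PR):

```python
def _apply_optional_knobs(megatron_config, config) -> None:
    """Illustrative: only override a default when the user actually set a key."""
    megatron_cfg = config["megatron_cfg"]
    if "check_for_nan_in_grad" in megatron_cfg:
        megatron_config.check_for_nan_in_grad = megatron_cfg["check_for_nan_in_grad"]
    if "logging_level" in megatron_cfg:
        megatron_config.logging_level = megatron_cfg["logging_level"]
```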
The `grpo_math_1B_megatron.yaml` exemplar is updated with commented-out examples so users can copy/paste the right keys.
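For example, the commented-out block might look like this (values are illustrative placeholders, not tuned recommendations):

```yaml
megatron_cfg:
  # Optional infrastructure knobs; uncomment to override the defaults.
  # distributed_timeout_minutes: 120  # longer NCCL timeout for huge models / slow storage
  # async_save: true                  # non-blocking checkpoint writes
  # check_for_nan_in_grad: false      # skip the per-step NaN/Inf grad check
  # logging_level: 20                 # forwarded to Megatron-LM's LoggerConfig
```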
Out of scope
`fully_parallel_save`, `fully_parallel_load`, and `load_rng` are also called out in #2229. They're referenced in two places — `_create_checkpoint_config` and `setup_reference_model_state` — and exposing them affects checkpoint-format and reproducibility behaviour that could nudge convergence tests. Worth a focused follow-up PR with its own test plan rather than bundling here.
The unconstrained `accelerate` dependency is being addressed in #2388.
Test plan
Refs: #2229