
feat(megatron): expose hardcoded infrastructure params to user config#2390

Open
lonexreb wants to merge 1 commit into NVIDIA-NeMo:main from lonexreb:feat/2229-expose-megatron-infra-params

Conversation

@lonexreb

@lonexreb lonexreb commented May 4, 2026

Summary

Closes #2229 (subset — see "Out of scope" below).

Several Megatron-LM infrastructure knobs were hard-coded inside
`nemo_rl/models/megatron/setup.py` even though the underlying
framework already supports them as configurable. Users running on
varied storage backends (NFS) or training very large models (e.g.
Nemotron 3 Super 120B) had to patch the source file to override them
— the workaround documented in the issue itself.

Expose four of those knobs as `NotRequired` fields on
`MegatronConfig`. When unset, the historical hard-coded behaviour
is preserved exactly, so existing user configs are untouched.

What's exposed

| Config key | Old hard-coded value | Why expose it |
| --- | --- | --- |
| `policy.megatron_cfg.distributed_timeout_minutes` | NCCL default (~10 min) | Initial weight conversion for 100B+ models on NFS exceeds the default and crashes `init_process_group` |
| `policy.megatron_cfg.async_save` | `False` | Non-blocking checkpoint writes; the Megatron-Core plumbing already exists, this just turns it on |
| `policy.megatron_cfg.check_for_nan_in_grad` | `True` | Can be disabled in steady-state production to reduce per-step overhead |
| `policy.megatron_cfg.logging_level` | `0` | Raise for debugging; previously not overridable |
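For reference, here is what the keys look like in a user YAML (key names are from this PR; the values are illustrative, not recommendations):

```yaml
policy:
  megatron_cfg:
    # All four keys are optional; omitting one keeps the old hard-coded value.
    distributed_timeout_minutes: 120   # default: NCCL's ~10 min
    async_save: true                   # default: false
    check_for_nan_in_grad: false       # default: true
    logging_level: 2                   # default: 0
```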

Wiring

  • `setup_distributed()` now accepts an optional policy config and threads `distributed_timeout_minutes` into `torch.distributed.init_process_group(timeout=...)`.
  • `_create_checkpoint_config()` accepts the policy config and reads `async_save`.
  • `_create_megatron_config()` reads `check_for_nan_in_grad` and `logging_level` from the `config` it already receives.
  • `MegatronPolicyWorker` passes `config` to `setup_distributed`.
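The timeout wiring above can be sketched as follows (a minimal stand-in; the function and constant names are illustrative, not the PR's exact code):

```python
from datetime import timedelta
from typing import Any, Optional

# Assumption for illustration: NCCL's default process-group timeout is ~10 minutes.
DEFAULT_PG_TIMEOUT_MINUTES = 10


def resolve_pg_timeout(config: Optional[dict[str, Any]]) -> timedelta:
    """Return the user-configured timeout if present, else the historical default."""
    if config is not None and "distributed_timeout_minutes" in config.get("megatron_cfg", {}):
        return timedelta(minutes=config["megatron_cfg"]["distributed_timeout_minutes"])
    return timedelta(minutes=DEFAULT_PG_TIMEOUT_MINUTES)


# The resolved value would then flow into
# torch.distributed.init_process_group(..., timeout=resolve_pg_timeout(config))
```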

Coverage

New tests in `tests/unit/models/megatron/test_megatron_setup.py`:

  • `TestSetupDistributedTimeout` — three test cases covering no-config / config-without-key / config-with-key. Patches `init_process_group` and asserts the right kwargs flow through.
  • `TestCreateCheckpointConfig` extended with three new tests for the `async_save` default-vs-override behaviour.

All marked `@pytest.mark.mcore` to fit the project's test-tier conventions.
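The shape of those timeout tests can be sketched like this (a simplified stand-in, not the repo's actual test code; the real tests patch `torch.distributed.init_process_group` instead of taking it as a parameter):

```python
from datetime import timedelta
from unittest import mock


def setup_distributed(init_process_group, config=None):
    """Simplified stand-in: only forwards the timeout kwarg when the key is set."""
    kwargs = {}
    if config is not None and "distributed_timeout_minutes" in config.get("megatron_cfg", {}):
        kwargs["timeout"] = timedelta(minutes=config["megatron_cfg"]["distributed_timeout_minutes"])
    init_process_group(backend="nccl", **kwargs)


# The three branches the tests cover:
pg = mock.Mock()
setup_distributed(pg)                                                          # no config
setup_distributed(pg, {"megatron_cfg": {}})                                    # config without key
setup_distributed(pg, {"megatron_cfg": {"distributed_timeout_minutes": 120}})  # config with key
assert pg.call_args_list[0] == mock.call(backend="nccl")
assert pg.call_args_list[1] == mock.call(backend="nccl")
assert pg.call_args_list[2] == mock.call(backend="nccl", timeout=timedelta(minutes=120))
```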

Backward compatibility

All four fields are typed `NotRequired` and the code uses explicit `if "key" in config["megatron_cfg"]:` checks (matching the existing pattern in this file for `force_reconvert_from_hf` and friends). When a key is absent, the value used is identical to the previous hard-coded constant. No exemplar YAML or recipe needs to change for this PR to be safe.
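The guard pattern in question can be sketched as (defaults match the old hard-coded constants; the helper name is hypothetical):

```python
from typing import Any

# Historical hard-coded values, used whenever the key is absent.
HARDCODED_DEFAULTS = {
    "async_save": False,
    "check_for_nan_in_grad": True,
    "logging_level": 0,
}


def read_knob(megatron_cfg: dict[str, Any], key: str) -> Any:
    """Explicit membership check, mirroring the file's existing style."""
    if key in megatron_cfg:
        return megatron_cfg[key]
    return HARDCODED_DEFAULTS[key]
```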

The `grpo_math_1B_megatron.yaml` exemplar is updated with commented-out examples so users can copy/paste the right keys.

Out of scope

`fully_parallel_save`, `fully_parallel_load`, and `load_rng` are also called out in #2229. They're referenced in two places — `_create_checkpoint_config` and `setup_reference_model_state` — and exposing them affects checkpoint-format and reproducibility behaviour that could nudge convergence tests. Worth a focused follow-up PR with its own test plan rather than bundling here.

The unconstrained `accelerate` dependency is being addressed in #2388.

Test plan

  • `ruff check` — clean
  • `ruff format` — clean
  • `python3 -m py_compile` on all touched files — clean
  • CI `L0` (`L0_Unit_Tests_Other.sh` runs `tests/unit/models/megatron/test_megatron_setup.py` under `--mcore-only`)
  • Spot-check on a real run: confirm a config with `distributed_timeout_minutes: 120` actually extends the `init_process_group` timeout (best done by the maintainers with the 120B repro from issue #2229, "Expose hardcoded Megatron infrastructure parameters to user config")

Refs: #2229

…NVIDIA-NeMo#2229)

Several Megatron-LM infrastructure knobs were hard-coded inside
``nemo_rl/models/megatron/setup.py`` even though the underlying
framework already supports them as configurable. Users running on
varied storage backends (NFS) or training very large models (e.g.
Nemotron 3 Super 120B) had to patch the source file to override them.

Expose the following four knobs as ``NotRequired`` fields on
``MegatronConfig``:

* ``distributed_timeout_minutes`` -- forwarded as ``timeout`` to
  ``torch.distributed.init_process_group``. The default NCCL timeout
  (~10 minutes) is too short for initial weight conversion of very
  large models on slow storage.
* ``async_save`` -- enables non-blocking checkpoint writes via
  Megatron-Core's existing async save plumbing.
* ``check_for_nan_in_grad`` -- toggles Megatron-LM's per-step
  NaN/Inf grad check; useful for debugging but adds non-trivial
  overhead in steady-state training.
* ``logging_level`` -- forwarded to Megatron-LM's ``LoggerConfig``.

When any key is absent the historical hard-coded behaviour is
preserved, so existing user configs continue to work unchanged.

Plumbing:

* ``setup_distributed()`` now optionally accepts the policy config
  and threads ``distributed_timeout_minutes`` through to
  ``init_process_group``.
* ``_create_checkpoint_config()`` accepts the policy config and reads
  ``async_save``.
* ``_create_megatron_config()`` reads ``check_for_nan_in_grad`` and
  ``logging_level`` from the existing ``config`` it already receives.
* ``MegatronPolicyWorker`` passes ``config`` to ``setup_distributed``.

Coverage:

* New ``TestSetupDistributedTimeout`` exercises the three timeout
  branches (no config / config without key / config with key).
* ``TestCreateCheckpointConfig`` gains three tests covering the
  ``async_save`` default and override behaviour.

Out of scope (intentionally left for a follow-up to keep this PR
reviewable): ``fully_parallel_save``, ``fully_parallel_load``, and
``load_rng``. These are referenced in two places (the main and
reference checkpoint configs) and changing them could affect
convergence-test behaviour, so they deserve their own focused PR.

Refs: NVIDIA-NeMo#2229
Signed-off-by: lonexreb <[email protected]>
@lonexreb lonexreb requested review from a team as code owners May 4, 2026 07:38
@copy-pr-bot

copy-pr-bot Bot commented May 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

lonexreb added a commit to lonexreb/RL that referenced this pull request May 4, 2026
…#2229)

Address review feedback on PR NVIDIA-NeMo#2391: the previous

    megatron_cfg = config["megatron_cfg"] if config and "megatron_cfg" in config else {}
    async_save = megatron_cfg.get("async_save", False) if megatron_cfg else False

was redundant: when ``megatron_cfg`` is the empty dict, ``.get()``
already returns the default, and the ``if megatron_cfg else default``
chain only differs from a bare ``.get()`` when ``megatron_cfg`` is
falsy, in which case it is the empty dict and ``.get()`` returns the
same default anyway. Simplify to four bare ``.get()`` calls and switch
the ``config`` truthiness check to an explicit ``config is not None``,
matching the pattern in PR NVIDIA-NeMo#2390 and aligning the two PRs' style.

Behaviour is identical: defaults match the historical hard-coded
values when keys are absent, and explicit user values flow through.
The existing tests in ``TestCreateCheckpointConfig`` continue to
exercise both branches.
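The claimed equivalence is easy to check exhaustively over the interesting inputs (function names here are illustrative):

```python
def async_save_old(config):
    # Original two-step form with the redundant `if megatron_cfg else` guard.
    megatron_cfg = config["megatron_cfg"] if config and "megatron_cfg" in config else {}
    return megatron_cfg.get("async_save", False) if megatron_cfg else False


def async_save_new(config):
    # Simplified form: bare .get() plus an explicit None check.
    megatron_cfg = (
        config["megatron_cfg"] if config is not None and "megatron_cfg" in config else {}
    )
    return megatron_cfg.get("async_save", False)


cases = (None, {}, {"megatron_cfg": {}}, {"megatron_cfg": {"async_save": True}})
assert all(async_save_old(c) == async_save_new(c) for c in cases)
```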

Refs: NVIDIA-NeMo#2229
Signed-off-by: lonexreb <[email protected]>
@svcnvidia-nemo-ci svcnvidia-nemo-ci added the waiting-on-maintainers Waiting on maintainers to respond label May 6, 2026

Labels

community-request waiting-on-maintainers Waiting on maintainers to respond

Development

Successfully merging this pull request may close these issues.

Expose hardcoded Megatron infrastructure parameters to user config

2 participants