
feat(megatron): expose hardcoded infrastructure params to user config#2390

Open
lonexreb wants to merge 1 commit into NVIDIA-NeMo:main from lonexreb:feat/2229-expose-megatron-infra-params

Conversation

@lonexreb

@lonexreb lonexreb commented May 4, 2026

Summary

Closes #2229 (subset — see "Out of scope" below).

Several Megatron-LM infrastructure knobs were hard-coded inside
`nemo_rl/models/megatron/setup.py` even though the underlying
framework already supports them as configurable. Users running on
varied storage backends (NFS) or training very large models (e.g.
Nemotron 3 Super 120B) had to patch the source file to override them
— the workaround documented in the issue itself.

Expose four of those knobs as `NotRequired` fields on
`MegatronConfig`. When unset, the historical hard-coded behaviour
is preserved exactly, so existing user configs are untouched.

What's exposed

| Config key | Old hard-coded value | Why expose it |
| --- | --- | --- |
| `policy.megatron_cfg.distributed_timeout_minutes` | NCCL default (~10 min) | Initial weight conversion for 100B+ models on NFS exceeds the default and crashes `init_process_group` |
| `policy.megatron_cfg.async_save` | `False` | Non-blocking checkpoint writes; the Megatron-Core plumbing already exists, this just turns it on |
| `policy.megatron_cfg.check_for_nan_in_grad` | `True` | Can be disabled in steady-state production to reduce per-step overhead |
| `policy.megatron_cfg.logging_level` | `0` | Raise for debugging; previously not overridable |
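For reference, here is what the keys look like in a user YAML (key names are from this PR; the values are illustrative, not recommendations):

```yaml
policy:
  megatron_cfg:
    # All four keys are optional; omitting one keeps the old hard-coded value.
    distributed_timeout_minutes: 120   # default: NCCL's ~10 min
    async_save: true                   # default: false
    check_for_nan_in_grad: false       # default: true
    logging_level: 2                   # default: 0
```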

Wiring

  • `setup_distributed()` now accepts an optional policy config and threads `distributed_timeout_minutes` into `torch.distributed.init_process_group(timeout=...)`.
  • `_create_checkpoint_config()` accepts the policy config and reads `async_save`.
  • `_create_megatron_config()` reads `check_for_nan_in_grad` and `logging_level` from the `config` it already receives.
  • `MegatronPolicyWorker` passes `config` to `setup_distributed`.
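The timeout wiring above can be sketched as follows (a minimal stand-in; the function and constant names are illustrative, not the PR's exact code):

```python
from datetime import timedelta
from typing import Any, Optional

# Assumption for illustration: NCCL's default process-group timeout is ~10 minutes.
DEFAULT_PG_TIMEOUT_MINUTES = 10


def resolve_pg_timeout(config: Optional[dict[str, Any]]) -> timedelta:
    """Return the user-configured timeout if present, else the historical default."""
    if config is not None and "distributed_timeout_minutes" in config.get("megatron_cfg", {}):
        return timedelta(minutes=config["megatron_cfg"]["distributed_timeout_minutes"])
    return timedelta(minutes=DEFAULT_PG_TIMEOUT_MINUTES)


# The resolved value would then flow into
# torch.distributed.init_process_group(..., timeout=resolve_pg_timeout(config))
```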

Coverage

New tests in `tests/unit/models/megatron/test_megatron_setup.py`:

  • `TestSetupDistributedTimeout` — three test cases covering no-config / config-without-key / config-with-key. Patches `init_process_group` and asserts the right kwargs flow through.
  • `TestCreateCheckpointConfig` extended with three new tests for the `async_save` default-vs-override behaviour.

All marked `@pytest.mark.mcore` to fit the project's test-tier conventions.
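The shape of those timeout tests can be sketched like this (a simplified stand-in, not the repo's actual test code; the real tests patch `torch.distributed.init_process_group` instead of taking it as a parameter):

```python
from datetime import timedelta
from unittest import mock


def setup_distributed(init_process_group, config=None):
    """Simplified stand-in: only forwards the timeout kwarg when the key is set."""
    kwargs = {}
    if config is not None and "distributed_timeout_minutes" in config.get("megatron_cfg", {}):
        kwargs["timeout"] = timedelta(minutes=config["megatron_cfg"]["distributed_timeout_minutes"])
    init_process_group(backend="nccl", **kwargs)


# The three branches the tests cover:
pg = mock.Mock()
setup_distributed(pg)                                                          # no config
setup_distributed(pg, {"megatron_cfg": {}})                                    # config without key
setup_distributed(pg, {"megatron_cfg": {"distributed_timeout_minutes": 120}})  # config with key
assert pg.call_args_list[0] == mock.call(backend="nccl")
assert pg.call_args_list[1] == mock.call(backend="nccl")
assert pg.call_args_list[2] == mock.call(backend="nccl", timeout=timedelta(minutes=120))
```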

Backward compatibility

All four fields are typed `NotRequired` and the code uses explicit `if "key" in config["megatron_cfg"]:` checks (matching the existing pattern in this file for `force_reconvert_from_hf` and friends). When a key is absent, the value used is identical to the previous hard-coded constant. No exemplar YAML or recipe needs to change for this PR to be safe.
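The guard pattern in question can be sketched as (defaults match the old hard-coded constants; the helper name is hypothetical):

```python
from typing import Any

# Historical hard-coded values, used whenever the key is absent.
HARDCODED_DEFAULTS = {
    "async_save": False,
    "check_for_nan_in_grad": True,
    "logging_level": 0,
}


def read_knob(megatron_cfg: dict[str, Any], key: str) -> Any:
    """Explicit membership check, mirroring the file's existing style."""
    if key in megatron_cfg:
        return megatron_cfg[key]
    return HARDCODED_DEFAULTS[key]
```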

The `grpo_math_1B_megatron.yaml` exemplar is updated with commented-out examples so users can copy/paste the right keys.

Out of scope

`fully_parallel_save`, `fully_parallel_load`, and `load_rng` are also called out in #2229. They're referenced in two places — `_create_checkpoint_config` and `setup_reference_model_state` — and exposing them affects checkpoint-format and reproducibility behaviour that could nudge convergence tests. Worth a focused follow-up PR with its own test plan rather than bundling here.

The unconstrained `accelerate` dependency is being addressed in #2388.

Test plan

  • `ruff check` — clean
  • `ruff format` — clean
  • `python3 -m py_compile` on all touched files — clean
  • CI `L0` (`L0_Unit_Tests_Other.sh` runs `tests/unit/models/megatron/test_megatron_setup.py` under `--mcore-only`)
  • Spot-check on a real run: confirm a config with `distributed_timeout_minutes: 120` actually extends the `init_process_group` timeout (best done by the maintainers with the 120B repro from issue #2229, "Expose hardcoded Megatron infrastructure parameters to user config")

Refs: #2229

…NVIDIA-NeMo#2229)

Several Megatron-LM infrastructure knobs were hard-coded inside
``nemo_rl/models/megatron/setup.py`` even though the underlying
framework already supports them as configurable. Users running on
varied storage backends (NFS) or training very large models (e.g.
Nemotron 3 Super 120B) had to patch the source file to override them.

Expose the following four knobs as ``NotRequired`` fields on
``MegatronConfig``:

* ``distributed_timeout_minutes`` -- forwarded as ``timeout`` to
  ``torch.distributed.init_process_group``. The default NCCL timeout
  (~10 minutes) is too short for initial weight conversion of very
  large models on slow storage.
* ``async_save`` -- enables non-blocking checkpoint writes via
  Megatron-Core's existing async save plumbing.
* ``check_for_nan_in_grad`` -- toggles Megatron-LM's per-step
  NaN/Inf grad check; useful for debugging but adds non-trivial
  overhead in steady-state training.
* ``logging_level`` -- forwarded to Megatron-LM's ``LoggerConfig``.

When any key is absent the historical hard-coded behaviour is
preserved, so existing user configs continue to work unchanged.

Plumbing:

* ``setup_distributed()`` now optionally accepts the policy config
  and threads ``distributed_timeout_minutes`` through to
  ``init_process_group``.
* ``_create_checkpoint_config()`` accepts the policy config and reads
  ``async_save``.
* ``_create_megatron_config()`` reads ``check_for_nan_in_grad`` and
  ``logging_level`` from the existing ``config`` it already receives.
* ``MegatronPolicyWorker`` passes ``config`` to ``setup_distributed``.

Coverage:

* New ``TestSetupDistributedTimeout`` exercises the three timeout
  branches (no config / config without key / config with key).
* ``TestCreateCheckpointConfig`` gains three tests covering the
  ``async_save`` default and override behaviour.

Out of scope (intentionally left for a follow-up to keep this PR
reviewable): ``fully_parallel_save``, ``fully_parallel_load``, and
``load_rng``. These are referenced in two places (the main and
reference checkpoint configs) and changing them could affect
convergence-test behaviour, so they deserve their own focused PR.

Refs: NVIDIA-NeMo#2229
Signed-off-by: lonexreb <[email protected]>
@lonexreb lonexreb requested review from a team as code owners May 4, 2026 07:38
@copy-pr-bot

copy-pr-bot Bot commented May 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

lonexreb added a commit to lonexreb/RL that referenced this pull request May 4, 2026
…#2229)

Address review feedback on PR NVIDIA-NeMo#2391: the previous

    megatron_cfg = config["megatron_cfg"] if config and "megatron_cfg" in config else {}
    async_save = megatron_cfg.get("async_save", False) if megatron_cfg else False

was redundant: when ``megatron_cfg`` is the empty dict, ``.get()``
already returns the default, and the ``if megatron_cfg else default``
chain only differs from a bare ``.get()`` when ``megatron_cfg`` is
falsy, in which case it is the empty dict and ``.get()`` returns the
same default anyway. Simplify to four bare ``.get()`` calls and switch
the ``config`` truthiness check to an explicit ``config is not None``,
matching the pattern in PR NVIDIA-NeMo#2390 and aligning the two PRs' style.

Behaviour is identical: defaults match the historical hard-coded
values when keys are absent, and explicit user values flow through.
The existing tests in ``TestCreateCheckpointConfig`` continue to
exercise both branches.
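The claimed equivalence is easy to check exhaustively over the interesting inputs (function names here are illustrative):

```python
def async_save_old(config):
    # Original two-step form with the redundant `if megatron_cfg else` guard.
    megatron_cfg = config["megatron_cfg"] if config and "megatron_cfg" in config else {}
    return megatron_cfg.get("async_save", False) if megatron_cfg else False


def async_save_new(config):
    # Simplified form: bare .get() plus an explicit None check.
    megatron_cfg = (
        config["megatron_cfg"] if config is not None and "megatron_cfg" in config else {}
    )
    return megatron_cfg.get("async_save", False)


cases = (None, {}, {"megatron_cfg": {}}, {"megatron_cfg": {"async_save": True}})
assert all(async_save_old(c) == async_save_new(c) for c in cases)
```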

Refs: NVIDIA-NeMo#2229
Signed-off-by: lonexreb <[email protected]>
@svcnvidia-nemo-ci svcnvidia-nemo-ci added the waiting-on-maintainers Waiting on maintainers to respond label May 6, 2026

Labels

community-request waiting-on-maintainers Waiting on maintainers to respond

Development

Successfully merging this pull request may close these issues.

Expose hardcoded Megatron infrastructure parameters to user config

2 participants