[training] fix: aggregate TB memory metrics across PP group#3645
Open
lonexreb wants to merge 2 commits into NVIDIA-NeMo:main from
Conversation
Memory metrics emitted to TensorBoard / W&B / MLFlow / Comet are written only by the last rank (`world_size - 1`), but with pipeline parallelism peak GPU memory typically lives on the first PP stage (activation buildup). Without aggregation the dashboards reflect only the last PP stage and under-report true peak headroom.

Add `reduce_max_memory_across_pp_group` and call it from `training_log` on every PP rank before the writer-gated section, so the writer rank emits the per-metric MAX across the pipeline. The reduction is a single bulk all-reduce gated by the existing `log_memory_to_tensorboard` config flag plus the tensorboard log interval, so the cost is one collective per logged step on PP > 1, and a no-op otherwise. Counter-style integer metrics (e.g. `alloc_retries`) are preserved as `int` so existing dashboards continue to render them correctly. The helper no-ops when distributed is uninitialized, the report is empty, or the PP group has a single rank, keeping unit-test paths and single-stage runs unaffected.

Adds unit tests for the helper covering the no-op fallbacks (empty report, dist not initialized, pp size == 1, group missing `.size`), the happy-path bulk MAX reduction, and the int-preservation contract.

Refs issue NVIDIA-NeMo#3167 (and partially addresses the TB-rank concern in NVIDIA-NeMo#3164).

Signed-off-by: lonexreb <[email protected]>
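For illustration, a minimal sketch of how such a helper can be implemented. This is not the PR's actual code: the device handling, key packing, and passing the PP group in explicitly are assumptions; only the name `reduce_max_memory_across_pp_group` and the MAX / int-preservation behavior come from the description above.

```python
# Sketch only -- assumes the PP group is passed in explicitly; the real helper
# may obtain it from the framework's parallel-state module instead.
from typing import Dict, Union

import torch
import torch.distributed as dist


def reduce_max_memory_across_pp_group(
    memory_report: Dict[str, Union[int, float]],
    pp_group=None,
) -> Dict[str, Union[int, float]]:
    """Return a copy of memory_report whose values are the MAX across the PP group."""
    # No-op fallbacks keep unit-test paths and single-stage runs unaffected.
    if not memory_report or not dist.is_initialized():
        return memory_report
    if pp_group is None or not callable(getattr(pp_group, "size", None)) or pp_group.size() == 1:
        return memory_report

    keys = sorted(memory_report)
    device = (torch.device("cuda", torch.cuda.current_device())
              if torch.cuda.is_available() else torch.device("cpu"))
    # Pack everything into one float64 tensor: a single collective per logged
    # step instead of one all-reduce per metric.
    packed = torch.tensor([float(memory_report[k]) for k in keys],
                          dtype=torch.float64, device=device)
    dist.all_reduce(packed, op=dist.ReduceOp.MAX, group=pp_group)

    reduced: Dict[str, Union[int, float]] = {}
    for key, value in zip(keys, packed.tolist()):
        # Counter-style integer metrics (e.g. alloc_retries) stay ints so
        # existing dashboards keep rendering them correctly.
        reduced[key] = int(value) if isinstance(memory_report[key], int) else value
    return reduced
```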
Contributor
/ok to test 2f7f48f
cuichenx
previously approved these changes
May 5, 2026
Contributor
cuichenx
left a comment
LGTM, @dingqingy-nv could you also review?
Contributor
/claude review
Contributor
Light Code Review: LGTM — clean implementation with solid test coverage. No perf tests impacted.
dingqingy-nv
previously approved these changes
May 5, 2026
Signed-off-by: Chen Cui <[email protected]>
Contributor
/ok to test 1f86d45
Summary
- Aggregate the memory metrics from `report_memory(...)` across the pipeline-parallel group with MAX before the writer rank emits.
- Add a `reduce_max_memory_across_pp_group` helper and call it from `training_log` on every PP rank, gated by the existing `log_memory_to_tensorboard` config flag plus the tensorboard log interval — single bulk all-reduce per logged step on PP > 1, no-op otherwise.
- Counter-style integer metrics (e.g. `alloc_retries`) are preserved as `int` so existing dashboards continue to render correctly.
- Refs #3167. Partially addresses the TB-rank concern in #3164.
Background
`tensorboard_logger`, `wandb_logger`, `mlflow_logger`, and `comet_logger` are lazily initialized only on the last rank (`world_size - 1`), see `state.py:182-228`. With pipeline parallelism, that's the last PP rank — but peak GPU memory typically lives on the first PP stage (activation buildup), so the dashboards under-report true peak headroom and obscure how close a job is to OOM.

The cleanest fix is to aggregate per-rank `torch.cuda.memory_stats()` across the PP group with MAX, so the writer rank always emits the per-metric peak across the entire pipeline — strictly more useful than picking any single PP rank.
Implementation Notes

- The reduction runs on every PP rank when `log_memory_to_tensorboard=True` (a config flag, identical on all ranks) and the iteration matches the tensorboard log interval. The existing `loggers_exist`-gated section then consumes the precomputed dict on the writer rank only (see the call-site sketch below).
- Single bulk all-reduce over one `float64` tensor instead of one collective per metric — minimizes overhead.
- Defensive no-op when `memory_report` is empty, `torch.distributed` is uninitialized, the PP group has size 1, or the group object lacks a callable `.size`. Keeps unit-test paths and single-stage runs unaffected.
- The input `memory_report` dict is not mutated.
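As a rough illustration of this gating, a simplified, hypothetical shape of the call site. The function name `maybe_log_memory`, the `args` attribute names, the metric names, and the writer handling are placeholders, not the actual `training_log` code:

```python
# Hypothetical call-site shape; names are placeholders for whatever
# training_log actually uses.
import torch


def maybe_log_memory(args, iteration, writer, pp_group):
    if not (args.log_memory_to_tensorboard
            and iteration % args.tensorboard_log_interval == 0):
        return
    # Every PP rank builds its local report and joins the bulk MAX all-reduce,
    # so the writer rank ends up holding the pipeline-wide peaks.
    stats = torch.cuda.memory_stats()
    report = {
        "peak_allocated_bytes": float(stats.get("allocated_bytes.all.peak", 0)),
        "peak_reserved_bytes": float(stats.get("reserved_bytes.all.peak", 0)),
        "alloc_retries": int(stats.get("num_alloc_retries", 0)),
    }
    report = reduce_max_memory_across_pp_group(report, pp_group)  # helper sketched earlier
    # Stand-in for the existing loggers_exist gate: only the writer rank
    # (last rank) has a writer attached.
    if writer is not None:
        for name, value in report.items():
            writer.add_scalar(f"memory/{name}", value, iteration)
```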
Test plan

- `python3 -m ast` parse of changed files
- `ruff check` clean on changed files
- `ruff format --check` clean on changed files
- New unit tests in `TestReduceMaxMemoryAcrossPpGroup`:
  - `pp.size() == 1` → no-op (no all-reduce call)
  - group object without a callable `.size` → defensive no-op
  - integer counters (e.g. `alloc_retries`) preserved as `int`
- Existing `test_memory_tensorboard_logging` continues to pass (helper short-circuits when `torch.distributed.is_initialized()` is `False` in unit tests)
- `memory/peak_*` scalars now reflect the per-metric max across pipeline stages instead of the last stage only
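The style of test this implies might look roughly like the following, written against the helper sketched earlier; the PR's actual test names, fixtures, and mocking strategy may differ:

```python
# Rough pytest-style sketch of the no-op and int-preservation checks,
# assuming reduce_max_memory_across_pp_group from the earlier sketch is in scope.
from unittest import mock


def test_noop_when_dist_not_initialized():
    report = {"peak_allocated_bytes": 123.0, "alloc_retries": 2}
    with mock.patch("torch.distributed.is_initialized", return_value=False):
        assert reduce_max_memory_across_pp_group(report, pp_group=None) == report


def test_noop_when_pp_size_is_one():
    group = mock.Mock()
    group.size.return_value = 1
    with mock.patch("torch.distributed.is_initialized", return_value=True), \
         mock.patch("torch.distributed.all_reduce") as all_reduce:
        out = reduce_max_memory_across_pp_group({"alloc_retries": 2}, pp_group=group)
    all_reduce.assert_not_called()
    assert out == {"alloc_retries": 2}


def test_int_metrics_stay_ints():
    group = mock.Mock()
    group.size.return_value = 2
    # all_reduce is mocked out, so values pass through unchanged; the point
    # here is the dtype contract, not the reduction itself.
    with mock.patch("torch.distributed.is_initialized", return_value=True), \
         mock.patch("torch.distributed.all_reduce"):
        out = reduce_max_memory_across_pp_group({"alloc_retries": 3}, pp_group=group)
    assert isinstance(out["alloc_retries"], int) and out["alloc_retries"] == 3
```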
Risk

Low. The change is isolated to `training_log`'s memory-logging block: it is a no-op when `log_memory_to_tensorboard=False`.