Summary
PR #2910 added excellent Comet ML support to Megatron-Bridge, with 18+ metric call sites in training_log() (thank you for that!). However, MoE-specific metrics (load balancing loss, z-loss, etc.) are not forwarded to Comet ML, because track_moe_metrics() in Megatron-LM's moe_utils.py writes only to the TensorBoard and W&B writers.
Current Behavior
In train_utils.py, training_log() calls track_moe_metrics(), which accepts writer (TensorBoard) and wandb_writer parameters but has no comet_logger parameter. The MoE metrics are computed and reduced correctly, but they are written only to TB/W&B:
track_moe_metrics(
    loss_scale=moe_loss_scale,
    iteration=iteration,
    writer=writer,
    wandb_writer=wandb_writer,
    total_loss_dict=total_loss_dict,
    ...
)
The comet_logger is available in the same scope but not used.
Expected Behavior
MoE metrics (load_balancing_loss, seq_load_balancing_loss, global_load_balancing_loss, z_loss) should be forwarded to Comet ML alongside TB/W&B, matching how all other metrics in training_log() are dispatched to all three logging backends.
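One possible shape for the forwarding step is sketched below. It is only an illustration under stated assumptions: the helper name forward_moe_metrics_to_comet, the "moe/" prefix, and the idea that the reduced MoE values are available as a plain name-to-scalar mapping are all hypothetical and not taken from Megatron-Bridge or Megatron-LM. It also assumes comet_logger exposes log_metrics(dict, step=...), as comet_ml.Experiment does.

```python
from typing import Any, Mapping, Optional

# Metric names as listed in this issue; the set actually present depends on
# which auxiliary losses are enabled for the run.
MOE_METRIC_NAMES = (
    "load_balancing_loss",
    "seq_load_balancing_loss",
    "global_load_balancing_loss",
    "z_loss",
)


def forward_moe_metrics_to_comet(
    comet_logger: Optional[Any],
    moe_metrics: Mapping[str, Any],
    iteration: int,
    prefix: str = "moe/",  # assumed naming scheme, not an existing convention
) -> dict:
    """Hypothetical helper: send already-reduced MoE metrics to Comet ML.

    Assumes `comet_logger` has a `log_metrics(dict, step=...)` method and
    that `moe_metrics` maps metric names to Python numbers or to 0-d
    tensors that provide `.item()`.
    """
    payload = {}
    for name in MOE_METRIC_NAMES:
        if name not in moe_metrics:
            continue  # this metric was not tracked in the current run
        value = moe_metrics[name]
        # Convert 0-d tensors to plain floats without importing torch here.
        payload[prefix + name] = float(value.item() if hasattr(value, "item") else value)

    # Skip the call when Comet is disabled or there is nothing to log.
    if comet_logger is not None and payload:
        comet_logger.log_metrics(payload, step=iteration)
    return payload
```

A helper like this could be called from training_log() next to the existing track_moe_metrics() call, or the same logic could live inside track_moe_metrics() behind a new optional parameter. Which of the two fits better depends on whether the change should land in Megatron-Bridge or in Megatron-LM.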
Environment
combined/all-fixes branch (post PR #2910, "feat: Add first-class Comet ML experiment tracking")
f26190677
nvcr.io/nvidia/nemo:26.02.nemotron_3_super