track_moe_metrics() does not forward MoE metrics to Comet ML #2989

@LoganVegnaSHOP

Description

Summary

PR #2910 added excellent Comet ML support to Megatron-Bridge with 18+ metric call sites in training_log() (thank you for that!). However, MoE-specific metrics (load balancing loss, z-loss, etc.) are not forwarded to Comet ML because track_moe_metrics() in Megatron-LM's moe_utils.py only writes to TensorBoard and W&B writers.

Current Behavior

In train_utils.py, training_log() calls track_moe_metrics() which accepts writer (TensorBoard) and wandb_writer parameters but has no comet_logger parameter. The MoE metrics are computed and reduced correctly but only written to TB/W&B:

track_moe_metrics(
    loss_scale=moe_loss_scale,
    iteration=iteration,
    writer=writer,
    wandb_writer=wandb_writer,
    total_loss_dict=total_loss_dict,
    ...
)

The comet_logger is available in the same scope but not used.
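Until the upstream signature changes, the metrics could be forwarded from training_log() itself, since track_moe_metrics() reduces them into total_loss_dict. A minimal sketch of that workaround follows; the helper name, the MOE_METRIC_NAMES tuple, and the assumption that comet_logger exposes Comet ML's Experiment.log_metrics() API are all illustrative, not the actual Megatron-Bridge code:

```python
# Workaround sketch: after track_moe_metrics() has reduced the MoE losses
# into total_loss_dict, forward them to Comet ML from training_log().
# MOE_METRIC_NAMES and forward_moe_metrics_to_comet are hypothetical names.

MOE_METRIC_NAMES = (
    "load_balancing_loss",
    "seq_load_balancing_loss",
    "global_load_balancing_loss",
    "z_loss",
)

def forward_moe_metrics_to_comet(comet_logger, total_loss_dict, iteration):
    """Log any MoE metrics present in total_loss_dict to Comet ML."""
    if comet_logger is None:
        return
    moe_metrics = {
        name: float(value)
        for name, value in total_loss_dict.items()
        if name in MOE_METRIC_NAMES
    }
    if moe_metrics:
        # comet_ml.Experiment.log_metrics(dict, step=...) batches the writes
        comet_logger.log_metrics(moe_metrics, step=iteration)
```

This keeps the fix local to Megatron-Bridge, at the cost of duplicating the metric-name list that moe_utils.py already knows about.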

Expected Behavior

MoE metrics (load_balancing_loss, seq_load_balancing_loss, global_load_balancing_loss, z_loss) should be forwarded to Comet ML alongside TB/W&B, matching how all other metrics in training_log() are dispatched to all three logging backends.

Environment
