[bug] Fix memory logging/profiling #3164

@dingqingy-nv

Description

Problem

I have observed a couple of memory logging/profiling issues in the past:

  • Memory usage reported in the log file does not reflect peak usage.
  • Memory snapshot is broken.
  • TensorBoard memory logging runs on the last PP rank, so it cannot reflect the true peak.

Expected behavior

Memory snapshot should be functional in normal runs.

  • Bonus: it should also work when a run crashes with OOM, or better still with other types of crashes, to help debugging.
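
One possible way to get a snapshot even when a run crashes is to dump it from a `finally` block around the training loop. The sketch below is hypothetical and is not the repository's implementation: `snapshot_on_exit`, `start_recording`, and `dump_snapshot` are illustrative names. In a PyTorch run the two callables could plausibly be backed by `torch.cuda.memory._record_memory_history` and `torch.cuda.memory._dump_snapshot`, which are private APIs and may change between releases.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def snapshot_on_exit(
    start_recording: Callable[[], None],
    dump_snapshot: Callable[[str], None],
    path: str,
) -> Iterator[None]:
    """Record memory history for the wrapped block and always dump it.

    Hypothetical helper: the recording and dumping functions are injected,
    so this sketch does not assume any specific framework API.
    """
    start_recording()
    try:
        yield
    finally:
        # Runs on normal exit and on any Python exception (including an
        # out-of-memory error), before the exception propagates further.
        dump_snapshot(path)
```

This only covers failures that surface as Python exceptions. A hard crash such as a segfault or a killed process would skip the `finally` block, so it would need a different mechanism.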

Logging should help users understand how much headroom remains before OOM:

  • Report memory usage over a window that includes at least the second iteration.
  • Use the first PP rank for the TB logger.
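
As a rough illustration of the logging behavior requested above, the sketch below keeps the maximum peak seen across all iterations in a window instead of reading a single step, and reports only on the first pipeline stage. All names (`PeakMemoryTracker`, `pp_rank`, `capacity_bytes`) are hypothetical. In a PyTorch run the per-iteration value could come from `torch.cuda.max_memory_allocated()` and the capacity from `torch.cuda.mem_get_info()`, but that wiring is an assumption and not something this issue specifies.

```python
from dataclasses import dataclass
from typing import Optional

GIB = 1024 ** 3


@dataclass
class PeakMemoryTracker:
    """Hypothetical tracker that keeps the max peak seen across iterations."""

    capacity_bytes: int  # total device memory (assumed known by the caller)
    pp_rank: int         # pipeline-parallel rank of this process
    peak_bytes: int = 0

    def update(self, iteration_peak_bytes: int) -> None:
        # Keep the largest per-iteration peak, so a later iteration
        # (for example the second one) is not missed.
        self.peak_bytes = max(self.peak_bytes, iteration_peak_bytes)

    def headroom_bytes(self) -> int:
        # Remaining memory before the device would run out.
        return self.capacity_bytes - self.peak_bytes

    def report(self) -> Optional[str]:
        # Only the first PP rank reports; other ranks stay silent.
        if self.pp_rank != 0:
            return None
        return (
            f"peak={self.peak_bytes / GIB:.2f} GiB "
            f"headroom={self.headroom_bytes() / GIB:.2f} GiB"
        )
```

A caller would invoke `update` once per iteration and pass the string from `report` to the log file or TB logger whenever it is not `None`.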

Affected area

area:recipe
area:perf

Metadata

Labels

bug