Problem
I have observed a couple of memory logging/profiling issues in the past:
- Memory usage reporting in the log file does not reflect peak usage.
- Memory snapshot is broken.
- TensorBoard memory logging runs on the last PP rank, so it cannot reflect the proper peak.
Expected behavior
Memory snapshot should be functional in normal runs.
- Bonus point: it should also work on a crash such as OOM, or better still on other types of crash, to help debugging.
Logging should help the user understand how much headroom remains before OOM.
- Report memory usage for at least the second iteration as well, not only the first.
- Use the first PP rank for the TB logger.
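A minimal sketch of the per-iteration peak reporting described above. It is an illustration only, not the recipe's actual code: `PeakMemoryTracker`, `end_iteration`, `overall_peak`, and `headroom` are hypothetical names, and the memory readers are passed in as callables so the sketch runs without a GPU.

```python
from typing import Callable, List


class PeakMemoryTracker:
    """Records the peak memory of each iteration separately (hypothetical helper).

    `read_peak` returns the peak bytes since the last reset and `reset_peak`
    clears that counter. Both are injected, which is an assumption made here
    so the example does not depend on any particular backend.
    """

    def __init__(self, read_peak: Callable[[], int], reset_peak: Callable[[], None]):
        self._read_peak = read_peak
        self._reset_peak = reset_peak
        self.per_iteration_peaks: List[int] = []

    def end_iteration(self) -> int:
        """Call once at the end of every training iteration."""
        peak = self._read_peak()
        self.per_iteration_peaks.append(peak)
        # Reset so the next iteration reports its own peak, not a stale one.
        self._reset_peak()
        return peak

    def overall_peak(self) -> int:
        """Highest peak seen across all recorded iterations (0 if none)."""
        return max(self.per_iteration_peaks, default=0)

    def headroom(self, capacity_bytes: int) -> int:
        """Bytes left between the overall peak and the device capacity."""
        return capacity_bytes - self.overall_peak()
```

In a PyTorch run, the two callables could be `torch.cuda.max_memory_allocated` and `torch.cuda.reset_peak_memory_stats`, with `end_iteration()` called after each step so that the second iteration's peak is logged alongside the first.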
Affected area
area:recipe
area:perf