Hello,
I've encountered an issue where the loss turns to NaN after several hundred ticks when I enable mixed-precision training with the --fp16=True flag. I also noticed this line in the code:
```python
loss_scaling = 1,  # Loss scaling factor for reducing FP16 under/overflows.
```
Should I also adjust the --ls option for loss scaling in conjunction with the --fp16=True flag? If so, could you advise what value the loss scaling factor should be set to under these conditions?
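For context, my understanding of "loss scaling" mostly comes from PyTorch's dynamic variant (torch.cuda.amp.GradScaler), which adjusts the factor automatically rather than fixing it at a constant like 1. A minimal sketch of the pattern I mean, using a placeholder model and optimizer rather than this repo's actual training loop:

```python
import torch

# Placeholder model/optimizer, just to illustrate the pattern.
model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()  # dynamic loss scaling

for step in range(100):
    x = torch.randn(8, 512, device='cuda')
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():    # forward pass runs in FP16 where safe
        loss = model(x).square().mean()
    scaler.scale(loss).backward()      # backward on the scaled loss
    scaler.step(optimizer)             # unscales grads; skips the step on inf/NaN
    scaler.update()                    # grows/shrinks the scale factor over time
```

I'm not sure whether loss_scaling in this codebase is a fixed factor or something adaptive like the above, hence the question about what value to use.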
Additionally, are there other settings that should be configured to get the most out of mixed-precision training? For example, should the learning rate be adjusted together with the loss scaling factor?
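My (possibly wrong) mental model, and the reason I ask about the learning rate, is that with static loss scaling the factor is divided back out of the gradients before the optimizer step, so the effective update, and therefore the learning rate, should be unchanged. A sketch of that assumption, with a made-up loss_scale value of 128 (purely illustrative, not a recommendation):

```python
import torch

model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_scale = 128.0  # made-up static value, purely illustrative

x = torch.randn(8, 512, device='cuda')
optimizer.zero_grad(set_to_none=True)
loss = model(x).square().mean()
(loss * loss_scale).backward()     # scale up so small FP16 grads don't flush to zero
for p in model.parameters():
    if p.grad is not None:
        p.grad.div_(loss_scale)    # divide the scale back out before stepping
optimizer.step()                   # learning rate stays as-is if this unscaling happens
```

Please correct me if this codebase applies the scale differently.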
If possible, could you share the commands or configuration that you typically use for mixed-precision training?
Thank you very much in advance!