Hello,
I've encountered an issue where the loss turns to NaN after several hundred ticks when I enable mixed-precision training with the --fp16=True flag. I also noticed this line in the code:
```python
loss_scaling = 1,  # Loss scaling factor for reducing FP16 under/overflows.
```
Should I also adjust the --ls option for loss scaling in conjunction with the --fp16=True flag? If so, could you advise what value the loss scaling factor should be set to under these conditions?
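For context, my understanding of "loss scaling" mostly comes from PyTorch's dynamic variant (torch.cuda.amp.GradScaler), which adjusts the factor automatically rather than fixing it at a constant like 1. A minimal sketch of the pattern I mean, using a placeholder model and optimizer rather than this repo's actual training loop:

```python
import torch

# Placeholder model/optimizer, just to illustrate the pattern.
model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()  # dynamic loss scaling

for step in range(100):
    x = torch.randn(8, 512, device='cuda')
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():    # forward pass runs in FP16 where safe
        loss = model(x).square().mean()
    scaler.scale(loss).backward()      # backward on the scaled loss
    scaler.step(optimizer)             # unscales grads; skips the step on inf/NaN
    scaler.update()                    # grows/shrinks the scale factor over time
```

I'm not sure whether loss_scaling in this codebase is a fixed factor or something adaptive like the above, hence the question about what value to use.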
Additionally, are there other settings that should be configured to get the most out of mixed-precision training? For example, should the learning rate be adjusted together with the loss scaling factor?
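My (possibly wrong) mental model, and the reason I ask about the learning rate, is that with static loss scaling the factor is divided back out of the gradients before the optimizer step, so the effective update, and therefore the learning rate, should be unchanged. A sketch of that assumption, with a made-up loss_scale value of 128 (purely illustrative, not a recommendation):

```python
import torch

model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_scale = 128.0  # made-up static value, purely illustrative

x = torch.randn(8, 512, device='cuda')
optimizer.zero_grad(set_to_none=True)
loss = model(x).square().mean()
(loss * loss_scale).backward()     # scale up so small FP16 grads don't flush to zero
for p in model.parameters():
    if p.grad is not None:
        p.grad.div_(loss_scale)    # divide the scale back out before stepping
optimizer.step()                   # learning rate stays as-is if this unscaling happens
```

Please correct me if this codebase applies the scale differently.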
If possible, could you share the commands or configuration that you typically use for mixed-precision training?
Thank you very much in advance!