numerical instability for Adam and Adadelta optimizer #1767

@xuancong84

With the Adam and Adadelta optimizers, when the model is close to convergence, the accuracy often drops suddenly to 0 and the perplexity becomes NaN, as shown below:

Epoch 3, 251750/348124; acc: 70.47; ppl: 3.77; 3911 tok/s; lr: 0.0010000; 717152.5 s elapsed
Epoch 3, 251800/348124; acc: 71.91; ppl: 3.53; 3796 tok/s; lr: 0.0010000; 717190.5 s elapsed
Epoch 3, 251850/348124; acc: 71.03; ppl: 3.58; 3752 tok/s; lr: 0.0010000; 717227.2 s elapsed
Epoch 3, 251900/348124; acc: 69.85; ppl: 3.86; 3830 tok/s; lr: 0.0010000; 717266.6 s elapsed
Epoch 3, 251950/348124; acc: 70.55; ppl: 3.73; 3930 tok/s; lr: 0.0010000; 717302.3 s elapsed
Epoch 3, 252000/348124; acc: 69.78; ppl: 4.03; 3912 tok/s; lr: 0.0010000; 717340.9 s elapsed
Epoch 3, 252050/348124; acc: 69.01; ppl: 4.18; 2699 tok/s; lr: 0.0010000; 717392.5 s elapsed
Epoch 3, 252100/348124; acc: 70.09; ppl: 3.90; 3935 tok/s; lr: 0.0010000; 717429.4 s elapsed
Epoch 3, 252150/348124; acc: 69.48; ppl: 4.18; 3758 tok/s; lr: 0.0010000; 717463.5 s elapsed
Epoch 3, 252200/348124; acc: 26.95; ppl: nan; 3753 tok/s; lr: 0.0010000; 717506.3 s elapsed
Epoch 3, 252250/348124; acc: 0.00; ppl: nan; 3925 tok/s; lr: 0.0010000; 717546.5 s elapsed
Epoch 3, 252300/348124; acc: 0.00; ppl: nan; 3822 tok/s; lr: 0.0010000; 717584.6 s elapsed
Epoch 3, 252350/348124; acc: 0.00; ppl: nan; 3813 tok/s; lr: 0.0010000; 717622.8 s elapsed
Epoch 3, 252400/348124; acc: 0.00; ppl: nan; 3677 tok/s; lr: 0.0010000; 717661.0 s elapsed
Epoch 3, 252450/348124; acc: 0.00; ppl: nan; 3999 tok/s; lr: 0.0010000; 717699.2 s elapsed
Epoch 3, 252500/348124; acc: 0.00; ppl: nan; 3939 tok/s; lr: 0.0010000; 717738.1 s elapsed
Epoch 3, 252550/348124; acc: 0.00; ppl: nan; 3872 tok/s; lr: 0.0010000; 717771.3 s elapsed

I ran OpenNMT-py on a large dataset of 16M parallel sentences (United Nations Parallel Corpus v1.0). The problem occurs with Adam and Adadelta, both of which involve division in their update rules; so far I have not seen it with SGD. I suggest the developers check for division by zero in the Adam and Adadelta optimizers, and probably in others as well.
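To make the suspected division concrete, here is a minimal scalar sketch of the textbook Adam update. It is not OpenNMT-py's or PyTorch's actual implementation, and the function name `adam_step` and its defaults are assumptions chosen for illustration. It shows where the division happens, how `eps` keeps the denominator nonzero, and one possible guard that skips the update when the gradient is already non-finite. Whether a zero denominator is really the cause of the NaN above has not been confirmed.

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One scalar Adam update (textbook form). Returns (param, m, v).

    Hypothetical helper for illustration only; t is the 1-based step count.
    """
    if not math.isfinite(grad):
        # Assumed guard: skip the step so NaN/inf does not enter m and v,
        # where it would persist in every later update.
        return param, m, v
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    # The division in question: with eps == 0 and v_hat == 0 the
    # denominator is exactly zero.
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# A zero gradient on the first step is harmless when eps > 0.
p, m, v = adam_step(0.5, 0.0, 0.0, 0.0, t=1)

# A NaN gradient leaves the parameter and both moments untouched.
p_nan, m_nan, v_nan = adam_step(0.5, float("nan"), 0.1, 0.2, t=3)
```

With `eps=0` and a zero gradient, the same call computes 0/0. Plain Python raises `ZeroDivisionError` in that case, whereas IEEE 754 floating-point arithmetic in tensor libraries yields NaN, which would then propagate into the parameters and every later loss value.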
