When using gluon.nn.BatchNorm(scale=False) on GPU, the computed gradient for beta is incorrect: it appears to be accumulated across iterations instead of being reset. With scale=True, or when running on CPU, the gradient is correct.
This bug can make a network hard to converge during training.
Environment info (Required)
CentOS Linux release 7.2.1511 (Core)
GTX 1080Ti
Driver Version: 384.69
CUDA Version 9.0.176
installed with pip:
numpy 1.17.2
mxnet-cu90 1.5.0
Code
In this example, the grad of beta should be [1, 1, 1] at every iteration.
import mxnet as mx
from mxnet import gluon, autograd

ctx = mx.gpu()
x = mx.nd.ones((1, 3, 1, 1), ctx=ctx)
net = gluon.nn.BatchNorm(scale=False, epsilon=2e-5, momentum=0.0)
net.initialize(ctx=ctx)
trainer = gluon.Trainer(params=net.collect_params(),
                        optimizer='sgd',
                        optimizer_params={'learning_rate': 0.01, 'wd': 0.0005, 'momentum': 0.9})
net.hybridize()
for i in range(10):
    with autograd.record():
        out = net(x)
    out.backward()
    trainer.step(x.shape[0])
    for name, param in net.collect_params().items():
        if 'beta' in name:
            print(name, param.grad(ctx).asnumpy())
Output:
batchnorm0_beta [1. 1. 1.]
batchnorm0_beta [2. 2. 2.]
batchnorm0_beta [3. 3. 3.]
batchnorm0_beta [4. 4. 4.]
batchnorm0_beta [5. 5. 5.]
batchnorm0_beta [6. 6. 6.]
batchnorm0_beta [7. 7. 7.]
batchnorm0_beta [8. 8. 8.]
batchnorm0_beta [9. 9. 9.]
batchnorm0_beta [10. 10. 10.]