Hi everyone,
Recently I have found that when training with multiple GPUs using DataParallelTable, the performance is slightly worse than when training with a single GPU. It seems the problem is that the running mean and variance of batch normalization are not synchronized across the GPUs. Does anybody know a solution to this issue? Thank you very much.
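For context, here is a minimal sketch of one possible workaround I have been considering: averaging the BatchNorm running statistics over the per-GPU replicas before evaluation. It assumes you can get a Lua table `replicas` holding each GPU's copy of the network (how to obtain them depends on your DataParallelTable setup), and that your nn version exposes `running_mean` / `running_var` on the BN modules (older versions used `running_std` instead). This is only an illustration of the idea, not a tested fix.

```lua
-- Sketch: average BatchNorm running statistics across GPU replicas.
-- `replicas` is assumed to be a table of per-GPU copies of the same network.
local function syncBatchNormStats(replicas)
   -- Collect the BN modules of every replica, in matching order.
   local bnPerReplica = {}
   for i, net in ipairs(replicas) do
      bnPerReplica[i] = net:findModules('nn.SpatialBatchNormalization')
   end
   local nBN = #bnPerReplica[1]
   for b = 1, nBN do
      -- Accumulate the statistics on the CPU to avoid cross-device adds.
      local meanAvg = bnPerReplica[1][b].running_mean:float():zero()
      local varAvg  = bnPerReplica[1][b].running_var:float():zero()
      for i = 1, #replicas do
         meanAvg:add(bnPerReplica[i][b].running_mean:float())
         varAvg:add(bnPerReplica[i][b].running_var:float())
      end
      meanAvg:div(#replicas)
      varAvg:div(#replicas)
      -- Write the averaged statistics back into every replica.
      for i = 1, #replicas do
         bnPerReplica[i][b].running_mean:copy(meanAvg)
         bnPerReplica[i][b].running_var:copy(varAvg)
      end
   end
end
```

Of course this only makes the statistics consistent after the fact; it does not synchronize the batch statistics used during the forward pass, which may be the real source of the gap. Any pointers would be appreciated.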