Description
🐛 Bug
I have adapted the mnist.cpp example to work with two GPUs using torch::nn::parallel::data_parallel. However, the loss does not decrease, and the accuracy never improves beyond 0.114 (the original single-GPU example reaches around 0.99 accuracy).
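For reference, the relevant change is roughly the following. This is a minimal sketch rather than the attached file verbatim: the Net module here is a simplified stand-in for the convolutional network in the upstream mnist.cpp example, and the dataset path, batch size, and two-GPU device list are assumptions.

```cpp
#include <torch/torch.h>

#include <memory>
#include <utility>
#include <vector>

// Simplified stand-in for the Net module from the upstream mnist.cpp example.
struct Net : torch::nn::Module {
  Net() : fc1(784, 128), fc2(128, 10) {
    register_module("fc1", fc1);
    register_module("fc2", fc2);
  }
  torch::Tensor forward(torch::Tensor x) {
    x = x.view({x.size(0), 784});
    x = torch::relu(fc1->forward(x));
    return torch::log_softmax(fc2->forward(x), /*dim=*/1);
  }
  torch::nn::Linear fc1, fc2;
};

int main() {
  torch::Device device(torch::kCUDA, 0);
  auto model = std::make_shared<Net>();
  model->to(device);

  // The two GPUs that data_parallel should replicate the module onto.
  std::vector<torch::Device> devices = {torch::Device(torch::kCUDA, 0),
                                        torch::Device(torch::kCUDA, 1)};

  auto dataset = torch::data::datasets::MNIST("./data")
                     .map(torch::data::transforms::Normalize<>(0.1307, 0.3081))
                     .map(torch::data::transforms::Stack<>());
  auto loader =
      torch::data::make_data_loader(std::move(dataset), /*batch_size=*/128);

  torch::optim::SGD optimizer(model->parameters(),
                              torch::optim::SGDOptions(0.01).momentum(0.5));

  for (auto& batch : *loader) {
    auto data = batch.data.to(device);
    auto targets = batch.target.to(device);
    optimizer.zero_grad();
    // Scatter the batch along dim 0 across `devices`, run the replicas,
    // and gather the outputs back on GPU 0.
    auto output = torch::nn::parallel::data_parallel(
        model, data, devices, /*output_device=*/device);
    auto loss = torch::nll_loss(output, targets);
    loss.backward();
    optimizer.step();
  }
}
```

The only difference from the single-GPU example is that the forward pass goes through torch::nn::parallel::data_parallel instead of calling model->forward(data) directly; the optimizer still holds the parameters of the original module on GPU 0.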
To Reproduce
Steps to reproduce the behavior:
- Compile and run the attached example (rename mnist_parallel.cpp.txt to mnist_parallel.cpp).
mnist_parallel.cpp.txt
CMakeLists.txt
Expected behavior
I would expect the network's accuracy to improve during training, as it does in the single-GPU example, but it does not.
Environment
PyTorch version: 1.0.1
Is debug build: No
CUDA used to build PyTorch: 10.0.130
OS: Ubuntu 16.04.6 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
CMake version: version 3.14.0
Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration:
GPU 0: GeForce GTX 1080 Ti
GPU 1: GeForce GTX 1080 Ti
Nvidia driver version: 410.104
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.5.0
/usr/local/cuda-10.0/targets/x86_64-linux/lib/libcudnn.so.7
Versions of relevant libraries:
[pip3] numpy==1.15.0
[conda] blas 1.0 mkl
[conda] magma-cuda10 2.4.0 1 cpbotha
[conda] magma-cuda100 2.5.0 1 pytorch
[conda] magma-cuda90 2.5.0 1 pytorch
[conda] mkl 2019.3 199
[conda] mkl-include 2019.3 199
[conda] mkl-service 1.1.2 py36he904b0f_5
[conda] mkl_fft 1.0.10 py36ha843d7b_0
[conda] mkl_random 1.0.2 py36hd81dba3_0
[conda] mkldnn 0.16.1 0 mingfeima
[conda] pytorch 1.0.1 cuda100py36he554f03_0
[conda] torchvision 0.2.1 py36_0
Additional context
If this is not a bug and I am simply setting up the code incorrectly, please let me know and I will instead file a feature request for a good C++ data_parallel example.