C++ Frontend data_parallel Does Not Update Weights #19540

@nmerrill67

🐛 Bug

I have adapted the mnist.cpp example to run on two GPUs using torch::nn::parallel::data_parallel. However, the loss does not decrease, and the accuracy never rises above 0.114 (the original single-GPU example reaches about 0.99).
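
One way to confirm that the weights are not changing at all (rather than changing too slowly) is to snapshot them before and after an optimizer step and compare. The following is only a minimal sketch in plain C++: `max_abs_change` and `weights_unchanged` are hypothetical helper names, not part of LibTorch, and it assumes the parameters have already been copied into flat `std::vector<float>` snapshots (how the snapshots are filled from the model is left out).

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not part of LibTorch): largest absolute element-wise
// difference between two flattened parameter snapshots of equal length.
double max_abs_change(const std::vector<float>& before,
                      const std::vector<float>& after) {
  if (before.size() != after.size()) {
    throw std::invalid_argument("snapshots must have the same length");
  }
  double max_change = 0.0;
  for (std::size_t i = 0; i < before.size(); ++i) {
    const double diff =
        std::fabs(static_cast<double>(after[i]) - static_cast<double>(before[i]));
    max_change = std::max(max_change, diff);
  }
  return max_change;
}

// Hypothetical helper: true when no weight moved by more than `tolerance`
// between the two snapshots.
bool weights_unchanged(const std::vector<float>& before,
                       const std::vector<float>& after,
                       double tolerance = 0.0) {
  return max_abs_change(before, after) <= tolerance;
}
```

If `weights_unchanged` returns true after a training step with a non-zero learning rate, the optimizer is most likely not receiving gradients for the parameters it holds.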

To Reproduce

Steps to reproduce the behavior:

  1. Compile and run the attached example (rename mnist_parallel.cpp.txt to mnist_parallel.cpp).
    mnist_parallel.cpp.txt
    CMakeLists.txt

Expected behavior

I expected the network's accuracy to improve during training, but it does not.

Environment

PyTorch version: 1.0.1
Is debug build: No
CUDA used to build PyTorch: 10.0.130

OS: Ubuntu 16.04.6 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
CMake version: version 3.14.0

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration:
GPU 0: GeForce GTX 1080 Ti
GPU 1: GeForce GTX 1080 Ti

Nvidia driver version: 410.104
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.5.0
/usr/local/cuda-10.0/targets/x86_64-linux/lib/libcudnn.so.7

Versions of relevant libraries:
[pip3] numpy==1.15.0
[conda] blas 1.0 mkl
[conda] magma-cuda10 2.4.0 1 cpbotha
[conda] magma-cuda100 2.5.0 1 pytorch
[conda] magma-cuda90 2.5.0 1 pytorch
[conda] mkl 2019.3 199
[conda] mkl-include 2019.3 199
[conda] mkl-service 1.1.2 py36he904b0f_5
[conda] mkl_fft 1.0.10 py36ha843d7b_0
[conda] mkl_random 1.0.2 py36hd81dba3_0
[conda] mkldnn 0.16.1 0 mingfeima
[conda] pytorch 1.0.1 cuda100py36he554f03_0
[conda] torchvision 0.2.1 py36_0

Additional context

If this is not a bug and I am setting up the code incorrectly, please let me know and I will instead open a feature request for a good data_parallel C++ example.

Labels

module: cpp (Related to C++ API)
oncall: distributed (Add this issue/PR to distributed oncall triage queue)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
