Closed
Labels
oncall: distributed, triaged
Description
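🐛 Bug
Calling torch.distributed.reduce on an empty CUDA tensor raises a RuntimeError (the trace below goes through ProcessGroupNCCL::reduce).
Function call: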
torch.distributed.reduce(packed, dst=0)
Inputs:
packed = torch.cuda.FloatTensor([])
Trace:
File "/home/xxxx/src/utils.py", line 875, in log_step_end
torch.distributed.reduce(packed, dst=src.distributed.get_master_rank())
File "/home/xxxx/.local/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1002, in reduce
work = _default_pg.reduce([tensor], opts)
RuntimeError: invalid device pointer: %p0 (recordStream at /pytorch/c10/cuda/CUDACachingAllocator.cpp:384)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7f6d5d33d441 in /home/michaelp/.local/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7f6d5d33cd7a in /home/michaelp/.local/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::CUDACachingAllocator::recordStream(void*, c10::cuda::CUDAStream) + 0x1f5 (0x7f6d5ae272b5 in /home/michaelp/.local/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x741bf2 (0x7f6d5deafbf2 in /home/michaelp/.local/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #4: c10d::ProcessGroupNCCL::reduce(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::ReduceOptions const&) + 0x35 (0x7f6d5deb08b5 in /home/michaelp/.local/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x6c327c (0x7f6d5de3127c in /home/michaelp/.local/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x130cfc (0x7f6d5d89ecfc in /home/michaelp/.local/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #7: _PyCFunction_FastCallDict + 0x35c (0x565d5c in /usr/bin/python3)
frame #8: /usr/bin/python3() [0x503073]
frame #9: _PyEval_EvalFrameDefault + 0x449 (0x506859 in /usr/bin/python3)
frame #10: /usr/bin/python3() [0x504c28]
frame #11: /usr/bin/python3() [0x502540]
frame #12: /usr/bin/python3() [0x502f3d]
frame #13: _PyEval_EvalFrameDefault + 0x1231 (0x507641 in /usr/bin/python3)
frame #14: /usr/bin/python3() [0x502209]
frame #15: /usr/bin/python3() [0x502f3d]
frame #16: _PyEval_EvalFrameDefault + 0x449 (0x506859 in /usr/bin/python3)
frame #17: /usr/bin/python3() [0x502209]
frame #18: /usr/bin/python3() [0x502f3d]
frame #19: _PyEval_EvalFrameDefault + 0x449 (0x506859 in /usr/bin/python3)
frame #20: /usr/bin/python3() [0x504c28]
frame #21: /usr/bin/python3() [0x58650d]
frame #22: PyObject_Call + 0x3e (0x59ebbe in /usr/bin/python3)
frame #23: _PyEval_EvalFrameDefault + 0x1807 (0x507c17 in /usr/bin/python3)
frame #24: /usr/bin/python3() [0x504c28]
frame #25: /usr/bin/python3() [0x502540]
frame #26: /usr/bin/python3() [0x502f3d]
frame #27: _PyEval_EvalFrameDefault + 0x1231 (0x507641 in /usr/bin/python3)
frame #28: /usr/bin/python3() [0x504c28]
frame #29: /usr/bin/python3() [0x502540]
frame #30: /usr/bin/python3() [0x502f3d]
frame #31: _PyEval_EvalFrameDefault + 0x449 (0x506859 in /usr/bin/python3)
frame #32: /usr/bin/python3() [0x504c28]
frame #33: /usr/bin/python3() [0x502540]
frame #34: /usr/bin/python3() [0x502f3d]
frame #35: _PyEval_EvalFrameDefault + 0x1231 (0x507641 in /usr/bin/python3)
frame #36: /usr/bin/python3() [0x504c28]
frame #37: /usr/bin/python3() [0x511eca]
frame #38: /usr/bin/python3() [0x502d6f]
frame #39: _PyEval_EvalFrameDefault + 0x449 (0x506859 in /usr/bin/python3)
frame #40: /usr/bin/python3() [0x504c28]
frame #41: /usr/bin/python3() [0x502540]
frame #42: /usr/bin/python3() [0x502f3d]
frame #43: _PyEval_EvalFrameDefault + 0x449 (0x506859 in /usr/bin/python3)
frame #44: /usr/bin/python3() [0x504c28]
frame #45: /usr/bin/python3() [0x58659d]
frame #46: PyObject_Call + 0x3e (0x59ebbe in /usr/bin/python3)
frame #47: /usr/bin/python3() [0x63835b]
frame #48: Py_Main + 0x448 (0x639028 in /usr/bin/python3)
frame #49: main + 0xe0 (0x4a6f10 in /usr/bin/python3)
frame #50: __libc_start_main + 0xe7 (0x7f6d65d39b97 in /lib/x86_64-linux-gnu/libc.so.6)
frame #51: _start + 0x2a (0x5afa0a in /usr/bin/python3)
Environment
- PyTorch Version (e.g., 1.0): 1.1
- OS (e.g., Linux): Linux
- How you installed PyTorch (conda, pip, source): pip
- Python version: 3.6.7
- CUDA/cuDNN version: 10.1
- GPU models and configuration: 2 Tesla P100
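A possible caller-side workaround, until the empty-tensor case is handled inside the backend, is to skip the collective when the tensor has no elements. The following is only a sketch: `reduce_if_nonempty` and its `reduce_fn` parameter are hypothetical names introduced here, not part of PyTorch, and it assumes every rank sees the same (empty or non-empty) shape, since a rank that skips a collective the others issue would hang.

```python
def reduce_if_nonempty(tensor, dst=0, reduce_fn=None):
    """Issue a reduce only when `tensor` has at least one element.

    Hypothetical helper, not a PyTorch API. `reduce_fn` defaults to
    torch.distributed.reduce; it is a parameter so the guard itself can be
    exercised without an initialized process group.

    Assumption: all ranks pass tensors of the same shape, so they all make
    the same skip/issue decision.

    Returns True if the collective was issued, False if it was skipped.
    """
    # An empty tensor has no data to reduce; skipping avoids handing a
    # zero-sized allocation to the backend.
    if tensor.numel() == 0:
        return False

    if reduce_fn is None:
        # Imported lazily so the guard can be tested without torch installed.
        import torch.distributed as dist
        reduce_fn = dist.reduce

    reduce_fn(tensor, dst=dst)
    return True
```

In the reporter's code this would stand in for the direct call, for example `reduce_if_nonempty(packed, dst=0)`.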