"Reduce Failed to Synchronise" in F.binary_cross_entropy #5560
Closed
Labels
module: cuda — Related to torch.cuda, and CUDA support in general
todo — Not as important as medium or high priority tasks, but we will work on these.
triaged — This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
Description
Since upgrading PyTorch to the master branch, I am occasionally receiving the following error:
/home/user/cuda-ubuntu-16.04-ec2/pytorch/aten/src/THCUNN/BCECriterion.cu:30: Acctype bce_functor<Dtype, Acctype>::operator()(Tuple) [with Tuple = thrust::detail::tuple_of_iterator_references<thrust::device_reference<float>, thrust::device_reference<float>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>, Dtype = float, Acctype = float]: block: [0,0,0], thread: [223,0,0] Assertion `input >= 0. && input <= 1.` failed.
Traceback (most recent call last):
File "train_model.py", line 138, in <module>
train_model(config)
File "train_model.py", line 105, in train_model
worker.train(train_loader, plot_lr=plot_lr, on_iter=on_iter)
File "/home/user/src/worker.py", line 204, in train
time_loss = F.binary_cross_entropy(time_pred, time_hist.float())
File "/home/user/miniconda3/envs/cuda/lib/python3.6/site-packages/torch/nn/functional.py", line 1507, in binary_cross_entropy
return torch._C._nn.binary_cross_entropy(input, target, weight, size_average, reduce)
RuntimeError: reduce failed to synchronize: device-side assert triggered
In this trace, time_pred is the output of a linear network with nn.Sigmoid() applied at the output, and time_hist comes from a binary dataset, which I am confident is correct (because training completes multiple epochs before it fails).
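The device-side assert in BCECriterion.cu fires when any element of the input falls outside [0, 1], which can happen even after a sigmoid if the network produces NaN or Inf. A minimal sketch of a CPU-side guard (the `checked_bce` helper is hypothetical, not part of the original training code) that surfaces a readable Python error instead of the opaque CUDA assert:

```python
import torch
import torch.nn.functional as F

# Hypothetical wrapper mirroring the CUDA-side assertion
# `input >= 0. && input <= 1.` from BCECriterion.cu. Checking on the
# Python side catches bad values (e.g. NaN from an exploding network)
# before the kernel launch, so the error points at the real cause.
def checked_bce(pred, target):
    # NaN comparisons are False, so this also catches NaN inputs.
    if not torch.all((pred >= 0) & (pred <= 1)):
        raise ValueError("binary_cross_entropy input outside [0, 1] "
                         "(possible NaN/Inf from the network)")
    return F.binary_cross_entropy(pred, target)
```

Dropping this in for the `F.binary_cross_entropy` call in `worker.py` would distinguish a genuinely out-of-range prediction from a stale device-side assert left over from an earlier kernel.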
I haven't checked if F.binary_cross_entropy_with_logits fixes the issue.
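For reference, the usual shape of that workaround is to remove the final nn.Sigmoid() from the model and pass the raw logits directly. F.binary_cross_entropy_with_logits fuses the sigmoid and the BCE in a numerically stable form and has no [0, 1] input assert. A sketch with placeholder tensors (not the original training data):

```python
import torch
import torch.nn.functional as F

# Placeholder raw network output (no sigmoid applied) and binary targets.
logits = torch.randn(8, 1)
targets = torch.randint(0, 2, (8, 1)).float()

# Fused, numerically stable sigmoid + BCE; tolerates any real-valued input.
loss = F.binary_cross_entropy_with_logits(logits, targets)

# Mathematically equivalent (up to floating-point error) to the
# two-step version that triggers the assert when sigmoid emits NaN:
ref = F.binary_cross_entropy(torch.sigmoid(logits), targets)
```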
System details:
- OS: Ubuntu 16.04
- PyTorch version: 0.4.0a0+55c64e5
- How you installed PyTorch (conda, pip, source): source
- Python version: Python 3.6.1
- CUDA/cuDNN version: CUDA release 9.0, V9.0.176 / cuDNN 7005
- GPU models and configuration: 4x Nvidia M60
- GCC version (if compiling from source): GCC 4.4.7