"Reduce Failed to Synchronise" in F.binary_cross_entropy  #5560

@angusturner

Description

Since upgrading PyTorch to the master branch, I am occasionally receiving the following error:

/home/user/cuda-ubuntu-16.04-ec2/pytorch/aten/src/THCUNN/BCECriterion.cu:30: Acctype bce_functor<Dtype, Acctype>::operator()(Tuple) [with Tuple = thrust::detail::tuple_of_iterator_references<thrust::device_reference<float>, thrust::device_reference<float>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>, Dtype = float, Acctype = float]: block: [0,0,0], thread: [223,0,0] Assertion `input >= 0. && input <= 1.` failed.
Traceback (most recent call last):
  File "train_model.py", line 138, in <module>
    train_model(config)
  File "train_model.py", line 105, in train_model
    worker.train(train_loader, plot_lr=plot_lr, on_iter=on_iter)
  File "/home/user/src/worker.py", line 204, in train
    time_loss = F.binary_cross_entropy(time_pred, time_hist.float())
  File "/home/user/miniconda3/envs/cuda/lib/python3.6/site-packages/torch/nn/functional.py", line 1507, in binary_cross_entropy
    return torch._C._nn.binary_cross_entropy(input, target, weight, size_average, reduce)
RuntimeError: reduce failed to synchronize: device-side assert triggered

In this trace, time_pred is the output of a linear network with nn.Sigmoid() on the output, and time_hist comes from a binary dataset, which I am confident is correct (because I can complete multiple epochs before it fails).
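Since the device-side assert `input >= 0. && input <= 1.` only fires inside the CUDA kernel, one way to get a readable error at the failing iteration is a host-side check before the loss call. This is a hypothetical helper (not part of the issue); note that a NaN produced upstream (e.g. by exploding gradients) also fails the same assertion, even though sigmoid output is nominally in [0, 1]:

```python
import torch

def check_bce_input(pred: torch.Tensor) -> None:
    # Hypothetical sanity check: fail fast on the host instead of
    # triggering the opaque CUDA device-side assert in BCECriterion.cu.
    if torch.isnan(pred).any():
        raise ValueError("BCE input contains NaN")
    # Sigmoid output should lie in [0, 1]; anything else trips the assert.
    if pred.min().item() < 0.0 or pred.max().item() > 1.0:
        raise ValueError("BCE input outside [0, 1]")
```

Calling this on `time_pred` right before `F.binary_cross_entropy` would distinguish a NaN blow-up from a genuinely out-of-range value.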

I haven't checked if F.binary_cross_entropy_with_logits fixes the issue.
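For reference, a minimal sketch of that alternative: `binary_cross_entropy_with_logits` takes the raw (pre-sigmoid) scores and fuses the sigmoid into a numerically stable log-sum-exp form, so it cannot hit the `input >= 0. && input <= 1.` assertion. The tensors below are illustrative, not from the issue:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(8, requires_grad=True)      # raw network output, no nn.Sigmoid()
target = torch.randint(0, 2, (8,)).float()       # binary targets

# Fused, numerically stable form operating on logits directly.
loss_fused = F.binary_cross_entropy_with_logits(logits, target)

# Equivalent two-step form; this is the path that can assert on CUDA
# if sigmoid's output is perturbed out of [0, 1] or becomes NaN.
loss_naive = F.binary_cross_entropy(torch.sigmoid(logits), target)
```

For well-behaved inputs the two losses agree; the fused version simply avoids materialising a probability tensor that the kernel then re-validates.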

System details:

  • OS: Ubuntu 16.04
  • PyTorch version: 0.4.0a0+55c64e5
  • How you installed PyTorch (conda, pip, source): source
  • Python version: Python 3.6.1
  • CUDA/cuDNN version: CUDA release 9.0, V9.0.176 / CUDNN 7005
  • GPU models and configuration: 4x Nvidia M60
  • GCC version (if compiling from source): GCC 4.4.7

Metadata

Labels

  • module: cuda — Related to torch.cuda, and CUDA support in general
  • todo — Not as important as medium or high priority tasks, but we will work on these.
  • triaged — This issue has been looked at by a team member, and triaged and prioritized into an appropriate module