torch.nonzero slower than np.nonzero #14848
Closed
Labels
module: bootcamp — We plan to do a full writeup on the issue, and then get someone to do it for onboarding
module: performance — Issues related to performance, either of kernel code or framework glue
triaged — This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
Description
🐛 Bug
torch.nonzero is slower than np.nonzero.
Object detection libraries such as maskrcnn_benchmark rely heavily on this function to select proposals, so a slow nonzero can increase inference time. Critical parts of PyTorch, such as indexing, also rely on torch.nonzero.
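Part of the comparison is complicated by the fact that the two functions return differently shaped results: np.nonzero returns a tuple of index arrays (one per dimension), while torch.nonzero returns a single (N, ndim) tensor of coordinates, which in NumPy terms corresponds to np.argwhere, i.e. np.transpose(np.nonzero(a)). A minimal NumPy-only sketch of the two layouts:

```python
import numpy as np

a = np.array([[0.0, 1.5], [2.0, 0.0]])

# np.nonzero returns one index array per dimension
rows, cols = np.nonzero(a)

# torch.nonzero instead returns a single (N, ndim) matrix of coordinates;
# np.argwhere produces that same layout in NumPy
coords = np.argwhere(a)

assert np.array_equal(coords, np.transpose(np.nonzero(a)))
```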
To Reproduce
1D tensor of size 512
import numpy as np
import torch
data = np.random.randn(512)
t_data = torch.as_tensor(data)
ct_data = torch.as_tensor(data, device='cuda')
%timeit np.nonzero(data)
4.02 µs ± 54 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit torch.nonzero(t_data)
23.7 µs ± 269 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit torch.nonzero(ct_data)
31.6 µs ± 148 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
ND tensor
import numpy as np
import torch
data = np.random.randn(16, 3, 512)
t_data = torch.as_tensor(data)
ct_data = torch.as_tensor(data, device='cuda')
%timeit np.nonzero(data)
270 µs ± 2.58 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit torch.nonzero(t_data)
3.09 ms ± 181 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit torch.nonzero(ct_data)
38.9 µs ± 348 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Expected behavior
The CPU implementation of torch.nonzero should have performance similar to np.nonzero, while the GPU implementation should be faster (at least for high-dimensional tensors).
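One caveat when reading the CUDA numbers above: CUDA kernel launches are asynchronous, so %timeit can understate GPU work unless the device is explicitly synchronized before stopping the clock (torch.nonzero does force a host-device sync because its output size is data-dependent, but this is worth ruling out when re-running the benchmark). A small timing-helper sketch; the `bench` helper and its `sync` hook are hypothetical names, with torch.cuda.synchronize being what one would pass for CUDA ops:

```python
import time
import numpy as np

def bench(fn, repeats=1000, sync=None):
    """Average wall-clock seconds per call of fn().
    `sync` is an optional callable invoked before starting and after the
    loop, e.g. torch.cuda.synchronize, so queued GPU work is included."""
    fn()  # warm-up call so one-time setup cost is excluded
    if sync:
        sync()
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    if sync:
        sync()  # wait for any queued device work before reading the clock
    return (time.perf_counter() - start) / repeats

data = np.random.randn(512)
cpu_time = bench(lambda: np.nonzero(data))
print(f"np.nonzero: {cpu_time * 1e6:.1f} µs per call")
```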
Environment
Collecting environment information...
PyTorch version: 1.0.0.dev20181024
Is debug build: No
CUDA used to build PyTorch: 9.0.176
OS: Ubuntu 16.04.4 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609
CMake version: version 3.5.1
Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 9.1.85
GPU models and configuration:
GPU 0: GeForce GTX 1080
GPU 1: GeForce GTX 1080
GPU 2: GeForce GTX 1080
GPU 3: GeForce GTX 1080
Nvidia driver version: 390.48
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.6.0.21
/usr/lib/x86_64-linux-gnu/libcudnn_static.a
/usr/local/cuda-8.0/lib64/libcudnn.so
/usr/local/cuda-8.0/lib64/libcudnn.so.5
/usr/local/cuda-8.0/lib64/libcudnn.so.5.1.5
/usr/local/cuda-8.0/lib64/libcudnn.so.6
/usr/local/cuda-8.0/lib64/libcudnn.so.6.0.21
/usr/local/cuda-8.0/lib64/libcudnn_static.a
/usr/local/cuda-9.1/lib64/libcudnn.so
/usr/local/cuda-9.1/lib64/libcudnn.so.7
/usr/local/cuda-9.1/lib64/libcudnn.so.7.1.3
/usr/local/cuda-9.1/lib64/libcudnn_static.a
Versions of relevant libraries:
[pip] msgpack-numpy (0.4.3.2)
[pip] numpy (1.15.4)
[pip] pytorch-ignite (0.1.0)
[pip] torch (1.0.0.dev20181024)
[pip] torchaudio (0.1)
[pip] torchtext (0.3.1)
[pip] torchvision (0.2.1)
[pip] torchvision-nightly (0.2.1)
[conda] pytorch 0.4.1 py36_py35_py27__9.0.176_7.1.2_2 pytorch
[conda] pytorch-nightly 1.0.0.dev20181024 py3.6_cuda9.0.176_cudnn7.1.2_0 pytorch
[conda] torchaudio 0.1 <pip>
[conda] torchtext 0.3.1 <pip>
[conda] torchvision 0.2.1 py36_1 pytorch
[conda] torchvision-nightly 0.2.1 <pip>