Quantile is limited to 16 million elements and has poor performance. #64947
Description
🐛 Bug
```python
x = torch.randn(17_000_000)
q = x.quantile(torch.tensor([0.1, 0.5]))
```
This throws an error that the tensor is too large: `RuntimeError: quantile() input tensor is too large.`
In addition to this, the performance is very poor. Looking at the C++ implementation on GitHub, a sort that returns the indices is used, which consumes a lot of memory and is very slow. I computed some quantiles of a 16 million element tensor and it took 2.3 s; the equivalent operation in NumPy took 0.2 s.
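A minimal benchmark sketch of the timing comparison above, reduced to 1 million elements so it stays under the size limit and runs quickly (the 2.3 s vs 0.2 s numbers in this report were measured at 16 million elements; the exact ratio will vary by machine):

```python
import time

import numpy as np
import torch

n = 1_000_000  # reduced from 16M for a quick illustrative run
x = torch.randn(n)
qs = torch.tensor([0.1, 0.5])

# Time torch.quantile.
t0 = time.perf_counter()
tq = x.quantile(qs)
torch_time = time.perf_counter() - t0

# Time np.quantile on the same data.
xn = x.numpy()
t0 = time.perf_counter()
nq = np.quantile(xn, [0.1, 0.5])
numpy_time = time.perf_counter() - t0

# Both default to linear interpolation, so the results should agree
# up to floating-point tolerance.
print(tq, nq, torch_time, numpy_time)
```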
To Reproduce
See above.
Expected behavior
That I can compute the quantile of very large tensors, that it requires much less memory than now and that it is about 10 times faster.
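As a stopgap until this is fixed, one possible workaround (a sketch, not the PyTorch implementation) is to pick the nearest order statistic with `torch.kthvalue`, which is not subject to the 16-million-element limit of `torch.quantile`, at the cost of skipping the linear interpolation between neighboring elements:

```python
import torch


def quantile_via_kthvalue(x: torch.Tensor, q: float) -> torch.Tensor:
    # Workaround sketch: approximate the q-quantile by the nearest
    # order statistic instead of interpolating between neighbors.
    n = x.numel()
    # Nearest 1-based rank for quantile q, clamped to [1, n].
    k = min(max(int(round(q * (n - 1))) + 1, 1), n)
    return torch.kthvalue(x.flatten(), k).values


# Works on tensors above the torch.quantile size limit.
x = torch.randn(17_000_000)
med = quantile_via_kthvalue(x, 0.5)
```

This trades the exact interpolated quantile for the nearest sample, which is usually acceptable at this tensor size.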
Environment
PyTorch version: 1.9.0+cu111
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A
OS: Microsoft Windows 10 Home
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: N/A
Python version: 3.9.7 (tags/v3.9.7:1016ef3, Aug 30 2021, 20:19:38) [MSC v.1929 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.19042-SP0
Is CUDA available: True
CUDA runtime version: 11.4.100
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3070 Laptop GPU
Nvidia driver version: 471.41
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.21.2
[pip3] torch==1.9.0+cu111
[pip3] torchaudio==0.9.0
[pip3] torchvision==0.10.0+cu111
[conda] Could not collect
cc @msaroufim @jerryzh168 @mruberry @rgommers @VitalyFedyunin @ngimel @heitorschueroff