
Conversation

@vishwakftw
Contributor

@vishwakftw vishwakftw commented Dec 15, 2018

Changelog:

  • Implement `triu` and `tril` for batches of 2D tensors.
  • Remove the TH/THC binding for `tril`.
  • Fix the CUDA implementation.
  • Update docstrings for `tril` and `triu`.
  • Remove mask-based `triu` and `tril` in cholesky forward and backward.
  • Remove batched `tril` in `torch.distributions.utils`.

Test plan:

  • Add tests for tril and triu for CPU and CUDA.

Fixes #15016, fixes #15226, and closes #14071.

Acknowledgements:

  • Thanks to @t-vi whose implementation I used as a reference.
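For reference, the batched semantics being implemented can be sketched in plain Python. This is a hypothetical reference implementation (the names `tril_2d`, `triu_2d`, and `batched_tril` are illustrative, not the actual ATen code): `tril`/`triu` are applied independently to each 2D slice of the batch.

```python
def tril_2d(mat, k=0):
    # Keep entries on and below the k-th diagonal; zero the rest.
    return [[v if c - r <= k else 0 for c, v in enumerate(row)]
            for r, row in enumerate(mat)]

def triu_2d(mat, k=0):
    # Keep entries on and above the k-th diagonal; zero the rest.
    return [[v if c - r >= k else 0 for c, v in enumerate(row)]
            for r, row in enumerate(mat)]

def batched_tril(batch, k=0):
    # Batched semantics: apply tril independently to every 2D slice.
    return [tril_2d(m, k) for m in batch]
```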

@vishwakftw
Contributor Author

I've tried to debug this error, but to no avail. Posting it below:

Dec 15 07:42:30 /var/lib/jenkins/workspace/aten/src/ATen/cuda/CUDAApplyUtils.cuh(331): error: no instance of overloaded function "at::native::BatchTensorTriOp<T, upper>::operator() [with T=uint8_t, upper=false]" matches the argument list
Dec 15 07:42:30             argument types are: (uint8_t, uint8_t)
Dec 15 07:42:30             object type is: const at::native::BatchTensorTriOp<uint8_t, false>
Dec 15 07:42:30           detected during:
Dec 15 07:42:30             instantiation of "void at::cuda::<unnamed>::ApplyOp2<Op, scalar1, scalar2, IndexType, ADims, BDims, 0, Offset>::apply(at::cuda::detail::TensorInfo<scalar1, IndexType> &, at::cuda::detail::TensorInfo<scalar2, IndexType> &, const Op &, int, IndexType, Offset, Offset) [with Op=at::native::BatchTensorTriOp<uint8_t, false>, scalar1=uint8_t, scalar2=uint8_t, IndexType=unsigned int, ADims=1, BDims=1, Offset=const unsigned int]" 
Dec 15 07:42:30 (310): here
Dec 15 07:42:30             instantiation of "void at::cuda::<unnamed>::ApplyOp2<Op, scalar1, scalar2, IndexType, ADims, BDims, remaining_steps, Offsets...>::apply(at::cuda::detail::TensorInfo<scalar1, IndexType> &, at::cuda::detail::TensorInfo<scalar2, IndexType> &, const Op &, int, IndexType, Offsets..., Offsets...) [with Op=at::native::BatchTensorTriOp<uint8_t, false>, scalar1=uint8_t, scalar2=uint8_t, IndexType=unsigned int, ADims=1, BDims=1, remaining_steps=1, Offsets=<>]" 
Dec 15 07:42:30 (369): here
Dec 15 07:42:30             instantiation of "void at::cuda::<unnamed>::kernelPointwiseApply2<Op,scalar1,scalar2,IndexType,ADims,BDims,step>(at::cuda::detail::TensorInfo<scalar1, IndexType>, at::cuda::detail::TensorInfo<scalar2, IndexType>, IndexType, Op) [with Op=at::native::BatchTensorTriOp<uint8_t, false>, scalar1=uint8_t, scalar2=uint8_t, IndexType=unsigned int, ADims=1, BDims=1, step=1]" 
Dec 15 07:42:30 (888): here
Dec 15 07:42:30             instantiation of "__nv_bool at::cuda::CUDA_tensor_apply2<scalar1,scalar2,step,Op>(at::Tensor, at::Tensor, Op, at::cuda::TensorArgType, at::cuda::TensorArgType) [with scalar1=uint8_t, scalar2=uint8_t, step=1, Op=at::native::BatchTensorTriOp<uint8_t, false>]" 
Dec 15 07:42:30 (940): here
Dec 15 07:42:30             instantiation of "__nv_bool at::cuda::CUDA_tensor_apply2<scalar1,scalar2,Op>(at::Tensor, at::Tensor, Op, at::cuda::TensorArgType, at::cuda::TensorArgType) [with scalar1=uint8_t, scalar2=uint8_t, Op=at::native::BatchTensorTriOp<uint8_t, false>]" 
Dec 15 07:42:30 /var/lib/jenkins/workspace/aten/src/ATen/native/cuda/BatchLinearAlgebra.cu(430): here
Dec 15 07:42:30             instantiation of "void at::native::apply_triu_tril<scalar_t,inplace,upper>(at::Tensor &, const at::Tensor &, int64_t) [with scalar_t=uint8_t, inplace=true, upper=false]" 
Dec 15 07:42:30 /var/lib/jenkins/workspace/aten/src/ATen/native/cuda/BatchLinearAlgebra.cu(437): here

@zou3519
Contributor

zou3519 commented Dec 17, 2018

@vishwakftw do you get the same error message on a local build?

@vishwakftw
Contributor Author

Yes, I do.

@vishwakftw
Contributor Author

For additional context about the design: the CUDA implementation mirrors the THC implementation, where a custom op (specifically TensorTriOp, defined in THC/THCTensorMathPairwise.cu) is instantiated and passed to THC_pointwiseApplyN.
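That functor-plus-pointwise-apply pattern can be sketched in Python (a hypothetical analogue; `TriOp` and `pointwise_apply` are illustrative stand-ins, not the real CUDA code): the op carries the diagonal offset and direction, and the apply routine invokes it at every index.

```python
class TriOp:
    # Hypothetical analogue of TensorTriOp: a small functor that, given
    # an element's (row, col) position, keeps or zeroes its value.
    def __init__(self, k, upper):
        self.k = k
        self.upper = upper

    def __call__(self, r, c, v):
        keep = (c - r >= self.k) if self.upper else (c - r <= self.k)
        return v if keep else 0

def pointwise_apply(mat, op):
    # Crude stand-in for THC_pointwiseApplyN: call op at every index.
    return [[op(r, c, v) for c, v in enumerate(row)]
            for r, row in enumerate(mat)]
```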

@vishwakftw vishwakftw changed the title [WIP] Batched upper triangular, lower triangular [ready for review] Batched upper triangular, lower triangular Dec 19, 2018
@vishwakftw
Contributor Author

vishwakftw commented Dec 21, 2018

@zou3519 is this good to go?

Also, should I remove the TH/THC implementations as well?

…timize CPU implementation

- The thrust implementation seemed to be incredibly slow
- The CPU implementation was bottlenecked by a clone() op
- Add test cases for non-square matrices, and an addition based test
@zou3519
Contributor

zou3519 commented Dec 26, 2018

@vishwakftw I'll take a look later today

@vishwakftw
Contributor Author

Thank you, much obliged.

@zou3519
Contributor

zou3519 commented Dec 26, 2018

No, thank you for the contribution :)

@zou3519
Contributor

zou3519 commented Dec 26, 2018

Distributions tests seem to be failing

Contributor

@zou3519 zou3519 left a comment


CPU code looks fine, still reading the CUDA kernel

- Add edge cases
- Remove redundant tests
- Fix for non-contiguous case
- Pop batch_tril in torch.distributions.utils
- Remove tril in TH only (triu cannot be removed from TH due to a requirement in THTensorLapack; neither tril nor triu can be removed from THC due to dependencies in THCTensorMathMagma.cu)
@vishwakftw
Contributor Author

CUDA tests are failing with this error: RuntimeError: cuda runtime error (9) : invalid configuration argument at /var/lib/jenkins/workspace/aten/src/THC/THCTensorMathCompareT.cuh:69

@vishwakftw
Contributor Author

@ngimel @zou3519 I have made changes as recommended. Could you please take a look?

@mrshenli mrshenli self-requested a review January 8, 2019 17:53
Collaborator

@ngimel ngimel left a comment


I did not check the CPU part. I requested small changes, but this is generally good now.

- Make the contiguous check weaker
- Grid dimension computation simplification
- Make (batch) contiguous checks more conservative
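The grid-dimension computation mentioned above can be sketched as follows. This is a hypothetical simplification in Python (`ceil_div` and `grid_dims` are illustrative names, not the actual kernel-launch code): elements within one matrix are covered by the grid's x-dimension, while the batch index is mapped to the y-dimension.

```python
def ceil_div(a, b):
    # Round-up integer division, the usual grid-size idiom.
    return (a + b - 1) // b

def grid_dims(elems_per_matrix, n_batches, block_size=256):
    # x-dimension covers elements within one matrix; the batch index is
    # mapped to the grid's y-dimension (whose hardware limit is 65535).
    return (ceil_div(elems_per_matrix, block_size), n_batches)
```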
Contributor

@mrshenli mrshenli left a comment


Thank you for taking care of this. Great work!

Contributor

@facebook-github-bot facebook-github-bot left a comment


@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Jan 10, 2019
Summary:
Changelog:

- Implements `triu` and `tril` for batches of 2D tensors.
- Remove TH/THC binding for `tril`
- Fix CUDA implementation
- Update docstrings for tril and triu.
- Remove mask-based `triu` and `tril` in cholesky forward and backward.
- Remove batched tril in torch.distributions.utils
Pull Request resolved: pytorch/pytorch#15257

Differential Revision: D13613888

Pulled By: mrshenli

fbshipit-source-id: 0949a05b9b8e974c1acfaf02a6284848ec5cc1c4
@zou3519 zou3519 added this to the 1.0.1 milestone Jan 10, 2019
@vishwakftw vishwakftw deleted the batched-tril-triu branch January 18, 2019 04:02
vishwakftw added a commit to vishwakftw/pytorch that referenced this pull request Jan 18, 2019
@soumith soumith added the cherry-picked This PR was cherry-picked onto a release branch from master label Jan 18, 2019
soumith pushed a commit that referenced this pull request Jan 18, 2019
soumith pushed a commit that referenced this pull request Jan 29, 2019
@yongheng1991

Hi,
Thanks very much for your contribution.
Have you considered the case where the batch size is larger than 65536? I got an error when I tried that. I think it might be because `magma_int_t` is used to define `batch_size` in

magma_int_t batch_size = magma_int_cast(batchCount(A), "batchCount");

and `magma_int_t` is a 32-bit integer by default.
Best

@vishwakftw
Contributor Author

Hi Yongheng, thanks for the message. Are you facing issues in triu / tril or in triangular_solve?

@yongheng1991

yongheng1991 commented May 2, 2019

Yes. I am using it in a batch-wise eigenvalue decomposition. There is an error when the batch size is larger than 65535.
Here is part of the code:

        auto tmp_gxu = at::triu(gx.transpose(1, 2), 1);
        gx = gx.triu_().add_(tmp_gxu);

And this is the CUDA error:

RuntimeError: CUDA error: invalid configuration argument (triu_tril_cuda_template at /opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/ATen/native/cuda/BatchLinearAlgebra.cu:709)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7fb0c0dd7dc5 in /home/zhao/anaconda3/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: at::Tensor& at::native::triu_tril_cuda_template(at::Tensor&, at::Tensor const&, long, char const*) + 0x2ce (0x7fb0c67bec0e in /home/zhao/anaconda3/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #2: at::native::triu_cuda_out(at::Tensor&, at::Tensor const&, long) + 0xc6 (0x7fb0c67b27a6 in /home/zhao/anaconda3/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #3: at::CUDAType::triu_out(at::Tensor&, at::Tensor const&, long) const + 0xd6 (0x7fb0c53fefa6 in /home/zhao/anaconda3/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #4: at::native::triu(at::Tensor const&, long) + 0x6a (0x7fb0c163d94a in /home/zhao/anaconda3/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #5: at::TypeDefault::triu(at::Tensor const&, long) const + 0x5d (0x7fb0c1a749cd in /home/zhao/anaconda3/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #6: torch::autograd::VariableType::triu(at::Tensor const&, long) const + 0x44f (0x7fb0be7a0bbf in /home/zhao/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #7: batch_symeig_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, bool) + 0x75c (0x7fb0aa68052c in /home/zhao/anaconda3/lib/python3.6/site-packages/torch_autograd_solver-0.0.0-py3.6-linux-x86_64.egg/torch_autograd_solver_aten.cpython-36m-x86_64-linux-gnu.so)
frame #8: + 0xeea4 (0x7fb0aa685ea4 in /home/zhao/anaconda3/lib/python3.6/site-packages/torch_autograd_solver-0.0.0-py3.6-linux-x86_64.egg/torch_autograd_solver_aten.cpython-36m-x86_64-linux-gnu.so)
frame #9: + 0xf13e (0x7fb0aa68613e in /home/zhao/anaconda3/lib/python3.6/site-packages/torch_autograd_solver-0.0.0-py3.6-linux-x86_64.egg/torch_autograd_solver_aten.cpython-36m-x86_64-linux-gnu.so)
frame #10: + 0x13585 (0x7fb0aa68a585 in /home/zhao/anaconda3/lib/python3.6/site-packages/torch_autograd_solver-0.0.0-py3.6-linux-x86_64.egg/torch_autograd_solver_aten.cpython-36m-x86_64-linux-gnu.so)

frame #18: THPFunction_do_backward(THPFunction*, _object*) + 0xf6 (0x7fb0f005d8a6 in /home/zhao/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #22: torch::autograd::PyFunction::legacy_apply(std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> > const&) + 0xdf (0x7fb0f005dc4f in /home/zhao/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #23: torch::autograd::PyFunction::apply(std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> >&&) + 0x837 (0x7fb0f005fa57 in /home/zhao/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #24: + 0x307622 (0x7fb0be40e622 in /home/zhao/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #25: torch::autograd::Engine::evaluate_function(torch::autograd::FunctionTask&) + 0x385 (0x7fb0be407745 in /home/zhao/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #26: torch::autograd::Engine::thread_main(torch::autograd::GraphTask*) + 0xc0 (0x7fb0be409740 in /home/zhao/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #27: torch::autograd::Engine::thread_init(int) + 0x2b0 (0x7fb0be4069e0 in /home/zhao/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #28: torch::autograd::python::PythonEngine::thread_init(int) + 0x2a (0x7fb0f0059d1a in /home/zhao/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #29: + 0xb8678 (0x7fb0bf4d0678 in /home/zhao/anaconda3/lib/python3.6/site-packages/torch/lib/../../../../libstdc++.so.6)
frame #30: + 0x76ba (0x7fb0ff7786ba in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #31: clone + 0x6d (0x7fb0ff4ae41d in /lib/x86_64-linux-gnu/libc.so.6)

@vishwakftw
Contributor Author

I think I know the reason for this: the batches for triu and tril use the y-dimension of the CUDA grid, whose limit is 65535. I will try to fix this within the next week. For now, you could try mini-batching in a loop. Sorry about the inconvenience.
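The suggested mini-batching workaround amounts to splitting the batch into chunks below the grid-y limit and concatenating the results. A hypothetical sketch in plain Python (`chunked_apply` is an illustrative helper, not a PyTorch API):

```python
def chunked_apply(batch, fn, max_chunk=65535):
    # Process the batch in chunks no larger than max_chunk, so each
    # launch stays under the grid's y-dimension limit, then stitch the
    # per-chunk results back together.
    out = []
    for start in range(0, len(batch), max_chunk):
        out.extend(fn(batch[start:start + max_chunk]))
    return out
```

With torch, `fn` would be e.g. `lambda t: list(torch.triu(torch.stack(t)))` applied per chunk.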

@yongheng1991

Thanks very much. Glad to help

@vishwakftw
Contributor Author

@yongheng1991 Please take a look at #21067 which adds support for batch sizes > 65535.

Successfully merging this pull request may close these issues:

- torch.tril and torch.triu produce incorrect results with device='cuda'
- torch.tril does not support 0-sized dims
- Batched triu, tril

9 participants