Description
🐛 Bug
When passing a specific 4D tensor to triu or tril, the GPU implementation produces non-deterministic results. The CPU implementation produces nan values.
To Reproduce
This is the minimal reproducible example I could come up with:

```python
import torch

x = torch.randn(1, 4, 4, 4)
x = x.transpose(0, 1)
for i in range(10):
    # note: results are often different on each run
    # or on CPU, outputs `nan`
    print(x.triu().sum())
```

Note that the issue does not seem to occur in any of the following cases (sketched below):
- the transpose is omitted
- if `x.size(0) > 1`
- if the tensor has more or fewer dimensions
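For reference, a sketch of the variants that, per the list above, do not seem to trigger the problem. The shapes are illustrative choices on my part, and I am assuming `x.size(0)` refers to the tensor before the transpose:

```python
import torch

# 1) no transpose
a = torch.randn(1, 4, 4, 4)
print(a.triu().sum())

# 2) leading dimension larger than 1 before the transpose
b = torch.randn(2, 4, 4, 4).transpose(0, 1)
print(b.triu().sum())

# 3) fewer dimensions (3D) or more dimensions (5D)
c = torch.randn(1, 4, 4).transpose(0, 1)
d = torch.randn(1, 2, 4, 4, 4).transpose(0, 1)
print(c.triu().sum(), d.triu().sum())
```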
Expected behavior
The result should be deterministic and should not contain NaN values.
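A minimal sketch of the check I would expect to pass, using the same repro tensor as above (the variable names here are mine, not from the report):

```python
import torch

x = torch.randn(1, 4, 4, 4).transpose(0, 1)

first = x.triu()
# expected behavior 1: no NaN values in the output
assert not torch.isnan(first).any(), "triu produced NaN values"

for _ in range(10):
    out = x.triu()
    # expected behavior 2: repeated calls on the same input are bitwise identical
    assert torch.equal(out, first), "triu result changed between runs"
```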
Environment
PyTorch version: 1.1.0
Is debug build: No
CUDA used to build PyTorch: 9.0.176
OS: Ubuntu 16.04.5 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
CMake version: Could not collect
Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration: GPU 0: Tesla V100-SXM2-16GB
Nvidia driver version: 418.67
cuDNN version: 7501
Versions of relevant libraries:
[pip] numpy==1.16.1
[pip] torch==1.1.0
[pip] torchfile==0.1.0
[conda] torch 1.1.0
[conda] torchfile 0.1.0