Reductions returning scalars cause implicit sync-point #1989

@hughperkins

Description

Scalar reduction results are currently returned as host-side scalars, which causes an implicit sync point and a gpu->host copy. See the following results:

=================
it 0
------ .sum(), returns a host-side scalar ---------
time for .sum() 0.040
time for synchronize 0.000
res 2000001152.0
---- .sum(1), returns proxy to gpu-side tensor -----
time for .sum(1) 0.000
time for synchronize 0.462
res
 2.0000e+09
[torch.cuda.FloatTensor of size 1x1 (GPU 0)]

=================
it 1
------ .sum(), returns a host-side scalar ---------
time for .sum() 0.039
time for synchronize 0.000
---- .sum(1), returns proxy to gpu-side tensor -----
time for .sum(1) 0.000
time for synchronize 0.462
=================
it 2
------ .sum(), returns a host-side scalar ---------
time for .sum() 0.040
time for synchronize 0.000
---- .sum(1), returns proxy to gpu-side tensor -----
time for .sum(1) 0.000
time for synchronize 0.456

You can see that .sum(1) returns instantly, and the pain doesn't happen until the later .synchronize(). But .sum() waits for the entire GPU operation to finish and returns a host-side scalar. Behind the scenes, .sum() is implicitly (spelled out in the sketch after this list):

  • waiting for the entire operation to finish
  • synchronizing
  • copying the result from the GPU back to the host
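
Written out by hand, those three steps look roughly like this (a sketch only, using the same 1 x N layout as the test code below; this is my reading of the observed behavior, not the actual implementation):

import torch

a = torch.ones(1, 1000).cuda()

# roughly what a.sum() does behind the scenes today:
res_gpu = a.sum(1)                # reduction result, still on the gpu
torch.cuda.synchronize()          # wait for the queued gpu work to finish
res = res_gpu.cpu().view(-1)[0]   # gpu -> host copy, unwrap to a scalar
print(res)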

I'm not sure this is desirable? It's also inconsistent with how numpy works: numpy keeps reduction results as numpy scalar types and doesn't convert them to plain Python scalars until you call .item().

Test code for the implicit sync points:

import torch
import time


N = 1000 * 1000 * 1000

def implicit_sync(print_res=False):
    print('------ .sum(), returns a host-side scalar ---------')
    start = time.time()
    res = a.sum()  # full reduction: comes back as a host-side scalar
    print('time for .sum() %.3f' % (time.time() - start))
    start = time.time()
    torch.cuda.synchronize()
    print('time for synchronize %.3f' % (time.time() - start))
    if print_res:
        print('res', res)

def stays_on_gpu(print_res=False):
    print('---- .sum(1), returns proxy to gpu-side tensor -----')
    start = time.time()
    res = a.sum(1)  # reduction along dim 1: result stays on the gpu
    print('time for .sum(1) %.3f' % (time.time() - start))
    start = time.time()
    torch.cuda.synchronize()
    print('time for synchronize %.3f' % (time.time() - start))
    if print_res:
        print('res', res)


if __name__ == '__main__':
    a = torch.ones(1, N).cuda()
    a += 1
    torch.cuda.synchronize()  # make sure setup finishes before timing
    for it in range(3):
        print('=================')
        print('it', it)
        implicit_sync(it == 0)  # print the result on the first iteration only
        stays_on_gpu(it == 0)

Effect of .item() in numpy:

import numpy as np

a = np.zeros(3)
print('type(a[0])', type(a[0]))
print('type(a[0].item())', type(a[0].item()))

Result:

type(a[0]) <class 'numpy.float64'>
type(a[0].item()) <class 'float'>
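
If PyTorch mirrored this convention, the sync and copy would be deferred until the caller explicitly asks for a host value. A minimal sketch of what that could look like, assuming a hypothetical .item() method on cuda tensors (mirroring numpy's; not an API pytorch has today):

import torch

a = torch.ones(1, 1000).cuda()
res = a.sum()     # would return a gpu-side 1-element tensor: no sync yet
# ... more gpu work could be queued here, overlapping with the reduction ...
val = res.item()  # hypothetical: explicit sync + gpu->host copy only here
print(type(val))  # would be a plain Python float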

I know this would need a ton of work behind the scenes, and for conv nets there is only one reduce, in the loss function. But it seems likely it could have a noticeable effect on Python code that does some kind of normalization using fundamental pytorch operations (see the sketch below). As it stands, this issue means such kernels need to be written in CUDA, rather than in Python, to get good performance. Fixing it would make it easier for Python guys to write high-performance kernels for novel research layers.
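
For concreteness, here is a hypothetical normalization written with fundamental pytorch ops (the names are illustrative, not from anything above); under the current behavior, every .sum() call in the loop forces a hidden sync plus a gpu->host copy:

import torch

def normalize(x):
    total = x.sum()   # host-side scalar: implicit sync + gpu->host copy
    return x / total  # the scalar then gets fed back into gpu arithmetic

x = torch.ones(1000, 1000).cuda()
for _ in range(100):
    x = normalize(x)  # 100 hidden sync points in this loop
torch.cuda.synchronize()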
