Reductions returning scalars cause implicit sync-point #1989

@hughperkins

Description

Scalar reduction results are currently returned as host-side scalars, which causes an implicit sync point and a gpu->host copy. See the following results:

=================
it 0
------ .sum(), returns a host-side scalar ---------
time for .sum() 0.040
time for synchronize 0.000
res 2000001152.0
---- .sum(1), returns proxy to gpu-side tensor -----
time for .sum(1) 0.000
time for synchronize 0.462
res
 2.0000e+09
[torch.cuda.FloatTensor of size 1x1 (GPU 0)]

=================
it 1
------ .sum(), returns a host-side scalar ---------
time for .sum() 0.039
time for synchronize 0.000
---- .sum(1), returns proxy to gpu-side tensor -----
time for .sum(1) 0.000
time for synchronize 0.462
=================
it 2
------ .sum(), returns a host-side scalar ---------
time for .sum() 0.040
time for synchronize 0.000
---- .sum(1), returns proxy to gpu-side tensor -----
time for .sum(1) 0.000
time for synchronize 0.456

You can see that .sum(1) returns instantly, and the pain doesn't happen until the later .synchronize(). But .sum() waits for the entire GPU operation to finish and returns a host-side scalar. Behind the scenes, .sum() is implicitly (spelled out in the sketch after this list):

  • waiting for the entire operation to finish
  • synchronizing
  • copying the result from the GPU back to the host
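
Written out by hand, those three steps look roughly like this (a sketch only, using the same 1 x N layout as the test code below; this is my reading of the observed behavior, not the actual implementation):

import torch

a = torch.ones(1, 1000).cuda()

# roughly what a.sum() does behind the scenes today:
res_gpu = a.sum(1)                # reduction result, still on the gpu
torch.cuda.synchronize()          # wait for the queued gpu work to finish
res = res_gpu.cpu().view(-1)[0]   # gpu -> host copy, unwrap to a scalar
print(res)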

I'm not sure this is desirable? It's also inconsistent with how numpy works: numpy keeps reduction results as numpy scalar types and doesn't convert them to plain Python scalars until you call .item().

Test code for the implicit sync points:

import torch
import time


N = 1000 * 1000 * 1000

def implicit_sync(print_res=False):
    print('------ .sum(), returns a host-side scalar ---------')
    start = time.time()
    res = a.sum()  # full reduction: comes back as a host-side scalar
    print('time for .sum() %.3f' % (time.time() - start))
    start = time.time()
    torch.cuda.synchronize()
    print('time for synchronize %.3f' % (time.time() - start))
    if print_res:
        print('res', res)

def stays_on_gpu(print_res=False):
    print('---- .sum(1), returns proxy to gpu-side tensor -----')
    start = time.time()
    res = a.sum(1)  # reduction along dim 1: result stays on the gpu
    print('time for .sum(1) %.3f' % (time.time() - start))
    start = time.time()
    torch.cuda.synchronize()
    print('time for synchronize %.3f' % (time.time() - start))
    if print_res:
        print('res', res)


if __name__ == '__main__':
    a = torch.ones(1, N).cuda()
    a += 1
    torch.cuda.synchronize()  # make sure setup finishes before timing
    for it in range(3):
        print('=================')
        print('it', it)
        implicit_sync(it == 0)  # print the result on the first iteration only
        stays_on_gpu(it == 0)

Effect of .item() in numpy:

import numpy as np

a = np.zeros(3)
print('type(a[0])', type(a[0]))
print('type(a[0].item())', type(a[0].item()))

Result:

type(a[0]) <class 'numpy.float64'>
type(a[0].item()) <class 'float'>
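
If PyTorch mirrored this convention, the sync and copy would be deferred until the caller explicitly asks for a host value. A minimal sketch of what that could look like, assuming a hypothetical .item() method on cuda tensors (mirroring numpy's; not an API pytorch has today):

import torch

a = torch.ones(1, 1000).cuda()
res = a.sum()     # would return a gpu-side 1-element tensor: no sync yet
# ... more gpu work could be queued here, overlapping with the reduction ...
val = res.item()  # hypothetical: explicit sync + gpu->host copy only here
print(type(val))  # would be a plain Python float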

I know this would need a ton of work behind the scenes, and for conv nets there is only one reduce, in the loss function. But it seems likely it could have a noticeable effect on Python code that does some kind of normalization using fundamental pytorch operations (see the sketch below). As it stands, this issue means such kernels need to be written in CUDA, rather than in Python, to get good performance. Fixing it would make it easier for Python guys to write high-performance kernels for novel research layers.
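
For concreteness, here is a hypothetical normalization written with fundamental pytorch ops (the names are illustrative, not from anything above); under the current behavior, every .sum() call in the loop forces a hidden sync plus a gpu->host copy:

import torch

def normalize(x):
    total = x.sum()   # host-side scalar: implicit sync + gpu->host copy
    return x / total  # the scalar then gets fed back into gpu arithmetic

x = torch.ones(1000, 1000).cuda()
for _ in range(100):
    x = normalize(x)  # 100 hidden sync points in this loop
torch.cuda.synchronize()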
