Description
Scalar reduction results are currently returned as host-side scalars, causing an implicit sync point and a gpu->host copy. See the following results:
=================
it 0
------ .sum(), returns a host-side scalar ---------
time for .sum() 0.040
time for synchronize 0.000
res 2000001152.0
---- .sum(1), returns proxy to gpu-side tensor -----
time for .sum(1) 0.000
time for synchronize 0.462
res
2.0000e+09
[torch.cuda.FloatTensor of size 1x1 (GPU 0)]
=================
it 1
------ .sum(), returns a host-side scalar ---------
time for .sum() 0.039
time for synchronize 0.000
---- .sum(1), returns proxy to gpu-side tensor -----
time for .sum(1) 0.000
time for synchronize 0.462
=================
it 2
------ .sum(), returns a host-side scalar ---------
time for .sum() 0.040
time for synchronize 0.000
---- .sum(1), returns proxy to gpu-side tensor -----
time for .sum(1) 0.000
time for synchronize 0.456
You can see that .sum(1) returns instantly, and the cost isn't paid until the later .synchronize(). But .sum() waits for the entire GPU operation to finish and returns a host-side scalar. Behind the scenes, .sum() is implicitly:
- waiting for the entire operation to finish
- syncing
- copying the result from the GPU back to the host side
I'm not sure this is desirable. It's also inconsistent with how numpy works, which doesn't convert a numpy scalar into a Python scalar until you call .item().
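In the meantime, the .sum(1) path works as a workaround: reducing over an explicit dimension returns a GPU-side tensor, so the sync can be deferred to a point of our choosing. A minimal sketch (the shapes and sizes here are mine, not from the original repro):

import torch

a = torch.ones(1, 1000000).cuda()

res = a.sum(1)            # enqueued asynchronously; res stays on the GPU
# ... enqueue further GPU work here without stalling the host ...
torch.cuda.synchronize()  # pay the wait once, where it suits us
print('res', res)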
Test code for the implicit sync points:
import torch
import time

N = 1000 * 1000 * 1000

def implicit_sync(print_res=False):
    print('------ .sum(), returns a host-side scalar ---------')
    start = time.time()
    res = a.sum()  # blocks: waits for the GPU, then copies the scalar to the host
    print('time for .sum() %.3f' % (time.time() - start))
    start = time.time()
    torch.cuda.synchronize()  # nothing left to wait for
    print('time for synchronize %.3f' % (time.time() - start))
    if print_res:
        print('res', res)

def stays_on_gpu(print_res=False):
    print('---- .sum(1), returns proxy to gpu-side tensor -----')
    start = time.time()
    res = a.sum(1)  # returns immediately; the result stays on the GPU
    print('time for .sum(1) %.3f' % (time.time() - start))
    start = time.time()
    torch.cuda.synchronize()  # the wait happens here instead
    print('time for synchronize %.3f' % (time.time() - start))
    if print_res:
        print('res', res)

if __name__ == '__main__':
    a = torch.ones(1, N).cuda()
    a += 1
    torch.cuda.synchronize()  # exclude setup from the timings
    for it in range(3):
        print('=================')
        print('it', it)
        implicit_sync(it == 0)
        stays_on_gpu(it == 0)
Effect of .item() in numpy:
import numpy as np
a = np.zeros(3)
print('type(a[0])', type(a[0]))
print('type(a[0].item())', type(a[0].item()))
Result:
type(a[0]) <class 'numpy.float64'>
type(a[0].item()) <class 'float'>
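By analogy, what this issue is asking for might look like the following in torch (a sketch of the proposed behavior, not how it currently works):

import torch

a = torch.ones(1000000).cuda()
res = a.sum()     # proposed: returns a gpu-side tensor, no implicit sync
val = res.item()  # the explicit conversion to a Python float becomes the sync point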
I know this would need a ton of work behind the scenes, and for conv nets there is only one reduce, in the loss function. But it seems likely to have a noticeable effect on Python code that does some kind of normalization using fundamental pytorch operations, as sketched below. As it stands, such kernels need to be written in CUDA rather than in Python to obtain good performance. Fixing this issue would make it easier for Python developers to write high-performance kernels for novel research layers.
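To make that concrete, here is a hypothetical normalization written with fundamental operations (the names and shapes are mine). Under the behavior described in this issue, each scalar reduction is an implicit sync point, while the dim-reduction variant keeps everything on the GPU:

import torch

x = torch.randn(1024, 1024).cuda()

# Host-side scalar reductions: each one blocks the host and copies
# a single value back before the next kernel can be enqueued.
mu = x.mean()                     # implicit sync + gpu->host copy
centered = x - mu                 # host scalar broadcast back to the GPU
normed = centered / centered.std()  # another implicit sync

# GPU-side variant: reductions over an explicit dim return tensors,
# so the whole chain is enqueued without a single host round trip.
flat = x.view(1, -1)
mu_t = flat.mean(1)                            # stays on the GPU
centered_t = flat - mu_t.expand_as(flat)
normed_t = centered_t / centered_t.std(1).expand_as(flat)
torch.cuda.synchronize()                       # one sync, when we choose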