Improve torch.cdist performance #20605
Conversation
Current CPU implementation:

```python
import timeit

# Many rows, few columns: 100x2 vs. 200x2
SETUP_CODE = '''
import torch
from scipy.spatial.distance import cdist
a = torch.randn(100, 2)
b = torch.randn(200, 2)'''

TEST_CODE = '''torch.cdist(a, b)'''
timeit.repeat(setup=SETUP_CODE, stmt=TEST_CODE, repeat=3, number=10000)
# [4.789991227999998, 4.700422251000006, 5.000459026999991]

TEST_CODE = '''cdist(a, b)'''  # SciPy baseline
timeit.repeat(setup=SETUP_CODE, stmt=TEST_CODE, repeat=3, number=10000)
# [0.6307194809999999, 0.5901044249999998, 0.5976278489999984]

# Few rows, many columns: 2x200 vs. 2x200
SETUP_CODE = '''
import torch
from scipy.spatial.distance import cdist
a = torch.randn(2, 200)
b = torch.randn(2, 200)'''

TEST_CODE = '''torch.cdist(a, b)'''
timeit.repeat(setup=SETUP_CODE, stmt=TEST_CODE, repeat=3, number=10000)
# [0.027029143000000033, 0.02519462499999925, 0.0251151329999999]

TEST_CODE = '''cdist(a, b)'''  # SciPy baseline
timeit.repeat(setup=SETUP_CODE, stmt=TEST_CODE, repeat=3, number=10000)
# [0.18516300100000294, 0.16735686699999874, 0.18243102399999955]
```

Improved CPU implementation:

```python
import timeit

# Many rows, few columns: 100x2 vs. 200x2
SETUP_CODE = '''
import torch
from scipy.spatial.distance import cdist
a = torch.randn(100, 2)
b = torch.randn(200, 2)'''

TEST_CODE = '''torch.cdist(a, b)'''
timeit.repeat(setup=SETUP_CODE, stmt=TEST_CODE, repeat=3, number=10000)
# [0.5074237539993192, 0.4883652280004753, 0.48943868999958795]

TEST_CODE = '''cdist(a, b)'''  # SciPy baseline
timeit.repeat(setup=SETUP_CODE, stmt=TEST_CODE, repeat=3, number=10000)
# [0.6026854669999011, 0.6011174650011526, 0.5975362090011913]

# Few rows, many columns: 2x200 vs. 2x200
SETUP_CODE = '''
import torch
from scipy.spatial.distance import cdist
a = torch.randn(2, 200)
b = torch.randn(2, 200)'''

TEST_CODE = '''torch.cdist(a, b)'''
timeit.repeat(setup=SETUP_CODE, stmt=TEST_CODE, repeat=3, number=10000)
# [0.04526426500160596, 0.03518587399958051, 0.029068126999845845]

TEST_CODE = '''cdist(a, b)'''  # SciPy baseline
timeit.repeat(setup=SETUP_CODE, stmt=TEST_CODE, repeat=3, number=10000)
# [0.18631748199914, 0.17298765699888463, 0.17551641600039147]
```
Current GPU implementation:

```python
import timeit

# Many rows, few columns: 10000x9 vs. 30000x9
SETUP_CODE = '''
import torch
from gpytorch.kernels.kernel import Distance
dist = Distance()
a = torch.randn(10000, 9).cuda()
b = torch.randn(30000, 9).cuda()'''

# Reading D1[0, 0] forces a CUDA synchronization, so the kernel time
# is actually included in the measurement.
TEST_CODE = '''D1 = torch.cdist(a, b); print(D1[0, 0])'''
timeit.repeat(setup=SETUP_CODE, stmt=TEST_CODE, repeat=3, number=10)
# [47.1045588250272, 47.20872523600701, 47.47344793193042]

# Few rows, many columns: 9x10000 vs. 9x10000
SETUP_CODE = '''
import torch
from gpytorch.kernels.kernel import Distance
dist = Distance()
a = torch.randn(9, 10000).cuda()
b = torch.randn(9, 10000).cuda()'''

TEST_CODE = '''D1 = torch.cdist(a, b); print(D1[0, 0])'''
timeit.repeat(setup=SETUP_CODE, stmt=TEST_CODE, repeat=3, number=10)
# [0.009627211024053395, 0.007205849979072809, 0.007276762975379825]
```

Improved GPU implementation:

```python
import timeit

# Many rows, few columns: 10000x9 vs. 30000x9
SETUP_CODE = '''
import torch
from gpytorch.kernels.kernel import Distance
dist = Distance()
a = torch.randn(10000, 9).cuda()
b = torch.randn(30000, 9).cuda()'''

TEST_CODE = '''D1 = torch.cdist(a, b); print(D1[0, 0])'''
timeit.repeat(setup=SETUP_CODE, stmt=TEST_CODE, repeat=3, number=10)
# [7.402155780000612, 7.373479370959103, 7.379088997957297]

# Few rows, many columns: 9x10000 vs. 9x10000
SETUP_CODE = '''
import torch
from gpytorch.kernels.kernel import Distance
dist = Distance()
a = torch.randn(9, 10000).cuda()
b = torch.randn(9, 10000).cuda()'''

TEST_CODE = '''D1 = torch.cdist(a, b); print(D1[0, 0])'''
timeit.repeat(setup=SETUP_CODE, stmt=TEST_CODE, repeat=3, number=10)
# [0.010255609988234937, 0.00916548096574843, 0.008794950088486075]
```
facebook-github-bot left a comment:
@ifedan has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@pytorchbot retest this please
An intuitive explanation for why this is faster in those cases would be nice.
If I understand this correctly, it looks like you are doing two different things: (1) removing the division from the tight loop. I understand why (1) is a win, but not (2)...
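For readers following along, point (1) presumably refers to the usual strength-reduction trick: a loop-invariant division inside the inner loop is replaced by one up-front reciprocal and a multiply per iteration. A minimal, generic sketch of that idea (illustrative only, not the PR's actual C++ kernel):

```python
# Hypothetical "before": one division per element of the tight loop.
def scaled_sum_before(xs, scale):
    total = 0.0
    for x in xs:
        total += x / scale      # a divide executes on every iteration
    return total

# "After": hoist the loop-invariant division out of the loop.
def scaled_sum_after(xs, scale):
    inv = 1.0 / scale           # reciprocal computed once
    total = 0.0
    for x in xs:
        total += x * inv        # multiply is much cheaper than divide
    return total
```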
Each row of size M in the first tensor (a) is paired with each row of size M in the second tensor (b) to compute cdist. On CPU I vectorize the data and parallelize it as follows, then use a map-reduce to compute the corresponding result. On GPU the grid size is R1*R2 and the number of threads per block is 256; I use a map-reduce within warpSize to compute each result. There are two issues:
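To make the structure above concrete, here is a hedged reference sketch in plain PyTorch of the computation the kernels parallelize. The actual implementation is C++/CUDA; this only mirrors the shape of the work (a map over the R1*R2 row pairs and a reduce over the M coordinates), not its performance:

```python
import torch

def cdist_reference(a: torch.Tensor, b: torch.Tensor, p: float = 2.0) -> torch.Tensor:
    # a is R1 x M, b is R2 x M; every (i, j) row pair is independent.
    R1, M = a.shape
    R2, _ = b.shape
    out = torch.empty(R1, R2)
    # "Map" step: the R1 * R2 iterations of this double loop are what the
    # kernels distribute (parallel threads on CPU; one block per pair on GPU).
    for i in range(R1):
        for j in range(R2):
            # "Reduce" step: fold the M per-coordinate terms into a scalar
            # (done as a warp-level reduction in the CUDA kernel).
            out[i, j] = (a[i] - b[j]).abs().pow(p).sum().pow(1.0 / p)
    return out

# Sanity check against the built-in, using shapes from the benchmarks above.
a, b = torch.randn(100, 2), torch.randn(200, 2)
assert torch.allclose(cdist_reference(a, b), torch.cdist(a, b), atol=1e-4)
```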
facebook-github-bot left a comment:
@ifedan has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@pytorchbot retest this please
Summary: Fix based on pytorch/pytorch#15253
Pull Request resolved: pytorch/pytorch#20605
Differential Revision: D15396123
Pulled By: ifedan
fbshipit-source-id: 3ed373e68339a35360f083d4aad1b655abcaf97e



Fix based on #15253