CUDA sparse operations (+ CPU improvements) #1147
Conversation
A word of caution: be very careful with cuSPARSE, because some of its routines call cudaFree (grrrr!), which not only has performance implications, but will deadlock on multi-GPU runs with NCCL.
adamlerer left a comment:
This PR is super impressive, thank you!
General point: I think the main use case is for LookupTable, right? My benchmarks suggest that while this is much faster than dense, we still have a ways to go to match legacy LookupTable performance. I think we need to copy over the LookupTable kernels to handle the corresponding sparse ops, so we can match the perf.
$ pt benchmark.py 0
cuda=False
Old done in 0.169607 s
New done in 1.51252 s
New done in 0.791161 s (no optimizer)
$ pt benchmark.py 1
cuda=True
Old done in 0.127513 s
New done in 3.06416 s
New done in 2.91579 s (no optimizer)
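For comparison against today's API, here is a minimal sketch of this kind of embedding benchmark. The sizes below are made up for illustration, and `nn.Embedding(..., sparse=True)` is the modern spelling of the sparse-gradient path, not the code benchmarked above:

```python
import time
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration.
vocab, dim, batch, steps = 100000, 128, 64, 100

def run(sparse, device):
    emb = nn.Embedding(vocab, dim, sparse=sparse).to(device)
    opt = torch.optim.SGD(emb.parameters(), lr=0.1)
    idx = torch.randint(vocab, (batch,), device=device)
    start = time.time()
    for _ in range(steps):
        opt.zero_grad()
        emb(idx).sum().backward()  # sparse=True yields a sparse weight.grad
        opt.step()
    if device == "cuda":
        torch.cuda.synchronize()
    return time.time() - start

for device in ["cpu"] + (["cuda"] if torch.cuda.is_available() else []):
    print(device, "dense: %.4fs" % run(False, device),
          "sparse: %.4fs" % run(True, device))
```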
@martinraison has addressed all my comments for merge - I will try to speed up the GPU sparse operations necessary for lookup table this week, but that doesn't block merging this.
I'm still unsure about the meaning and handling of contiguous tensors. In dense libs, we never make the input contiguous for the user, because it could break sharing, and it makes for a much less magical API. I'm also worried that someone will be adding non-linear ops and forget that non-contiguous means not only that the indices aren't ordered, but that they're duplicated too. These bugs will be hard to catch, because it's unlikely that anyone will test with such inputs. Is it true that right now we don't do any …
@fmassa also noticed it is surprising that …
I believe making indices unique is an O(N log N) operation anyway, so if we're doing that we might as well make the tensor contiguous. However, I think we'd be wasting a lot of computation if we forced all sparse tensors to always be contiguous. For example, it would mean spending O(N log N) time on every single sparse gradient accumulation, rather than doing a single O(N log N) pass when applying the gradient update.
In the case of Adagrad, which is non-linear, we have to call …
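To make that tradeoff concrete, here is a minimal sketch against today's API, where the operation this thread calls contiguous is now named `coalesce`; it assumes nothing beyond stock torch:

```python
import torch

# Two sparse gradients touching overlapping rows of a (10 x 3) weight.
g1 = torch.sparse_coo_tensor(torch.tensor([[0, 2]]), torch.ones(2, 3), (10, 3))
g2 = torch.sparse_coo_tensor(torch.tensor([[2, 5]]), torch.ones(2, 3), (10, 3))

# Accumulation is cheap: indices and values are just concatenated,
# so row 2 now appears twice with duplicate entries.
g = g1 + g2
print(g.is_coalesced())  # False

# A single O(N log N) coalesce at update time sorts the indices and
# sums the duplicates.
g = g.coalesce()
print(g.indices())  # tensor([[0, 2, 5]])
print(g.values())   # the two row-2 contributions are summed
```

For a non-linear update like Adagrad (which squares the values), the duplicates must be summed before the update is applied, which is why the coalesce cannot be skipped there.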
I dug into speeding up sparse for nn.Embedding based on my benchmark on GPU (https://gist.github.com/adamlerer/865c1a09000c7dc8208e1456255209c2). I think my findings apply more broadly though.
@adamlerer, a note about cudnn being non-deterministic …
@apaszke re: contiguous, let's think about exactly what we mean by "weight sharing". I think you mean something like … Except that this already doesn't work for sparse tensors! There's no way (at least currently) to have a view on a sparse tensor. So I don't think this matters. I'd like to think of sparse tensor reordering (which we really should give a different name than contiguous) as an implementation detail.
@martinraison yes, I agree we should have a deterministic variant. But notice that this extends to indexAdd as well: currently LookupTable is deterministic and indexAdd is non-deterministic. So I think we should have a deterministic and a non-deterministic variant of indexAdd, and LookupTable / spcAdd should just delegate to that.
I benchmarked autograd overhead like this …, with a small batch size like 10. It takes ~700us / batch, and CPU sits at 100%.
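A sketch of that split in today's API (`index_add_` is the descendant of indexAdd; the deterministic variant below is an illustrative workaround built on coalesce, not an API that existed in this PR):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
index = torch.tensor([1, 1, 3], device=device)  # note the duplicate index
src = torch.randn(3, 2, device=device)

# Non-deterministic path: on CUDA, duplicate indices are combined with
# atomicAdd, so the floating-point summation order can vary across runs.
dst = torch.zeros(4, 2, device=device)
dst.index_add_(0, index, src)

# Deterministic sketch: coalesce sorts the indices and sums duplicates in
# a fixed order, leaving a single non-overlapping update to apply.
det = torch.zeros(4, 2, device=device)
upd = torch.sparse_coo_tensor(index.unsqueeze(0), src, det.shape).coalesce()
det[upd.indices()[0]] += upd.values()
```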
@adamlerer so do you propose that only the reordering should be an implementation detail, or would you like to make sorting an internal thing too? The ordering isn't visible externally, is it?
This is now rebased to fewer commits and merged into master.
This is a follow-up to #735. The goal was to add a similar level of support for CUDA sparse tensors. This includes:
- sparse_mask
- to_dense
- contiguous
- transpose
- spaddmm
- spcadd
- mul, div
- cadd, csub, cmul
- an hspmm operation that multiplies a sparse matrix with a dense matrix and returns a matrix in the form of a hybrid tensor (i.e. 1 sparse dimension, 1 dense dimension); this seems to be the most natural output format for this problem, since the output is a collection of non-zero rows (see the sketch after this list)
- updating the Embedding layer to work with CUDA and use efficient sparse operations
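`torch.hspmm` still exists under that name in today's API; a minimal usage sketch of the hybrid output format:

```python
import torch

# Sparse (3 x 4) matrix with non-zeros only in rows 0 and 2.
i = torch.tensor([[0, 2], [1, 3]])
v = torch.tensor([1.0, 2.0])
mat1 = torch.sparse_coo_tensor(i, v, (3, 4))
mat2 = torch.randn(4, 5)  # dense

out = torch.hspmm(mat1, mat2)
print(out.sparse_dim(), out.dense_dim())  # 1 1: one sparse dim, one dense dim
print(out.to_dense())  # equals mat1.to_dense() @ mat2; only rows 0 and 2 non-zero
```

Keeping only the touched rows, each with a dense payload, is what makes this shape a good fit for embedding-style workloads, where a step touches a handful of rows out of a large vocabulary.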
Bonus:
- a faster contiguous for CPU sparse tensors (the previous implementation was using insertion sort)
- cadd for CPU sparse tensors using BLAS
- abs for ShortTensor; for some reason this was missing, and I needed it for the testing code

I hacked this quick thing together just to get a sense of the speed improvement: https://gist.github.com/martinraison/1e7c18c6f6eda87f1cb4995b0e6a22a5
With the default params I get: …
This shouldn't be considered a real benchmark, though: it measures the time for a complete training iteration, including all the Python processing and the forward/backward passes through tanh/linear/cross-entropy.