optimize topk on cpu using parallel and partial sort #19736
Conversation
aten/src/ATen/native/Sorting.cpp
Outdated
It's unfortunate that we need to copy the data inside the TensorAccessor.
@ezyang do you know how hard it would be to make TensorAccessor work with std::sort and the like? Would that be something we would want to do at some point?
You can just get a raw data pointer if it's contiguous and sort in-place. You're unlikely to get std::sort working natively on strided stuff though.
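As a minimal sketch of that contiguous case (the function and variable names are illustrative, not from this PR): a raw pointer already satisfies the random-access-iterator requirements, so each row can be sorted in place without any copy.

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative only: sort each row of a contiguous [n_rows, row_size] buffer
// in place. Raw pointers are valid random-access iterators for std::sort.
void sort_contiguous_rows(float* data, int64_t n_rows, int64_t row_size) {
  for (int64_t row = 0; row < n_rows; ++row) {
    float* begin = data + row * row_size;
    std::sort(begin, begin + row_size);
  }
}
```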
Well, if TensorAccessor respects the following requirements, it would work:

Type requirements
- `RandomIt` must meet the requirements of ValueSwappable and LegacyRandomAccessIterator.
- The type of dereferenced `RandomIt` must meet the requirements of MoveAssignable and MoveConstructible.

My question is more like: how much effort would it be to make this happen, if possible?
Just like @ezyang said, in the case dim=-1 (the sorting dimension is contiguous), it is possible to create a std::vector directly on the raw pointer. But if not (non-contiguous), it is not going to work.
I measured the time cost of this copy; it is not that big compared to the sorting. The current topk also has a copy, since the sorting needs to swap the value/index pairs together.
Anyway, if TensorAccessor and std::vector had some sort of in-place conversion, it would be more convenient.
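As a rough sketch of the copy being discussed (a hypothetical helper, not the PR's actual code): gather one possibly strided row into a `std::vector` of (value, index) pairs, so the sort can move values and indices together, then keep only the top k with `std::partial_sort`.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Illustrative sketch of the copy + partial-sort pattern discussed above.
// `stride` makes it work for non-contiguous rows; contiguous rows use stride 1.
void topk_one_row(const float* row, int64_t stride, int64_t n, int64_t k,
                  float* out_values, int64_t* out_indices) {
  std::vector<std::pair<float, int64_t>> buf(n);
  for (int64_t i = 0; i < n; ++i) {
    buf[i] = {row[i * stride], i};  // copy the value together with its index
  }
  // Only the first k elements end up sorted (descending by value).
  std::partial_sort(buf.begin(), buf.begin() + k, buf.end(),
                    [](const auto& a, const auto& b) { return a.first > b.first; });
  for (int64_t i = 0; i < k; ++i) {
    out_values[i] = buf[i].first;
    out_indices[i] = buf[i].second;
  }
}
```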
While not conflicting with this PR, #18344 is relevant, as it optimizes kthvalue for float, and one could use kthvalue to implement topk.
Yes, this one does not conflict with #18344. In case the
@cpuhrsch I believe the kernels in faiss were (are?) the same as the ones in PyTorch. Indeed, @wickedfoo was the one who added support for topk on the GPU back in the Lua days, and I'm not sure if there are differences nowadays that would be good to port back.
The ones in faiss are significantly different from the ones in PyTorch, though jhj is the person who wrote both. The ones in faiss are also significantly more complex than what PyTorch needs.
aten/src/ATen/native/Sorting.cpp
Outdated
This looks very close to TensorIterators. In fact, dim_apply overall looks like it could be replaced by that. This would yield much cleaner and usually faster code. @VitalyFedyunin, please provide guidance.
I believe the reason why TensorIterator can't be used here is that it can't take a set of input values and return a single output value (which is the case in sort / topk). But it would be great to extend TensorIterator to support this use case.
@mingfeima - could you also provide benchmarks for varying numbers of threads? In particular single-core performance.
@cpuhrsch scaling result, after: This patch did parallelization only, so it won't help when using a single core. But in case you use more than 2 cores, it helps. In case we use parallelization, I was looking at
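For the parallel part, a rough sketch of what splitting the `N` rows across threads could look like, assuming ATen's `at::parallel_for` helper; the function name `topk_parallel` and the buffer layout are illustrative, not the PR's actual kernel.

```cpp
#include <ATen/Parallel.h>
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Illustrative sketch: each thread gets a chunk [begin, end) of rows and runs
// the sequential per-row top-k (copy + std::partial_sort) on its chunk.
void topk_parallel(const float* input, int64_t N, int64_t C, int64_t k,
                   float* out_values, int64_t* out_indices) {
  at::parallel_for(0, N, /*grain_size=*/1, [&](int64_t begin, int64_t end) {
    std::vector<std::pair<float, int64_t>> buf(C);  // per-thread scratch buffer
    for (int64_t row = begin; row < end; ++row) {
      const float* src = input + row * C;
      for (int64_t i = 0; i < C; ++i) {
        buf[i] = {src[i], i};
      }
      std::partial_sort(buf.begin(), buf.begin() + k, buf.end(),
                        [](const auto& a, const auto& b) { return a.first > b.first; });
      for (int64_t i = 0; i < k; ++i) {
        out_values[row * k + i] = buf[i].first;
        out_indices[row * k + i] = buf[i].second;
      }
    }
  });
}
```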
|
This thread seems to be about the CPU, but there was a question about the GPU. The GPU top-k code in (Lua) Torch, the top-k code in Caffe2, and the top-k code in Faiss are all different from one another; I wrote all three. The code in Torch supports arbitrary k (1 <= k <= n) via radix sort. It can be slow because it requires multiple passes over the input. The topk call in Torch needs to support arbitrary k, but it can have a specialization for small k (e.g., k <= 1024 or 2048). The code in Caffe2, I think, supports small k only, but I could be wrong on that. It contains code from a very early version of Faiss. The code in Faiss is fastest, but it only supports k <= 2048.
@mingfeima can you please run a before/after comparison for k = 10, 100, C/10, C/2, C-5?
I was assuming that k is not a very large number (in beam search it is
@mingfeima it would also be good to run it with a single thread, to assess the impact of performing a copy of the data into a vector to feed to the sorting routine. Here is another example where topk is used:

```python
objectness = torch.rand(2, 100000)
pre_nms_top_n = 2000
objectness, topk_idx = objectness.topk(pre_nms_top_n, dim=1, sorted=True)
```
Force-pushed from 1f225f3 to da1f9c8.
@VitalyFedyunin @fmassa Hi, all the proposed patterns have been tested to have more generic coverage:
All the details, including:
are listed in the gist; otherwise it's too much to show in this thread. To sum up a little bit, with the latest code in this PR,
Speedup from single-thread runs comes from
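Per the PR summary further down in this thread, the single-thread gain comes from swapping quick select for `std::partial_sort`. A minimal illustrative comparison of the two strategies on a plain `std::vector` (not the PR's actual kernels, which work on tensor rows and track indices): `std::nth_element` is a quick-select-style selection, while `std::partial_sort` additionally leaves the first k elements sorted.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Quick-select style: after nth_element the first k elements are the largest
// k values, but in unspecified order.
std::vector<float> topk_quick_select(std::vector<float> v, std::size_t k) {
  std::nth_element(v.begin(), v.begin() + k, v.end(), std::greater<float>());
  v.resize(k);
  return v;
}

// Partial sort: the first k elements are the largest k values, in descending
// order, which matches topk(..., sorted=True).
std::vector<float> topk_partial_sort(std::vector<float> v, std::size_t k) {
  std::partial_sort(v.begin(), v.begin() + k, v.end(), std::greater<float>());
  v.resize(k);
  return v;
}
```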
cc @VitalyFedyunin for TensorIterator. @mingfeima - could I ask you to rebase the PR so we can import it?
Writing
Build error C2446 with lambda conditional expression with MSVC https://stackoverflow.com/questions/27989031/msvc-error-when-using-capture-less-lambda-expressions-as-second-and-third-operan/27989291
change lambda use partial sort when k is of small range
Force-pushed from da1f9c8 to 0d459b7.
@VitalyFedyunin The CPU kernels have been moved to
Perf is slightly better compared with the last version,
facebook-github-bot left a comment:
@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: This PR aims at improving `topk()` performance on CPU. This is useful when computing **beam search** during `Transformer` and `BERT`.

Given a tensor x of size `[N, C]`, and we want to apply `x.topk(K)`, the current logic is to **sequentially** loop on the dimension of `N` and do **quick select** on the dimension of `C` so as to find out the top K elements.

Performance can be further improved from:
- On the dimension of `N`, it can be paralleled
- Maybe a faster sorting algorithm for `topk`. (After a bunch of experimenting, `std::partial_sort` seems to be the most promising)

So I compared 3 versions:
1. vanilla: sequential + quick select
2. reference PR pytorch/pytorch#19737: parallel + quick select
3. this PR: parallel + partial sort

with the following benchmark, on `Xeon 8180, 2*28 [email protected] GHz`:

```python
import torch
from time import time

num_iters = 1000

def bench_topk(N=8, C=168560, k=10):
    a = torch.randn(N, C)
    # warm up
    for i in range(100):
        torch.topk(a, k)

    t = 0
    for i in range(num_iters):
        a = torch.randn(N, C)
        start = time()
        value, indice = torch.topk(a, k)
        t += time() - start
    print("#[%d, %d] times: %f ms" % (N, C, t / num_iters * 1000))

Ns = [10, 20, 30]
Cs = [10000, 20000, 40000, 80000, 160000, 320000]
for n in Ns:
    for c in Cs:
        bench_topk(N=n, C=c)
```

### vanilla: sequential + quick select
```
#[10, 10000] times: 0.746740 ms
#[10, 20000] times: 1.437399 ms
#[10, 40000] times: 2.832455 ms
#[10, 80000] times: 5.649426 ms
#[10, 160000] times: 11.309466 ms
#[10, 320000] times: 22.798765 ms
#[20, 10000] times: 1.511303 ms
#[20, 20000] times: 2.822024 ms
#[20, 40000] times: 5.564770 ms
#[20, 80000] times: 11.443044 ms
#[20, 160000] times: 22.747731 ms
#[20, 320000] times: 46.234449 ms
#[30, 10000] times: 2.214045 ms
#[30, 20000] times: 4.236179 ms
#[30, 40000] times: 8.418577 ms
#[30, 80000] times: 17.067578 ms
#[30, 160000] times: 33.826214 ms
#[30, 320000] times: 68.109420 ms
```

### reference PR: parallel + quick select
```
#[10, 10000] times: 0.271649 ms
#[10, 20000] times: 0.593016 ms
#[10, 40000] times: 1.133518 ms
#[10, 80000] times: 2.082355 ms
#[10, 160000] times: 4.049928 ms
#[10, 320000] times: 7.321285 ms
#[20, 10000] times: 0.315255 ms
#[20, 20000] times: 0.539054 ms
#[20, 40000] times: 1.000675 ms
#[20, 80000] times: 1.914586 ms
#[20, 160000] times: 4.437122 ms
#[20, 320000] times: 8.822445 ms
#[30, 10000] times: 0.347209 ms
#[30, 20000] times: 0.589947 ms
#[30, 40000] times: 1.102814 ms
#[30, 80000] times: 2.112201 ms
#[30, 160000] times: 5.186837 ms
#[30, 320000] times: 10.523023 ms
```

### this PR: parallel + partial sort
```
#[10, 10000] times: 0.150284 ms
#[10, 20000] times: 0.220089 ms
#[10, 40000] times: 0.521875 ms
#[10, 80000] times: 0.965593 ms
#[10, 160000] times: 2.312356 ms
#[10, 320000] times: 4.759422 ms
#[20, 10000] times: 0.167630 ms
#[20, 20000] times: 0.265607 ms
#[20, 40000] times: 0.471477 ms
#[20, 80000] times: 0.974572 ms
#[20, 160000] times: 3.269645 ms
#[20, 320000] times: 6.538608 ms
#[30, 10000] times: 0.204976 ms
#[30, 20000] times: 0.342833 ms
#[30, 40000] times: 0.589381 ms
#[30, 80000] times: 1.398579 ms
#[30, 160000] times: 3.904077 ms
#[30, 320000] times: 9.681224 ms
```

In summary, `2` is **5x** faster than `vanilla` on average and `3` is **8.6x** faster than `vanilla`.

On `Fairseq Transformer`, the default parameter on dataset `wmt14` would have a `topk` size of `[8, 168560]`, and this operator gets `3x` faster with this PR.

Pull Request resolved: pytorch/pytorch#19736
Differential Revision: D16204820
Pulled By: VitalyFedyunin
fbshipit-source-id: ea70562c9149a0d832cf5872a891042ebd74fc63
@VitalyFedyunin merged this pull request in 10c14ad.
…) (#22865)
Summary: pytorch/pytorch#19736 was reverted as it was suspected to be broken on master; trying to reapply.
Pull Request resolved: pytorch/pytorch#22865
Differential Revision: D16265457
Pulled By: VitalyFedyunin
fbshipit-source-id: 784bd6405471f15a8a49ebd0f3e98160d7d0679e