[WIP] Vectorized partition function for floats on avx2 machines. #18344
Conversation
Speeds up kthvalue on cpu; can be used for topk and sort in the future.
    static const uint32_t __attribute__((aligned(256))) t[256*8];
    };

    const uint32_t __attribute__((aligned(256))) Table<float>::t[] = {
A comment explaining how this has been computed and what it is would be very useful
Good point, I have added a comment
The comment says what the table is (good!) but it doesn't say how to compute the table (you probably ran a Python script or something to generate the table, right? Paste it in here.)
I'd rather see something like this in vec256 directly, since it's very specific to 256bit vectors.
    using at::vec256::Vec256;
    using at::vec256::permute;
    using at::vec256::int_same_size_t;
Somewhere in this file we should quote the paper where this algorithm comes from.
Ok, updated to cite that paper at the top of the file.
@ezyang are you able to review this anytime soon? If not it's no problem, will add others.

I haven't looked at the code or algorithm, but the test failures look legit. Apparently there is a case where it's not working as expected. The other question I have is whether this falls back as expected when AVX isn't available at run time (I think the code in cpu/ is usually only called when it is).
    DEFINE_COMP(<)
    #undef DEFINE_COMP

    c10::guts::enable_if_t<size() <= 32, int32_t> msb_mask() const {
It would be nice to have a comment here just briefly saying what this does.
    };

    template <class T> inline Vec256<T> permute(const Vec256<T>& src,
Same here.
    values = _mm256_setr_epi32(val1, val2, val3, val4, val5, val6, val7, val8);
    }
    explicit operator __m256() const {
      return (__m256)values;
Nit: use static_cast here
    template <>
    Vec256<int32_t> permute(const Vec256<int32_t>& src, const Vec256<int32_t>& indices) {
      return (__m256i)_mm256_permutevar8x32_ps((__m256)src, indices);
and here
    #pragma once
    #include <ATen/NumericUtils.h>
    #include <stdint.h>
    #include <aten/src/ATen/cpu/vec256/vec256.h>
Ya, I desperately need comments in this file.
    *mode_value = tmp_values[k - 1];
    *mode_index = tmp_indices[k - 1];
    });
    bool can_use_small_index = self.size(dim) <= std::numeric_limits<int_same_size_t<scalar_t>>::max();
I'm a little confused by the small index logic here (in particular, I don't see why index size should have anything to do with the size of the scalar type; that seems really weird to me!) An explanatory note would help here.
    template <typename scalar_t, typename index_t, typename ValFn, typename IdxFn>
    void qsel_with_indices(const Tensor& self, Tensor& values, Tensor& indices, ValFn finalize_vals, IdxFn finalize_idxs, index_t k, bool largest, int64_t dim) {
Once again, it would be super helpful to know what finalize_vals and finalize_idxs are supposed to do.
It doesn't look like largest is ever passed as false. What's going on?
    using index_t = int_same_size_t<scalar_t>;
    qsel_with_indices<scalar_t, index_t>(self, values, indices,
      [k](const scalar_t *vals, const Tensor& val_slice) {
        *val_slice.data<scalar_t>() = vals[k - 1];
data<>() is a checked pointer access, which I think is inappropriate here because I believe you're calling this in a loop. However, I don't see an unsafe_data member on Tensor; I guess we should add one.
It is a little shocking that you're passing an entire honking Tensor reference into this function; that sounds like a bunch of unnecessary allocations, especially when you just want to poke some memory at one location.
    [&](int64_t i, TensorList tl) {
      auto self = tl[0].accessor<scalar_t, 1>();
      auto sz = self.size(0);
      std::unique_ptr<index_t[]> tmp_indices_(new index_t[sz]);
Err, why not just use a std::vector here
    index_t R,
    scalar_t piv,
    bool largest,
    int) -> decltype(vec_qs_partition_inplace((scalar_t *)(nullptr), (scalar_t *)(nullptr), (index_t *)(nullptr), scalar_t(), bool()), index_t()){
Ummmm... this is just an int32_t, right? Can we just say that?
Oh, are these shenanigans because the type may change based on index_t? But I am still not sure why the return type cannot be straightforwardly written in terms of index_t...
    }

    template <typename scalar_t, typename index_t>
    auto partition(
I'd love a comment here saying what the contract between L, R, piv and the output values are, at the very least.
    // Re-set active partition
    if (j <= k)
      L = i;
    L = j;
What happened here? (I'm sure it's right, but I found it surprising!)
This PR is not complete because it needs to use the registration mechanism to decide whether to use the AVX2 kernel. That likely explains the test failures; I don't see any use of our existing dispatch mechanisms for AVX2 kernels.
    namespace at {
    namespace native {
    namespace {
    // fallback to scalar when we don't have avx2
Yeah, so this says it falls back when we don't have avx2, but you're doing this as a compile time test, which is not sufficient. It needs to be a runtime test.
Grep for REGISTER_DISPATCH to see how to do this properly.
Actually, I suspect you may have to do more work. The existing code for handling machines that may or may not be able to vectorize does this purely by recompiling Vec256 multiple times under different AVX settings. But in your case you need something a little different: you don't want to run the algorithm above at all when you don't have sufficient AVX support; you want to do something completely different. I'm not exactly sure how to spell that presently. cc @colesbury @cpuhrsch
You can guard using __AVX__ macros
Err, really? I thought __AVX__ says whether or not the compiler has AVX support. But that's not what we need to control over here; we need to control whether or not the runtime has AVX support.
    };

    template <typename T, bool largest>
    int_same_size_t<T> nan_partition(T *begin, T *end,
So, uh, what exactly does this do? Just partition away the NaNs? Or something else? A comment would be worth a lot here!
Oh, it looks like this is what you call when the pivot is a NaN. That's all I needed in the comment :)
OOC: which test exercises this case?
    return _mm256_cmp_ps(values, other.values, _CMP_GE_OQ);
    }

    Vec256<float> not_leq(const Vec256<float>& other) const {
What is "not less than or equal to"?
    int_same_size_t<T> *indices_left_out = indices;
    int_same_size_t<T> *indices_right_out = indices + sz;
    while (true) {
Highly not a fan of 'while true'
Can't some of this stuff be done using TensorIterator reductions?
    matches_jit_signature: True
    python_module: nn
of course you'll delete this
    @@ -0,0 +1,572 @@
    // AVX2 optimized partition algorithm
    // Based on the following paper:
    // Shay Gueron, Vlad Krasnov, Fast Quicksort Implementation Using AVX Instructions, The Computer Journal, Volume 59, Issue 1, January 2016, Pages 83–90, https://doi.org/10.1093/comjnl/bxv063
General comment: I have a hard time telling what
    int_same_size_t<T> nan_partition(T *begin, T *end,
                                     int_same_size_t<T> *indices) {
    // We consider NaNs to be larger than any element.
    if (!largest) {
I'm not sure what's going on here.
I think the problem is that what I think of as the logical sense for
It seems unlikely that this is ever going to land. Closing the PR.
The idea is to compare 8 elements at a time to the pivot, then reshuffle them within a 256-bit register by using a lookup table according to the comparison, and then write them back into the array at the required positions. (I got most of the ideas from here and then modified the algorithm a bit to be in-place.)
Check the comments in qs_partition_inplace of QuickPartition.cpp for more details.

This can be easily extended to doubles, but the speedup might not be as great because you'll only sort 4 elements at a time.
Future possibilities:
Timings: