GH-41813: [C++] Fix avx2 gather offset larger than 2GB in CompareColumnsToRows#42188
GH-41813: [C++] Fix avx2 gather offset larger than 2GB in CompareColumnsToRows#42188pitrou merged 24 commits intoapache:mainfrom
CompareColumnsToRows#42188Conversation
|
|
|
Hi @zanmato1984, thanks for your work on this. I'm hoping others can review the implementation but I did just check that the new test passes (it does) and also fixes the original issue (it does). 👍 |
Thank you @amoeba for verifying, and the help on reproducing the issue! |
|
I can't give much feedback or test this out, unfortunately. But I'm very thankful for you all looking into this! |
|
Ran into this issue when I was debugging my own issue where running a group_by/aggregate on a table with null columns was failing to group some keys, i.e. some group key value tuples were duplicated in the result. Why I mention it here:
However, the table I use is only 3.8MB - so perhaps there is some other bug around the AVX2 related code here as well, unrelated to the size, but related to nulls.
Repro case: Output without AVX2 (expected): Output with AVX2 (not expected): Some observations:
Let me know if you think I should open a new ticket for this. |
|
Hi @FreekPaans, can you please open a new issue for that? I think the issue will be fixed in the upcoming 17.x PyArrow release but it'd be good to make sure. |
|
Thanks. I'll test and follow up over on #42231. |
|
@pitrou @felipecrv @ZhangHuiGui @mapleFU Would you please help to take a look? Thanks. |
| irow_right = | ||
| _mm256_loadu_si256(reinterpret_cast<const __m256i*>(left_to_right_map) + i); | ||
| } | ||
| // TODO: Need to test if this gather is OK when irow_right is larger than |
There was a problem hiding this comment.
I'll test in the future.
There was a problem hiding this comment.
When you say "in the future", is it in this PR or another one?
There was a problem hiding this comment.
Oh sorry, I meant in another PR.
| irow_right = | ||
| _mm256_loadu_si256(reinterpret_cast<const __m256i*>(left_to_right_map) + i); | ||
| } | ||
| // TODO: Need to test if this gather is OK when irow_right is larger than |
There was a problem hiding this comment.
When you say "in the future", is it in this PR or another one?
| /// `0x80000000` lower. This way, the offset is always in range of [-2G, 2G) and those | ||
| /// intrinsics are safe. | ||
|
|
||
| constexpr auto two_gb = 0x80000000ull; |
There was a problem hiding this comment.
Can we make sure we use an explicit width type here? I'm not even sure what it is expected to be for correctness of the code using this constant (uint32_t or uint64_t?)
There was a problem hiding this comment.
Both uint32_t and uint64_t are OK. It only has to be unsigned and wide enough for 0x80000000. I'm declaring it uint64_t (the ull suffix) just to make all the arithmetics to be promoted to 64b to not worry about the potential underflow. The two subsequent usages are:
- Being added to pointer
baseafter divided by a specificsizeof(). The division is unsigned so the addition is addressing thebase"forward", as expected. - Being loaded to a signed
__m256iregister via an implicit static cast (after divided byscale).
I'll update to make it, and the usages, more more type and width explicit.
| } | ||
|
|
||
| template <int scale> | ||
| inline __m256i UnsignedOffsetSafeGather64(arrow::util::int64_for_gather_t const* base, |
There was a problem hiding this comment.
What is the use of int64_for_gather_t exactly?
There was a problem hiding this comment.
|
|
||
| constexpr auto two_gb = 0x80000000ull; | ||
|
|
||
| template <int scale> |
There was a problem hiding this comment.
Two things:
- if we're using unsigned arithmetic below, the scale type should probably be unsigned for readability and sanity?
- naming convention: can we make this
kScale?
There was a problem hiding this comment.
The type of the third formal parameter ofYeah, that's probably good._mm256_set1_epi32/64isintso I'm just usinginttoo.- Yeah, will do.
|
|
||
| namespace { | ||
|
|
||
| /// Intrinsics `_mm256_i32gather_epi32/64` treat the `vindex` as signed integer, and we |
There was a problem hiding this comment.
Can you use regular comments (//)? This isn't a docstring so shouldn't use the docstring-specific prefix (///)
| // number of rows. | ||
| constexpr int64_t num_rows = std::numeric_limits<uint16_t>::max() + 1; | ||
| const std::vector<std::shared_ptr<DataType>> fixed_length_types{uint64(), uint32()}; | ||
| // The var length column should be a little smaller than 2GB to WAR the capacity |
There was a problem hiding this comment.
Sorry, I meant "workaround". Will update.
|
|
||
| // Compare columns to rows at offsets over 2GB within a row table. | ||
| // Certain AVX2 instructions may behave unexpectedly causing troubles like GH-41813. | ||
| TEST(KeyCompare, CompareColumnsToRowsLarge) { |
There was a problem hiding this comment.
What is the runtime of this test? Perhaps we need to disable it on Valgrind builds.
There was a problem hiding this comment.
What do you mean by "runtime"? I can't think of a reason why Valgrind would complain (at least ASAN didn't).
There was a problem hiding this comment.
Sorry, I meant "run time" or execution time :-)
There was a problem hiding this comment.
Ah, got it! It takes about 20s with ASAN enabled. Perhaps it will be fine with Valgrind too?
There was a problem hiding this comment.
Ok, it takes 70s locally under Valgrind. That's a bit high for a single test, I would rather disable it under Valgrind.
There was a problem hiding this comment.
OK. Updated to disable the test under Valgrind. Thanks for helping running in your local!
| ASSERT_OK(row_encoder.EncodeSelected(&row_table, static_cast<uint32_t>(num_rows), | ||
| row_ids_right.data())); | ||
|
|
||
| ASSERT_TRUE(row_table.offsets()); |
There was a problem hiding this comment.
I'm not sure what's that supposed to check (offsets being "true"?). Do we want to make the test a bit more self-documenting, or perhaps add a comment?
There was a problem hiding this comment.
This is asserting the address of row_table.offsets() is not null, like if (some_pointer). Perhaps I can refine it to ASSERT_NE(row_table.offsets(), NULLPTR).
And the point of this check is to make sure the row_table constructed has an internal offset buffer, i.e., it contains var length columns.
There was a problem hiding this comment.
Yes, the ASSERT_NE suggestion would make this more easily understandable, thanks!
| base + kTwoGB / sizeof(arrow::util::int64_for_gather_t); | ||
| __m128i normalized_offset = | ||
| _mm_sub_epi32(offset, _mm_set1_epi32(static_cast<int>(kTwoGB / kScale))); | ||
| return _mm256_i32gather_epi64(normalized_base, normalized_offset, |
There was a problem hiding this comment.
I have a question about instructions.
Why is the vindex parameter type of _mm256_i32gather_epi32 is _m256i and the vindex type of _mm256_i32gather_epi64 is _m128i?
This may not be related to PR, I just want to understand it🫡
There was a problem hiding this comment.
Both intrinsics gather "several" integers based on a base address and "several" 32b offsets (vindex), and stores the results into a 256b register. The difference is: _mm256_i32gather_epi32 gathers 8 32b-integers (8 * 32 = 256) at a time so 8 32b indices are used, hence the 256b vindex. Whereas _mm256_i32gather_epi64 gathers 4 64b-integers at a time so 4 32b indices are used, hence the 128b vindex.
|
I've committed two changes containing code restructures (moving, renaming, etc.) and a minor fix to the test, to make the test logic more clear and readable. Hope it doesn't trouble your review @pitrou . Thanks. |
|
Thanks a lot @zanmato1984 ! |
|
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit e635cc2. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 7 possible false positives for unstable benchmarks that are known to sometimes produce them. |
…ColumnsToRows` (#43065) ### Rationale for this change See #43046. ### What changes are included in this PR? Use unsigned offset safe gather introduced in #42188 which is to fix similar issues. ### Are these changes tested? Yes. ### Are there any user-facing changes? None. * GitHub Issue: #43046 Lead-authored-by: Ruoxi Sun <[email protected]> Co-authored-by: Rossi Sun <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
Rationale for this change
AVX2 intrinsics
_mm256_i32gather_epi32/_mm256_i32gather_epi64are used inCompareColumnsToRowsAPI, and treat thevindexas signed integer. In our row table implementation, we useuint32_tto represent the offset within the row table. When a offset is larger than (0x80000000, or2GB), the aforementioned intrinsics will treat it as negative offset and gather the data from undesired address. More details please see #41813 (comment).Considering there is no unsigned-32bit-offset or 64bit-offset counterparts of those intrinsics in AVX2, this issue can be simply mitigated by translating the base address and the offset:
What changes are included in this PR?
Fix and UT that reproduces the issue.
Are these changes tested?
UT included.
Are there any user-facing changes?
None.