VecDeque's iterator could be optimized #30805
Comments
@dotdash The two for loops over slices in I've run into the same in the
Could you clarify which SSE instruction set was used to get that 20x speedup? On Even on an old x86 CPU, using
My test platform is x86-64, Sandy Bridge, with default settings, so only the default instruction set is used. It's using
f460: f3 0f 6f 52 90 movdqu xmm2,XMMWORD PTR [rdx-0x70]
f465: f3 0f 6f 5a a0 movdqu xmm3,XMMWORD PTR [rdx-0x60]
f46a: f3 0f 6f 62 b0 movdqu xmm4,XMMWORD PTR [rdx-0x50]
f46f: f3 0f 6f 6a c0 movdqu xmm5,XMMWORD PTR [rdx-0x40]
f474: 66 0f fe d0 paddd xmm2,xmm0
f478: 66 0f fe d9 paddd xmm3,xmm1
f47c: 66 0f fe d4 paddd xmm2,xmm4
f480: 66 0f fe dd paddd xmm3,xmm5
f484: f3 0f 6f 62 d0 movdqu xmm4,XMMWORD PTR [rdx-0x30]
f489: f3 0f 6f 6a e0 movdqu xmm5,XMMWORD PTR [rdx-0x20]
f48e: 66 0f fe e2 paddd xmm4,xmm2
f492: 66 0f fe eb paddd xmm5,xmm3
f496: f3 0f 6f 42 f0 movdqu xmm0,XMMWORD PTR [rdx-0x10]
f49b: f3 0f 6f 0a movdqu xmm1,XMMWORD PTR [rdx]
f49f: 66 0f fe c4 paddd xmm0,xmm4
f4a3: 66 0f fe cd paddd xmm1,xmm5
f4a7: 48 83 ea 80 sub rdx,0xffffffffffffff80
f4ab: 48 83 c3 e0 add rbx,0xffffffffffffffe0
f4af: 75 af jne f460 <_ZN11sum_deque_220hbee41beab497848aadaE+0x2b0>
using
Thanks, just the default Do you think it would be useful to create an The LLVM ARM guy is interested in some more ARM data points; care to chime in, @emoon, @warricksothr?
Your issue is only tangential to this issue, but you should make a more reduced test case if you only want to look at the loop's codegen. Here's what I'd look at, just the summation, which produces a big blob of autovectorized code on x86-64. https://play.rust-lang.org/?gist=8b542d1b17b8d69e44b7&version=nightly
Thanks @bluss and sorry for the diversion. The fact that unrolling is disabled on ARM by default was an interesting discovery. Using the
Triage: No change using the
VecDeque's iterator seems to never allow LLVM to vectorize loops; this could be improved. Even an enum that uses the slice iterator for the contiguous case, with a fallback iterator otherwise, would allow vectorization for the contiguous case.
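To make the enum idea concrete, here is a minimal sketch of what such an iterator could look like from outside the standard library. The names `DequeIter` and `deque_iter` are hypothetical, and the fallback here is simply a chain of the two slice iterators rather than VecDeque's real index-based iterator. Whether LLVM actually vectorizes a loop over this depends on the compiler version and the loop body; nothing here was measured.

```rust
use std::collections::VecDeque;
use std::iter::Chain;
use std::slice;

/// Hypothetical two-variant iterator: a plain slice iterator when the
/// deque's contents are contiguous, and a fallback otherwise.
enum DequeIter<'a, T> {
    Contiguous(slice::Iter<'a, T>),
    // Assumption: chaining the two halves stands in for the "fallback iterator".
    Split(Chain<slice::Iter<'a, T>, slice::Iter<'a, T>>),
}

impl<'a, T> Iterator for DequeIter<'a, T> {
    type Item = &'a T;

    fn next(&mut self) -> Option<&'a T> {
        match self {
            DequeIter::Contiguous(it) => it.next(),
            DequeIter::Split(it) => it.next(),
        }
    }

    // Forward `fold` so that adapters such as `sum` dispatch on the variant
    // once, instead of once per element.
    fn fold<B, F>(self, init: B, f: F) -> B
    where
        F: FnMut(B, Self::Item) -> B,
    {
        match self {
            DequeIter::Contiguous(it) => it.fold(init, f),
            DequeIter::Split(it) => it.fold(init, f),
        }
    }
}

/// Picks the variant based on whether the second slice is empty.
fn deque_iter<T>(deque: &VecDeque<T>) -> DequeIter<'_, T> {
    let (front, back) = deque.as_slices();
    if back.is_empty() {
        DequeIter::Contiguous(front.iter())
    } else {
        DequeIter::Split(front.iter().chain(back.iter()))
    }
}
```

With this in place, `deque_iter(&d).sum::<i32>()` goes through the forwarded `fold`, so the contiguous case reduces to an ordinary slice summation.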
The following simple benchmark shows the massive difference between the deque iterator and using its two slice parts. The results are the same whether the deque is discontinuous or not. The summation in sum_deque_2 is an order of magnitude faster, because LLVM can autovectorize it.
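The benchmark code itself did not survive in this copy of the issue. As a rough reconstruction of the comparison it describes (not the original code), the two summation styles might look like the sketch below. The function names `sum_deque_iter`, `sum_deque_slices`, and `maybe_wrapped_deque` are hypothetical, and the `i32` element type is an assumption based on the `paddd` instructions in the disassembly.

```rust
use std::collections::VecDeque;

/// Sums by walking the deque's own iterator (the slow path in the issue).
fn sum_deque_iter(deque: &VecDeque<i32>) -> i32 {
    let mut sum = 0i32;
    for &x in deque.iter() {
        sum = sum.wrapping_add(x);
    }
    sum
}

/// Sums the two slice parts returned by `as_slices` with plain slice loops,
/// which is the shape the issue reports as autovectorizable.
fn sum_deque_slices(deque: &VecDeque<i32>) -> i32 {
    let (front, back) = deque.as_slices();
    let mut sum = 0i32;
    for &x in front {
        sum = sum.wrapping_add(x);
    }
    for &x in back {
        sum = sum.wrapping_add(x);
    }
    sum
}

/// Builds a deque holding 0..n by alternating pushes at both ends.
/// This may leave the storage split across two slices, but the layout is an
/// implementation detail and is not guaranteed.
fn maybe_wrapped_deque(n: i32) -> VecDeque<i32> {
    let mut d = VecDeque::new();
    for i in 0..n {
        if i % 2 == 0 {
            d.push_back(i);
        } else {
            d.push_front(i);
        }
    }
    d
}
```

Both functions return the same value for any deque; only the generated code differs. To compare them, time each over a large deque in a release build and inspect the disassembly, as the comments above do.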