VecDeque's iterator could be optimized #30805

Open
bluss opened this issue Jan 10, 2016 · 7 comments
Labels
A-collections Area: `std::collections`
C-enhancement Category: An issue proposing an enhancement or a PR with one.
I-slow Issue: Problems and improvements with respect to performance of generated code.
T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@bluss
Member

bluss commented Jan 10, 2016

VecDeque's iterator never seems to allow llvm to vectorize loops; this could be improved.

Even an enum that uses the slice iterator for the contiguous case, with a fallback iterator otherwise, would allow vectorization in the contiguous case (a minimal sketch follows).
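
A minimal sketch of that enum idea (hypothetical names, not the actual std implementation). The match in next() still runs per element, so the clearest path to a vectorizable loop is internal iteration, where fold hands each half to the slice iterator wholesale:

use std::slice;

// Hypothetical iterator: either one contiguous slice, or the two halves
// of a wrapped ring buffer.
enum DequeIter<'a, T> {
    Contiguous(slice::Iter<'a, T>),
    Split(slice::Iter<'a, T>, slice::Iter<'a, T>),
}

impl<'a, T> Iterator for DequeIter<'a, T> {
    type Item = &'a T;

    fn next(&mut self) -> Option<&'a T> {
        match self {
            DequeIter::Contiguous(it) => it.next(),
            DequeIter::Split(head, tail) => head.next().or_else(|| tail.next()),
        }
    }

    // Internal iteration delegates to the slice iterators wholesale,
    // so each half is a plain slice loop that llvm can autovectorize.
    fn fold<B, F>(self, init: B, mut f: F) -> B
    where
        Self: Sized,
        F: FnMut(B, Self::Item) -> B,
    {
        match self {
            DequeIter::Contiguous(it) => it.fold(init, f),
            DequeIter::Split(head, tail) => {
                let acc = head.fold(init, &mut f);
                tail.fold(acc, f)
            }
        }
    }
}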

The following simple benchmark shows the massive difference between the deque iterator and using its two slice parts. The results are the same whether the deque is discontinuous or not. The summation in sum_deque_2 is an order of magnitude faster, because llvm can autovectorize it.

/* Results
test sum_deque   ... bench:       1,301 ns/iter (+/- 111)
test sum_deque_2 ... bench:          71 ns/iter (+/- 2)
*/

#![feature(test)]
extern crate test;

use test::Bencher;
use test::black_box;

use std::collections::VecDeque;


#[bench]
fn sum_deque(b: &mut Bencher) {
    let mut dq: VecDeque<_> = (0..1024).collect();
    // Uncomment to rotate the deque so its buffer becomes discontiguous;
    // the measured results are the same either way.
    /*
    for _ in 0..128 {
        let elt = dq.pop_back().unwrap();
        dq.push_front(elt);
    }
    */
    let dq = black_box(dq);

    b.iter(|| {
        let mut sum = 0;
        for elt in &dq {
            sum += *elt;
        }
        sum
    })
}


#[bench]
fn sum_deque_2(b: &mut Bencher) {
    let mut dq: VecDeque<_> = (0..1024).collect();
    // Uncomment to rotate the deque so its buffer becomes discontiguous;
    // the measured results are the same either way.
    /*
    for _ in 0..128 {
        let elt = dq.pop_back().unwrap();
        dq.push_front(elt);
    }
    */
    let dq = black_box(dq);

    b.iter(|| {
        let mut sum = 0;
        let (a, b) = dq.as_slices();
        for elt in a {
            sum += *elt;
        }
        for elt in b {
            sum += *elt;
        }
        sum
    })
}
@bluss added the A-collections and I-slow labels Jan 10, 2016
@bluss
Member Author

bluss commented Jan 10, 2016

@dotdash The two for loops over slices in sum_deque_2 are very efficient, but I've never managed to build a Rust iterator that optimizes like that (I guess llvm would have to recognize it as two consecutive loops). .chain() doesn't do it, and neither do simpler, special-cased variants of it (see the sketch below). It's a big iterator challenge!
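
For illustration, the .chain() version that fails to optimize is something like this (a sketch, not the benchmark code above):

use std::collections::VecDeque;

// Chain's next() re-checks which half it is in on every element, and
// llvm does not split this into two consecutive slice loops.
fn sum_chained(dq: &VecDeque<i32>) -> i32 {
    let (a, b) = dq.as_slices();
    let mut sum = 0;
    for elt in a.iter().chain(b.iter()) {
        sum += *elt;
    }
    sum
}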

I've run into the same thing in the ndarray crate. For non-contiguous arrays, the iterator wants to go row by row and dispatch to the slice iterator for each contiguous row. I haven't been able to make a Rust iterator that does this efficiently, so I use explicit nested for loops to implement arithmetic ops instead (roughly as sketched below).
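
Roughly, the nested-loop fallback looks like this (a hypothetical sketch, not ndarray's actual code):

// The inner loop runs over a plain contiguous row slice, which llvm
// can autovectorize; the outer loop just walks the rows.
fn sum_by_rows(rows: &[&[f64]]) -> f64 {
    let mut sum = 0.0;
    for row in rows {
        for elt in *row {
            sum += *elt;
        }
    }
    sum
}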

@MagaTailor

Could you clarify which SSE instruction set was used to get that 20x speedup?

On ARMv7 it's 2x before using NEON and 3x after, so I'm not sure this is the best llvm could do here (possibly material for an llvm bug report).

Even on an old x86 cpu, using SSE (v1) yields a 2x speedup for a total of 5x.

@bluss
Member Author

bluss commented Jan 10, 2016

My tested platform is x86-64, Sandy Bridge, but with default settings, so only the default instruction set is used.

It's using paddd; the core loop in sum_deque_2 looks like this:

    f460:       f3 0f 6f 52 90          movdqu xmm2,XMMWORD PTR [rdx-0x70]
    f465:       f3 0f 6f 5a a0          movdqu xmm3,XMMWORD PTR [rdx-0x60]
    f46a:       f3 0f 6f 62 b0          movdqu xmm4,XMMWORD PTR [rdx-0x50]
    f46f:       f3 0f 6f 6a c0          movdqu xmm5,XMMWORD PTR [rdx-0x40]
    f474:       66 0f fe d0             paddd  xmm2,xmm0
    f478:       66 0f fe d9             paddd  xmm3,xmm1
    f47c:       66 0f fe d4             paddd  xmm2,xmm4
    f480:       66 0f fe dd             paddd  xmm3,xmm5
    f484:       f3 0f 6f 62 d0          movdqu xmm4,XMMWORD PTR [rdx-0x30]
    f489:       f3 0f 6f 6a e0          movdqu xmm5,XMMWORD PTR [rdx-0x20]
    f48e:       66 0f fe e2             paddd  xmm4,xmm2
    f492:       66 0f fe eb             paddd  xmm5,xmm3
    f496:       f3 0f 6f 42 f0          movdqu xmm0,XMMWORD PTR [rdx-0x10]
    f49b:       f3 0f 6f 0a             movdqu xmm1,XMMWORD PTR [rdx]
    f49f:       66 0f fe c4             paddd  xmm0,xmm4
    f4a3:       66 0f fe cd             paddd  xmm1,xmm5
    f4a7:       48 83 ea 80             sub    rdx,0xffffffffffffff80
    f4ab:       48 83 c3 e0             add    rbx,0xffffffffffffffe0
    f4af:       75 af                   jne    f460 <_ZN11sum_deque_220hbee41beab497848aadaE+0x2b0>

Using -C target-cpu=native, sum_deque_2 improves further, switching to AVX(?) instructions (vpadd and so on).

@MagaTailor

Thanks, just the default SSE2 indeed.

Do you think it would be useful to create an llvm issue containing IR output from the two platforms side by side? The issue is here.

The llvm ARM guy is interested in some more ARM data points; care to chime in, @emoon, @warricksothr?

@bluss
Member Author

bluss commented Jan 11, 2016

Your issue is only tangential to this one, but you should make a more reduced test case if you only want to look at the loop's codegen. Here's what I'd look at (just the summation), which produces a big blob of autovectorized code on x86-64: https://play.rust-lang.org/?gist=8b542d1b17b8d69e44b7&version=nightly
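
The reduced case is along these lines (a sketch; the exact code is behind the playground link):

// Just the slice summation, isolated so the autovectorized loop is easy
// to spot in the generated assembly.
#[inline(never)]
pub fn sum_slice(xs: &[i32]) -> i32 {
    let mut sum = 0;
    for elt in xs {
        sum += *elt;
    }
    sum
}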

@MagaTailor

Thanks @bluss, and sorry for the diversion. The fact that unrolling is disabled on ARM by default was an interesting discovery. Using the -force-target-max-vector-interleave=4 LLVM option actually produces a small improvement (and the gains stop there for N > 4).

@bluss
Member Author

bluss commented Dec 5, 2016

Triage: No change using the for loop, but due to the fold improvement for VecDeque's iterator, the idiomatic dq.iter().sum() is now as fast as it should be (same as the slice).
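
That is, the internal-iteration form now matches the slice speed:

use std::collections::VecDeque;

// sum() goes through the iterator's specialized fold, which processes
// each slice half as a unit instead of stepping element by element.
fn sum_idiomatic(dq: &VecDeque<i32>) -> i32 {
    dq.iter().sum()
}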

@Mark-Simulacrum added the C-enhancement label Jul 24, 2017
@jonas-schievink added the T-libs-api label Dec 11, 2019