Skip to content

Conversation

@Dandandan
Copy link
Contributor

@Dandandan Dandandan commented Apr 18, 2025

Which issue does this PR close?

Closes #7421

interleave i32(0.0) 100 [0..100, 100..230, 450..1000]
                        time:   [222.48 ns 223.18 ns 223.84 ns]
                        change: [-1.6267% -1.1263% -0.6239%] (p = 0.00 < 0.05)
                        Change within noise threshold.

interleave i32(0.0) 400 [0..100, 100..230, 450..1000]
                        time:   [450.60 ns 451.23 ns 451.74 ns]
                        change: [-12.127% -11.970% -11.794%] (p = 0.00 < 0.05)
                        Performance has improved.

interleave i32(0.0) 1024 [0..100, 100..230, 450..1000]
                        time:   [998.30 ns 999.38 ns 1.0002 µs]
                        change: [-16.738% -15.398% -14.601%] (p = 0.00 < 0.05)
                        Performance has improved.

interleave i32(0.0) 1024 [0..100, 100..230, 450..1000, 0..1000]
                        time:   [986.54 ns 987.56 ns 988.80 ns]
                        change: [-15.563% -15.399% -15.254%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

interleave i32(0.5) 100 [0..100, 100..230, 450..1000]
                        time:   [465.60 ns 465.82 ns 466.07 ns]
                        change: [-26.376% -26.205% -26.041%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  8 (8.00%) low severe
  1 (1.00%) high mild

interleave i32(0.5) 400 [0..100, 100..230, 450..1000]
                        time:   [1.0934 µs 1.0938 µs 1.0942 µs]
                        change: [-39.139% -38.791% -38.459%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 18 outliers among 100 measurements (18.00%)
  7 (7.00%) low severe
  1 (1.00%) low mild
  8 (8.00%) high mild
  2 (2.00%) high severe

interleave i32(0.5) 1024 [0..100, 100..230, 450..1000]
                        time:   [2.4511 µs 2.4528 µs 2.4549 µs]
                        change: [-45.029% -44.382% -43.886%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 17 outliers among 100 measurements (17.00%)
  2 (2.00%) high mild
  15 (15.00%) high severe

interleave i32(0.5) 1024 [0..100, 100..230, 450..1000, 0..1000]
                        time:   [2.4604 µs 2.4643 µs 2.4688 µs]
                        change: [-45.677% -44.856% -44.155%] (p = 0.00 < 0.05)
                        Performance has improved.

interleave str(20, 0.0) 100 [0..100, 100..230, 450..1000]
                        time:   [820.15 ns 820.28 ns 820.42 ns]
                        change: [-9.6030% -9.4246% -9.2597%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe

interleave str(20, 0.0) 400 [0..100, 100..230, 450..1000]
                        time:   [2.8129 µs 2.8159 µs 2.8204 µs]
                        change: [-9.4881% -9.4013% -9.3124%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low severe
  4 (4.00%) low mild
  1 (1.00%) high mild
  5 (5.00%) high severe

interleave str(20, 0.0) 1024 [0..100, 100..230, 450..1000]
                        time:   [6.9113 µs 6.9128 µs 6.9145 µs]
                        change: [-10.418% -10.342% -10.263%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe

Benchmarking interleave str(20, 0.0) 1024 [0..100, 100..230, 450..1000, 0..1000]: Collecting 100 samples in estimated 5.0081 s (722k iterationsinterleave str(20, 0.0) 1024 [0..100, 100..230, 450..1000, 0..1000]
                        time:   [6.9336 µs 6.9345 µs 6.9354 µs]
                        change: [-10.434% -10.327% -10.231%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) low mild
  2 (2.00%) high mild
  3 (3.00%) high severe

interleave str(20, 0.5) 100 [0..100, 100..230, 450..1000]
                        time:   [937.01 ns 938.97 ns 941.31 ns]
                        change: [-20.750% -20.527% -20.290%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

interleave str(20, 0.5) 400 [0..100, 100..230, 450..1000]
                        time:   [3.0452 µs 3.0519 µs 3.0592 µs]
                        change: [-23.362% -23.216% -23.061%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

interleave str(20, 0.5) 1024 [0..100, 100..230, 450..1000]
                        time:   [8.1354 µs 8.1484 µs 8.1620 µs]
                        change: [-23.524% -23.254% -22.982%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  5 (5.00%) high mild
  4 (4.00%) high severe

Benchmarking interleave str(20, 0.5) 1024 [0..100, 100..230, 450..1000, 0..1000]: Collecting 100 samples in estimated 5.0272 s (611k iterationsinterleave str(20, 0.5) 1024 [0..100, 100..230, 450..1000, 0..1000]
                        time:   [8.1120 µs 8.1211 µs 8.1321 µs]
                        change: [-22.049% -21.597% -21.157%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 19 outliers among 100 measurements (19.00%)
  2 (2.00%) high mild
  17 (17.00%) high severe

Closes #.

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Apr 18, 2025
@Dandandan Dandandan changed the title Improve performance of interleave_primitive [MINOR] Improve performance of interleave_primitive (-15%) Apr 18, 2025
Copy link
Contributor

@mbutrovich mbutrovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! Makes me want to look at other places we're using Vec::push. I dug into the generated code to see what's going on. In the tight loop a big win is likely not having to jump out to RawVec::grow_one. Nice find!
Screenshot 2025-04-18 at 10 32 46 AM
pr.txt
main.txt

@Dandandan
Copy link
Contributor Author

Awesome! Makes me want to look at other places we're using Vec::push. I dug into the generated code to see what's going on. In the tight loop a big win is likely not having to jump out to RawVec::grow_one. Nice find! Screenshot 2025-04-18 at 10 32 46 AM pr.txt main.txt

Thanks for checking! I wonder, is this screenshot from Compiler Explorer or a local tool?

@Dandandan Dandandan changed the title [MINOR] Improve performance of interleave_primitive (-15%) [MINOR] Improve performance of interleave_primitive (-15%) / interleave_bytes (-30-40%) Apr 18, 2025
@Dandandan
Copy link
Contributor Author

I also pushed an improvement for interleave_bytes

@Dandandan Dandandan force-pushed the improve_interleave branch from 5c51c91 to 3850342 Compare April 18, 2025 15:48
let mut offsets = BufferBuilder::<T::Offset>::new(indices.len() + 1);
offsets.append(T::Offset::from_usize(0).unwrap());
for (a, b) in indices {
let mut offsets = Vec::with_capacity(indices.len() + 1);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Vec and extend generates better code.
There is probably some other places this pattern can be applied as well @mbutrovich

@mbutrovich
Copy link
Contributor

mbutrovich commented Apr 18, 2025

Thanks for checking! I wonder, is this screenshot from Compiler Explorer or a local tool?

It's Beyond Compare but you can use your diff tool of choice with the two .txt dumps of the machine code. I use cargo-show-asm to generate the relevant snippets of code.

For example, cargo asm -p arrow-select --lib --simplify shows all the functions, and then I look for interleave_primitive. It was function 934 so I did cargo asm -p arrow-select --lib --simplify 934 > ~/Desktop/pr.txt to get the pr.txt file. You may need to add the --context argument if your function calls out to other functions, or look for functions that call your function if it's been inlined.

I don't find that Compiler Explorer generates representative code for real projects. Putting snippets in there often doesn't reflect what the compiler does with large projects with complex CFG DAGs, external crates, LTO, and inlined functions.

@Dandandan Dandandan changed the title [MINOR] Improve performance of interleave_primitive (-15%) / interleave_bytes (-30-40%) [MINOR] Improve performance of interleave_primitive (-15% - 45%) / interleave_bytes (-10-25%) Apr 18, 2025
builder.append(v)
}
builder.finish()
let nulls = BooleanBuffer::collect_bool(indices.len(), |i| {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a pattern that comes up often as well

@Dandandan
Copy link
Contributor Author

Thanks for checking! I wonder, is this screenshot from Compiler Explorer or a local tool?

It's Beyond Compare but you can use your diff tool of choice with the two .txt dumps of the machine code. I use cargo-show-asm to generate the relevant snippets of code.

For example, cargo asm -p arrow-select --lib --simplify shows all the functions, and then I look for interleave_primitive. It was function 934 so I did cargo asm -p arrow-select --lib --simplify 934 > ~/Desktop/pr.txt to get the pr.txt file. You may need to add the --context argument if your function calls out to other functions, or look for functions that call your function if it's been inlined.

I don't find that Compiler Explorer generates representative code for real projects. Putting snippets in there often doesn't reflect what the compiler does with large projects with complex CFG DAGs, external crates, LTO, and inlined functions.

Ah nice, thanks for the overview! I'll try might try that in the future.

@Dandandan Dandandan changed the title [MINOR] Improve performance of interleave_primitive (-15% - 45%) / interleave_bytes (-10-25%) Improve performance of interleave_primitive (-15% - 45%) / interleave_bytes (-10-25%) Apr 18, 2025
@Dandandan Dandandan merged commit 5760049 into apache:main Apr 21, 2025
28 checks passed
@pacak
Copy link

pacak commented Apr 21, 2025

For example, cargo asm -p arrow-select --lib --simplify shows all the functions, and then I look for interleave_primitive. It was function 934 so I did cargo asm -p arrow-select --lib --simplify 934

If you know the name of the function or even part of it - you can specify all of it or parts as well: cargo asm -p arrow-select --lib --simplify interleave_pr, it should print only names that contain this substring - it should be easier to find the name you want since there will be less output. And then you can pass the number along with the name: cargo asm -p arrow-select --lib --simplify interleave_pr 934

@Dandandan
Copy link
Contributor Author

For example, cargo asm -p arrow-select --lib --simplify shows all the functions, and then I look for interleave_primitive. It was function 934 so I did cargo asm -p arrow-select --lib --simplify 934

If you know the name of the function or even part of it - you can specify all of it or parts as well: cargo asm -p arrow-select --lib --simplify interleave_pr, it should print only names that contain this substring - it should be easier to find the name you want since there will be less output. And then you can pass the number along with the name: cargo asm -p arrow-select --lib --simplify interleave_pr 934

TY @pacak !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve performance of interleave_primitive and interleave_bytes

3 participants