Improve performance of interleave_primitive (-15% - 45%) / interleave_bytes (-10-25%) #7420

Dandandan · 2025-04-18T12:47:45Z

Which issue does this PR close?

interleave i32(0.0) 100 [0..100, 100..230, 450..1000]
                        time:   [222.48 ns 223.18 ns 223.84 ns]
                        change: [-1.6267% -1.1263% -0.6239%] (p = 0.00 < 0.05)
                        Change within noise threshold.

interleave i32(0.0) 400 [0..100, 100..230, 450..1000]
                        time:   [450.60 ns 451.23 ns 451.74 ns]
                        change: [-12.127% -11.970% -11.794%] (p = 0.00 < 0.05)
                        Performance has improved.

interleave i32(0.0) 1024 [0..100, 100..230, 450..1000]
                        time:   [998.30 ns 999.38 ns 1.0002 µs]
                        change: [-16.738% -15.398% -14.601%] (p = 0.00 < 0.05)
                        Performance has improved.

interleave i32(0.0) 1024 [0..100, 100..230, 450..1000, 0..1000]
                        time:   [986.54 ns 987.56 ns 988.80 ns]
                        change: [-15.563% -15.399% -15.254%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

interleave i32(0.5) 100 [0..100, 100..230, 450..1000]
                        time:   [465.60 ns 465.82 ns 466.07 ns]
                        change: [-26.376% -26.205% -26.041%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  8 (8.00%) low severe
  1 (1.00%) high mild

interleave i32(0.5) 400 [0..100, 100..230, 450..1000]
                        time:   [1.0934 µs 1.0938 µs 1.0942 µs]
                        change: [-39.139% -38.791% -38.459%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 18 outliers among 100 measurements (18.00%)
  7 (7.00%) low severe
  1 (1.00%) low mild
  8 (8.00%) high mild
  2 (2.00%) high severe

interleave i32(0.5) 1024 [0..100, 100..230, 450..1000]
                        time:   [2.4511 µs 2.4528 µs 2.4549 µs]
                        change: [-45.029% -44.382% -43.886%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 17 outliers among 100 measurements (17.00%)
  2 (2.00%) high mild
  15 (15.00%) high severe

interleave i32(0.5) 1024 [0..100, 100..230, 450..1000, 0..1000]
                        time:   [2.4604 µs 2.4643 µs 2.4688 µs]
                        change: [-45.677% -44.856% -44.155%] (p = 0.00 < 0.05)
                        Performance has improved.

interleave str(20, 0.0) 100 [0..100, 100..230, 450..1000]
                        time:   [820.15 ns 820.28 ns 820.42 ns]
                        change: [-9.6030% -9.4246% -9.2597%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe

interleave str(20, 0.0) 400 [0..100, 100..230, 450..1000]
                        time:   [2.8129 µs 2.8159 µs 2.8204 µs]
                        change: [-9.4881% -9.4013% -9.3124%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low severe
  4 (4.00%) low mild
  1 (1.00%) high mild
  5 (5.00%) high severe

interleave str(20, 0.0) 1024 [0..100, 100..230, 450..1000]
                        time:   [6.9113 µs 6.9128 µs 6.9145 µs]
                        change: [-10.418% -10.342% -10.263%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe

Benchmarking interleave str(20, 0.0) 1024 [0..100, 100..230, 450..1000, 0..1000]: Collecting 100 samples in estimated 5.0081 s (722k iterationsinterleave str(20, 0.0) 1024 [0..100, 100..230, 450..1000, 0..1000]
                        time:   [6.9336 µs 6.9345 µs 6.9354 µs]
                        change: [-10.434% -10.327% -10.231%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) low mild
  2 (2.00%) high mild
  3 (3.00%) high severe

interleave str(20, 0.5) 100 [0..100, 100..230, 450..1000]
                        time:   [937.01 ns 938.97 ns 941.31 ns]
                        change: [-20.750% -20.527% -20.290%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

interleave str(20, 0.5) 400 [0..100, 100..230, 450..1000]
                        time:   [3.0452 µs 3.0519 µs 3.0592 µs]
                        change: [-23.362% -23.216% -23.061%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

interleave str(20, 0.5) 1024 [0..100, 100..230, 450..1000]
                        time:   [8.1354 µs 8.1484 µs 8.1620 µs]
                        change: [-23.524% -23.254% -22.982%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  5 (5.00%) high mild
  4 (4.00%) high severe

Benchmarking interleave str(20, 0.5) 1024 [0..100, 100..230, 450..1000, 0..1000]: Collecting 100 samples in estimated 5.0272 s (611k iterationsinterleave str(20, 0.5) 1024 [0..100, 100..230, 450..1000, 0..1000]
                        time:   [8.1120 µs 8.1211 µs 8.1321 µs]
                        change: [-22.049% -21.597% -21.157%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 19 outliers among 100 measurements (19.00%)
  2 (2.00%) high mild
  17 (17.00%) high severe

Closes #.

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

mbutrovich

Awesome! Makes me want to look at other places we're using Vec::push. I dug into the generated code to see what's going on. In the tight loop a big win is likely not having to jump out to RawVec::grow_one. Nice find!

pr.txt
main.txt

Dandandan · 2025-04-18T15:41:50Z

Awesome! Makes me want to look at other places we're using Vec::push. I dug into the generated code to see what's going on. In the tight loop a big win is likely not having to jump out to RawVec::grow_one. Nice find! pr.txt main.txt

Thanks for checking! I wonder, is this screenshot from Compiler Explorer or a local tool?

Dandandan · 2025-04-18T15:47:37Z

I also pushed an improvement for interleave_bytes

Dandandan · 2025-04-18T15:56:53Z

arrow-select/src/interleave.rs

-    let mut offsets = BufferBuilder::<T::Offset>::new(indices.len() + 1);
-    offsets.append(T::Offset::from_usize(0).unwrap());
-    for (a, b) in indices {
+    let mut offsets = Vec::with_capacity(indices.len() + 1);


Vec and extend generates better code.
There is probably some other places this pattern can be applied as well @mbutrovich

mbutrovich · 2025-04-18T16:20:45Z

Thanks for checking! I wonder, is this screenshot from Compiler Explorer or a local tool?

It's Beyond Compare but you can use your diff tool of choice with the two .txt dumps of the machine code. I use cargo-show-asm to generate the relevant snippets of code.

For example, cargo asm -p arrow-select --lib --simplify shows all the functions, and then I look for interleave_primitive. It was function 934 so I did cargo asm -p arrow-select --lib --simplify 934 > ~/Desktop/pr.txt to get the pr.txt file. You may need to add the --context argument if your function calls out to other functions, or look for functions that call your function if it's been inlined.

I don't find that Compiler Explorer generates representative code for real projects. Putting snippets in there often doesn't reflect what the compiler does with large projects with complex CFG DAGs, external crates, LTO, and inlined functions.

Dandandan · 2025-04-18T16:29:20Z

arrow-select/src/interleave.rs

-                    builder.append(v)
-                }
-                builder.finish()
+                let nulls = BooleanBuffer::collect_bool(indices.len(), |i| {


This is a pattern that comes up often as well

Dandandan · 2025-04-18T16:56:47Z

Thanks for checking! I wonder, is this screenshot from Compiler Explorer or a local tool?

It's Beyond Compare but you can use your diff tool of choice with the two .txt dumps of the machine code. I use cargo-show-asm to generate the relevant snippets of code.

For example, cargo asm -p arrow-select --lib --simplify shows all the functions, and then I look for interleave_primitive. It was function 934 so I did cargo asm -p arrow-select --lib --simplify 934 > ~/Desktop/pr.txt to get the pr.txt file. You may need to add the --context argument if your function calls out to other functions, or look for functions that call your function if it's been inlined.

I don't find that Compiler Explorer generates representative code for real projects. Putting snippets in there often doesn't reflect what the compiler does with large projects with complex CFG DAGs, external crates, LTO, and inlined functions.

Ah nice, thanks for the overview! I'll try might try that in the future.

pacak · 2025-04-21T18:08:39Z

For example, cargo asm -p arrow-select --lib --simplify shows all the functions, and then I look for interleave_primitive. It was function 934 so I did cargo asm -p arrow-select --lib --simplify 934

If you know the name of the function or even part of it - you can specify all of it or parts as well: cargo asm -p arrow-select --lib --simplify interleave_pr, it should print only names that contain this substring - it should be easier to find the name you want since there will be less output. And then you can pass the number along with the name: cargo asm -p arrow-select --lib --simplify interleave_pr 934

Dandandan · 2025-04-21T18:22:52Z

For example, cargo asm -p arrow-select --lib --simplify shows all the functions, and then I look for interleave_primitive. It was function 934 so I did cargo asm -p arrow-select --lib --simplify 934

If you know the name of the function or even part of it - you can specify all of it or parts as well: cargo asm -p arrow-select --lib --simplify interleave_pr, it should print only names that contain this substring - it should be easier to find the name you want since there will be less output. And then you can pass the number along with the name: cargo asm -p arrow-select --lib --simplify interleave_pr 934

TY @pacak !

Improve interleave_primitive

ed022fd

github-actions bot added the arrow Changes to the arrow crate label Apr 18, 2025

Dandandan changed the title ~~Improve performance of interleave_primitive~~ [MINOR] Improve performance of interleave_primitive (-15%) Apr 18, 2025

Fmt

c6d8479

mbutrovich approved these changes Apr 18, 2025

View reviewed changes

Dandandan changed the title ~~[MINOR] Improve performance of interleave_primitive (-15%)~~ [MINOR] Improve performance of interleave_primitive (-15%) / interleave_bytes (-30-40%) Apr 18, 2025

Improve interleave_bytes

3850342

Dandandan force-pushed the improve_interleave branch from 5c51c91 to 3850342 Compare April 18, 2025 15:48

Dandandan commented Apr 18, 2025

View reviewed changes

Improve null handling

f3c32f5

Fmt

7534ef7

Dandandan changed the title ~~[MINOR] Improve performance of interleave_primitive (-15%) / interleave_bytes (-30-40%)~~ [MINOR] Improve performance of interleave_primitive (-15% - 45%) / interleave_bytes (-10-25%) Apr 18, 2025

Dandandan commented Apr 18, 2025

View reviewed changes

Just use Vec (no change in perf)

dee9afe

Dandandan changed the title ~~[MINOR] Improve performance of interleave_primitive (-15% - 45%) / interleave_bytes (-10-25%)~~ Improve performance of interleave_primitive (-15% - 45%) / interleave_bytes (-10-25%) Apr 18, 2025

Dandandan merged commit 5760049 into apache:main Apr 21, 2025
28 checks passed

alamb mentioned this pull request May 8, 2025

Improve performance of interleave_primitive and interleave_bytes #7421

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Improve performance of interleave_primitive (-15% - 45%) / interleave_bytes (-10-25%) #7420

Improve performance of interleave_primitive (-15% - 45%) / interleave_bytes (-10-25%) #7420

Uh oh!

Dandandan commented Apr 18, 2025 •

edited

Loading

Uh oh!

mbutrovich left a comment

Uh oh!

Dandandan commented Apr 18, 2025

Uh oh!

Dandandan commented Apr 18, 2025

Uh oh!

Dandandan Apr 18, 2025

Uh oh!

mbutrovich commented Apr 18, 2025 •

edited

Loading

Uh oh!

Dandandan Apr 18, 2025

Uh oh!

Dandandan commented Apr 18, 2025

Uh oh!

Uh oh!

pacak commented Apr 21, 2025

Uh oh!

Dandandan commented Apr 21, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Improve performance of interleave_primitive (-15% - 45%) / interleave_bytes (-10-25%) #7420

Improve performance of interleave_primitive (-15% - 45%) / interleave_bytes (-10-25%) #7420

Uh oh!

Conversation

Dandandan commented Apr 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

Uh oh!

mbutrovich left a comment

Choose a reason for hiding this comment

Uh oh!

Dandandan commented Apr 18, 2025

Uh oh!

Dandandan commented Apr 18, 2025

Uh oh!

Dandandan Apr 18, 2025

Choose a reason for hiding this comment

Uh oh!

mbutrovich commented Apr 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Dandandan Apr 18, 2025

Choose a reason for hiding this comment

Uh oh!

Dandandan commented Apr 18, 2025

Uh oh!

Uh oh!

pacak commented Apr 21, 2025

Uh oh!

Dandandan commented Apr 21, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Dandandan commented Apr 18, 2025 •

edited

Loading

mbutrovich commented Apr 18, 2025 •

edited

Loading