
ARROW-10010: [Rust] Speedup arithmetic (1.3-1.9x)#8191

Closed
jorgecarleitao wants to merge 4 commits into apache:master from jorgecarleitao:divide_simd_faster

@jorgecarleitao
Member

This PR speeds up arithmetic ops by letting the compiler vectorize the non-divide operations in the non-SIMD code path, and by removing an unneeded operation in SIMD division.

For non-SIMD, this reduces execution time by about 30-45% for all operations (+, -, *, /).
For SIMD, this reduces execution time by about 30% on division.
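
The percentages above are changes in execution time, while the PR title is phrased as a speedup factor. A minimal sketch of the conversion between the two (the helper name `speedup_from_change` is hypothetical and not part of the PR): a change of -30% corresponds to roughly 1.43x, and -45% to roughly 1.82x.

```rust
/// Converts a relative change in execution time (e.g. -0.45 for "-45%")
/// into a speedup factor: old_time / new_time = 1 / (1 + change).
/// Hypothetical helper for illustration only.
fn speedup_from_change(change: f64) -> f64 {
    1.0 / (1.0 + change)
}
```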

The culprit in non-SIMD was that we required the operation to return Result<T::Native>, which prevented the compiler from vectorizing the operation. Only the division requires Result. For divide, removing the operator sped up the operation further (I do not know the reason).
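
To illustrate the non-SIMD point, here is a minimal sketch, not the actual arrow kernel code. The function names (`math_op_fallible`, `math_op`, `divide`) and the use of plain `i32` slices instead of Arrow arrays are assumptions made for illustration. The first function forces every element through a `Result`, which adds an error check per element; the second takes an infallible closure, leaving a plain loop that the compiler is free to auto-vectorize; only division keeps the fallible path.

```rust
/// Old shape (sketch): the closure returns `Result`, so every element
/// goes through an error check even when the operation cannot fail.
fn math_op_fallible<F>(left: &[i32], right: &[i32], op: F) -> Result<Vec<i32>, String>
where
    F: Fn(i32, i32) -> Result<i32, String>,
{
    left.iter().zip(right.iter()).map(|(l, r)| op(*l, *r)).collect()
}

/// New shape (sketch): the closure is infallible, which leaves a simple
/// element-wise loop that the compiler may auto-vectorize.
fn math_op<F>(left: &[i32], right: &[i32], op: F) -> Vec<i32>
where
    F: Fn(i32, i32) -> i32,
{
    left.iter().zip(right.iter()).map(|(l, r)| op(*l, *r)).collect()
}

/// Division is the one operation that genuinely needs `Result`.
/// `checked_div` returns `None` on a zero divisor (and on overflow).
fn divide(left: &[i32], right: &[i32]) -> Result<Vec<i32>, String> {
    left.iter()
        .zip(right.iter())
        .map(|(l, r)| {
            l.checked_div(*r)
                .ok_or_else(|| "division by zero or overflow".to_string())
        })
        .collect()
}
```

Whether the infallible version is actually vectorized depends on the compiler and target, so the benchmark numbers remain the evidence; the sketch only shows the difference in shape.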

The culprit in SIMD was primarily one simd_load too many: a redundant load whose result was not used.

Benchmarks

The benchmark used:

set -e
git checkout 0852869d1a9b7da4a1b91fa7cb7d4ef48e99cdec
cargo bench --bench arithmetic_kernels
git checkout divide_simd_faster
cargo bench --bench arithmetic_kernels
echo "##################################"
git checkout 0852869d1a9b7da4a1b91fa7cb7d4ef48e99cdec
cargo bench --bench arithmetic_kernels --features simd
git checkout divide_simd_faster
cargo bench --bench arithmetic_kernels --features simd

Below are the results of the second bench run in each pair (the one that reports the change relative to the baseline), on my machine:

Non-SIMD

Previous HEAD position was 0852869d1 Improved benches for arithmetic.
Switched to branch 'divide_simd_faster'
   Compiling arrow v2.0.0-SNAPSHOT (/Users/jorgecarleitao/projects/arrow/rust/arrow)
    Finished bench [optimized] target(s) in 37.24s
     Running /Users/jorgecarleitao/projects/arrow/rust/target/release/deps/arithmetic_kernels-d281862a43faaf38
Gnuplot not found, using plotters backend
add 512                 time:   [1.4714 us 1.4758 us 1.4803 us]                     
                        change: [-44.446% -43.969% -43.522%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high severe

subtract 512            time:   [1.4825 us 1.4844 us 1.4866 us]                          
                        change: [-45.351% -45.018% -44.686%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  5 (5.00%) high mild
  4 (4.00%) high severe

multiply 512            time:   [1.4895 us 1.4936 us 1.4990 us]                          
                        change: [-44.822% -44.135% -43.479%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe

divide 512              time:   [1.9742 us 1.9773 us 1.9810 us]                        
                        change: [-33.273% -32.688% -32.052%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  7 (7.00%) high mild
  7 (7.00%) high severe

limit 512, 512          time:   [374.66 ns 375.64 ns 376.53 ns]                           
                        change: [-0.1000% +0.4442% +0.9503%] (p = 0.10 > 0.05)
                        No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) low severe
  2 (2.00%) low mild
  2 (2.00%) high mild
  2 (2.00%) high severe

add_nulls_512           time:   [1.4880 us 1.4982 us 1.5115 us]                           
                        change: [-44.084% -43.116% -42.111%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
  3 (3.00%) high mild
  13 (13.00%) high severe

divide_nulls_512        time:   [1.9731 us 1.9758 us 1.9790 us]                              
                        change: [-33.404% -32.570% -31.416%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe

SIMD

Only the divide results are relevant here.

Previous HEAD position was 0852869d1 Improved benches for arithmetic.
Switched to branch 'divide_simd_faster'
   Compiling arrow v2.0.0-SNAPSHOT (/Users/jorgecarleitao/projects/arrow/rust/arrow)
    Finished bench [optimized] target(s) in 38.63s
     Running /Users/jorgecarleitao/projects/arrow/rust/target/release/deps/arithmetic_kernels-b8dc1739cfb5ae36
Gnuplot not found, using plotters backend
add 512                 time:   [879.31 ns 883.95 ns 889.17 ns]                     
                        change: [-0.2041% +0.6502% +1.5484%] (p = 0.15 > 0.05)
                        No change in performance detected.
Found 16 outliers among 100 measurements (16.00%)
  5 (5.00%) high mild
  11 (11.00%) high severe

subtract 512            time:   [864.99 ns 866.95 ns 868.95 ns]                          
                        change: [-4.8531% -4.1561% -3.5163%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe

multiply 512            time:   [862.85 ns 864.87 ns 867.71 ns]                          
                        change: [-3.8532% -3.1774% -2.4459%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high severe

divide 512              time:   [1.9703 us 1.9771 us 1.9843 us]                        
                        change: [-30.046% -29.457% -28.903%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

limit 512, 512          time:   [368.89 ns 369.96 ns 370.96 ns]                           
                        change: [-1.9574% -1.0063% -0.0347%] (p = 0.04 < 0.05)
                        Change within noise threshold.
Found 26 outliers among 100 measurements (26.00%)
  5 (5.00%) low severe
  6 (6.00%) low mild
  9 (9.00%) high mild
  6 (6.00%) high severe

add_nulls_512           time:   [871.97 ns 876.99 ns 883.57 ns]                           
                        change: [-5.1106% -3.6889% -2.3080%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe

divide_nulls_512        time:   [1.9582 us 1.9625 us 1.9678 us]                              
                        change: [-34.188% -33.161% -32.136%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe


  let null_bit_buffer =
      combine_option_bitmap(left.data_ref(), right.data_ref(), left.len())?;
- let bitmap = null_bit_buffer.map(Bitmap::from);
+ let bitmap = null_bit_buffer.clone().map(Bitmap::from);
Contributor

Is this clone necessary?

Member Author

Maybe not, but I was unable to get a bitmap reference from a buffer, to set the mask for SIMD, without cloning it.
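
For context on why a clone shows up here, a minimal sketch under stated assumptions: `Option::map` consumes the option, so building a mask from it makes the original unavailable afterwards unless it is cloned first. `SharedBuffer` and `Mask` below are hypothetical stand-ins (an `Arc<Vec<u8>>` wrapper), not arrow's `Buffer` and `Bitmap`. If the buffer is reference-counted, as assumed in this sketch, the clone copies a pointer and not the bytes.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a reference-counted byte buffer.
#[derive(Clone)]
struct SharedBuffer(Arc<Vec<u8>>);

/// Hypothetical stand-in for a validity bitmap built from a buffer.
struct Mask(SharedBuffer);

impl From<SharedBuffer> for Mask {
    fn from(buffer: SharedBuffer) -> Self {
        Mask(buffer)
    }
}

impl Mask {
    /// Returns whether bit `i` is set (least-significant bit first).
    fn is_set(&self, i: usize) -> bool {
        ((self.0).0[i / 8] & (1 << (i % 8))) != 0
    }
}

/// `Option::map` takes the option by value, so the clone keeps `buffer`
/// usable after the mask has been built from it.
fn mask_and_buffer(buffer: Option<SharedBuffer>) -> (Option<Mask>, Option<SharedBuffer>) {
    let mask = buffer.clone().map(Mask::from);
    (mask, buffer)
}
```

In this sketch the mask and the returned buffer point at the same allocation, so the cost of the clone is a reference-count increment.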

@nevi-me nevi-me closed this in 49e5b46 Sep 15, 2020
@jorgecarleitao jorgecarleitao deleted the divide_simd_faster branch September 15, 2020 09:11