Vectorize non-contiguous unary operations #8488
Looks good, but the macOS build failure looks legit.
@pytorchbot retest this please
colesbury left a comment:
Worth a comment in the code about why cosh and sinh are disabled.
…ion and _mm256_sqrt_ps. Currently, for single-precision floating-point numbers, division and _mm256_sqrt_ps are used. While the current method is more accurate, using _mm256_rsqrt_ps is faster. See pytorch#8488 (comment). This commit switches to _mm256_rsqrt_ps because:
- It is more consistent with the CUDA implementation, which uses rsqrt, the fast but less accurate version: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#standard-functions
- Users who call torch.rsqrt() are more likely to want speed than accuracy; otherwise, they can use torch.sqrt().reciprocal().

Another possibility would be to add an option to decide which version to use, since rsqrt also has a faster derivative.
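To illustrate the speed/accuracy tradeoff this commit describes, here is a minimal Python sketch. It is not PyTorch's kernel and not `_mm256_rsqrt_ps` itself: it uses the classic bit-trick approximate inverse square root plus one Newton step as a stand-in for a fast, lower-precision rsqrt, and compares it against the exact `1/sqrt(x)`:

```python
import math
import struct

def approx_rsqrt(x: float) -> float:
    """Approximate 1/sqrt(x): cheap bit-level initial guess plus one Newton step."""
    # Reinterpret the float32 bits of x as an integer to form the initial guess.
    i = struct.unpack('<I', struct.pack('<f', x))[0]
    i = 0x5F3759DF - (i >> 1)
    y = struct.unpack('<f', struct.pack('<I', i))[0]
    # One Newton-Raphson refinement, as is common after a hardware rsqrt estimate.
    return y * (1.5 - 0.5 * x * y * y)

x = 2.0
exact = 1.0 / math.sqrt(x)
fast = approx_rsqrt(x)
rel_err = abs(fast - exact) / exact
print(f"exact={exact:.6f} fast={fast:.6f} rel_err={rel_err:.2e}")
```

Like the hardware estimate, the approximation trades a small relative error (well under 1%) for avoiding a full-precision divide and square root.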
Timings are on a single core. This branch increases performance on non-contiguous tensors and also includes the dlopen bugfix for glibc 2.23. I'm comparing against conda's Intel packages, which come with MKL VML. The benchmark is based on this commit of PyTorch's benchmark repo.
| Command | Timings for this branch | Timings for master |
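The kind of measurement behind this table can be reproduced with a small standard-library harness. The sketch below is illustrative only (the stride of 4 and the squaring op are arbitrary choices, not the benchmark's actual configuration); it times a unary operation over contiguous elements versus a strided subset, mimicking a non-contiguous view:

```python
import timeit
from array import array

n = 100_000
data = array('d', (float(i) for i in range(n)))

def contiguous():
    # Touch every element in order, as a contiguous tensor would be traversed.
    return sum(v * v for v in data)

def strided(step=4):
    # Touch every step-th element, mimicking a non-contiguous view.
    return sum(data[i] * data[i] for i in range(0, n, step))

t_contig = timeit.timeit(contiguous, number=10)
t_strided = timeit.timeit(strided, number=10)
print(f"contiguous: {t_contig:.4f}s  strided (1/4 of elements): {t_strided:.4f}s")
```

A real comparison would run both branches' binaries on the same pinned core with the same inputs, which is what the single-core setup above is for.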
The intel_pstate driver was used to turn off turbo mode and set the min and max performance to 100.
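On Linux, this is typically done through the intel_pstate sysfs knobs. A hedged sketch follows; it requires root, assumes the intel_pstate driver is active, and the exact procedure used for these timings is not recorded in the thread:

```python
# Hypothetical helper: pin CPU performance via the intel_pstate sysfs interface.
# Requires root and an active intel_pstate driver; do not run blindly.
PSTATE_DIR = "/sys/devices/system/cpu/intel_pstate"

def pin_cpu_frequency(pct: int = 100) -> None:
    settings = {
        "no_turbo": 1,        # disable turbo boost
        "min_perf_pct": pct,  # floor the performance state
        "max_perf_pct": pct,  # cap the performance state
    }
    for name, value in settings.items():
        with open(f"{PSTATE_DIR}/{name}", "w") as f:
            f.write(str(value))
```

Setting min_perf_pct equal to max_perf_pct pins the frequency, which keeps micro-benchmark timings stable across runs.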