
Threading performance issue on Windows #4063

@mseminatore

Description

Using the 0.3.23 pre-built x64 binary DLL. Also tested against a locally built (via cmake/ninja/clang) DLL and static lib, synced both to the head of the develop branch and to the v0.3.23 tag. Code runs very slowly on Windows 10/11, tested on Sandybridge, Haswell and Zen2/Threadripper machines with 4, 8, 24 and 128 cores.

Noted that on several machines with more than 8 cores only one core appears to be busy. On the 8-core Sandybridge machine all cores seem to be taking work.

The same code (both mine and OpenBLAS) runs dramatically (10x) faster on both an M1 iMac and an Intel Mac Pro. Could there be an issue with either the Windows build config or threading?

Also, if I set OPENBLAS_NUM_THREADS=1 on the Windows machines I see up to a 10x speedup.
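
For reference, a minimal sketch (my own illustration, not part of the original report) of applying the same single-thread restriction at runtime instead of via the environment variable, using the openblas_set_num_threads() function exported by the library:

#include <cblas.h>

/* Declared in OpenBLAS' own cblas.h; repeated here explicitly in case a
   generic CBLAS header is used instead. */
extern void openblas_set_num_threads(int num_threads);

int main(void)
{
    /* Equivalent to running "set OPENBLAS_NUM_THREADS=1" before launching
       the process; limits the thread pool for all subsequent BLAS calls. */
    openblas_set_num_threads(1);

    /* ... build and train the network as usual ... */
    return 0;
}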

My workload is a vectorized C NN library with the same network structure and dataset as the TensorFlow Fashion MNIST tutorial. The primary CBLAS functions used are SGER and SGEMV (over 65% of execution time is spent in those functions). SAXPY and SAXPBY are also used, with much lower frequency.
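
To make the call pattern concrete, here is an illustrative sketch (shapes assumed from the 784-128-10 network and mini-batch of 8; this is not the author's actual library code) of the per-sample SGEMV/SGER calls that dominate the profile. Each call operates on a small 128x784 matrix, which is the regime where per-call threading overhead tends to hurt most:

#include <cblas.h>

enum { IN = 784, HID = 128 };   /* layer sizes from the reported network shape */

/* One forward matrix-vector product plus one rank-1 weight update per sample;
   at 60000 training rows this means a very large number of small BLAS calls
   per epoch. */
void layer_step(float *W,        /* HID x IN weights, row-major */
                const float *x,  /* IN-element input            */
                float *h,        /* HID-element activations     */
                const float *g,  /* HID-element gradient        */
                float lr)
{
    /* h = W * x  (SGEMV, one of the two hot functions) */
    cblas_sgemv(CblasRowMajor, CblasNoTrans, HID, IN,
                1.0f, W, IN, x, 1, 0.0f, h, 1);

    /* W -= lr * g * x^T  (SGER, the other hot function) */
    cblas_sger(CblasRowMajor, HID, IN, -lr, g, 1, x, 1, W, IN);
}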

The following two runs were on the same machine, in this case a Windows 11 Surface Laptop, but similar results were seen across many different Windows 10/11 desktop machines.

OpenBLAS 0.3.23 DYNAMIC_ARCH NO_AFFINITY Haswell MAX_THREADS=64
CPU uArch: Haswell
Cores/Threads: 8/8

Training ANN

Network shape: 784-128-10
Optimizer: Adaptive SGD
Loss function: Categorical cross-entropy
Mini-batch size: 8
Training size: 60000 rows

Epoch 1/5
[====================] - loss: 0.31 - LR: 0.05
Epoch 2/5
[====================] - loss: 0.68 - LR: 0.047
Epoch 3/5
[====================] - loss: 0.12 - LR: 0.011
Epoch 4/5
[====================] - loss: 0.38 - LR: 0.016
Epoch 5/5
[====================] - loss: 0.27 - LR: 0.0037

Training time: 140.000000 seconds, 0.466667 ms/step
Test accuracy: 86.97%

C:\dev\NN>set OPENBLAS_NUM_THREADS=1

C:\dev\NN>ann
OpenBLAS 0.3.23 DYNAMIC_ARCH NO_AFFINITY Haswell MAX_THREADS=64
CPU uArch: Haswell
Cores/Threads: 8/1

Training ANN

Network shape: 784-128-10
Optimizer: Adaptive SGD
Loss function: Categorical cross-entropy
Mini-batch size: 8
Training size: 60000 rows

Epoch 1/5
[====================] - loss: 0.31 - LR: 0.05
Epoch 2/5
[====================] - loss: 0.68 - LR: 0.047
Epoch 3/5
[====================] - loss: 0.12 - LR: 0.011
Epoch 4/5
[====================] - loss: 0.38 - LR: 0.016
Epoch 5/5
[====================] - loss: 0.27 - LR: 0.0037

Training time: 13.000000 seconds, 0.043333 ms/step
Test accuracy: 86.97%
