
Threading performance issue on Windows #4063

@mseminatore

Description

Using the 0.3.23 pre-built x64 binary DLL. Also tested against a locally built (via cmake/ninja/clang) DLL and static lib, synced both to the head of the develop branch and to the v0.3.23 tag. Code runs very slowly on Windows 10/11, tested on Sandybridge, Haswell and Zen2/Threadripper machines with 4, 8, 24 and 128 cores.

Noted that on several machines with more than 8 cores only one core appears to be busy. On the 8-core Sandybridge machine all cores seem to be taking work.

The same code (both mine and OpenBLAS) runs dramatically (10x) faster on both an M1 iMac and an Intel Mac Pro. Could there be an issue with either the Windows build config or threading?

Also, if I set OPENBLAS_NUM_THREADS=1 on the Windows machines I see up to a 10x speedup.
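
For reference, a minimal sketch (my own illustration, not part of the original report) of applying the same single-thread restriction at runtime instead of via the environment variable, using the openblas_set_num_threads() function exported by the library:

#include <cblas.h>

/* Declared in OpenBLAS' own cblas.h; repeated here explicitly in case a
   generic CBLAS header is used instead. */
extern void openblas_set_num_threads(int num_threads);

int main(void)
{
    /* Equivalent to running "set OPENBLAS_NUM_THREADS=1" before launching
       the process; limits the thread pool for all subsequent BLAS calls. */
    openblas_set_num_threads(1);

    /* ... build and train the network as usual ... */
    return 0;
}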

My workload is a vectorized C NN library with the same network structure and dataset as the TensorFlow Fashion MNIST tutorial. The primary CBLAS functions used are SGER and SGEMV (over 65% of execution time is spent in those functions). SAXPY and SAXPBY are also used, with much lower frequency.
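
To make the call pattern concrete, here is an illustrative sketch (shapes assumed from the 784-128-10 network and mini-batch of 8; this is not the author's actual library code) of the per-sample SGEMV/SGER calls that dominate the profile. Each call operates on a small 128x784 matrix, which is the regime where per-call threading overhead tends to hurt most:

#include <cblas.h>

enum { IN = 784, HID = 128 };   /* layer sizes from the reported network shape */

/* One forward matrix-vector product plus one rank-1 weight update per sample;
   at 60000 training rows this means a very large number of small BLAS calls
   per epoch. */
void layer_step(float *W,        /* HID x IN weights, row-major */
                const float *x,  /* IN-element input            */
                float *h,        /* HID-element activations     */
                const float *g,  /* HID-element gradient        */
                float lr)
{
    /* h = W * x  (SGEMV, one of the two hot functions) */
    cblas_sgemv(CblasRowMajor, CblasNoTrans, HID, IN,
                1.0f, W, IN, x, 1, 0.0f, h, 1);

    /* W -= lr * g * x^T  (SGER, the other hot function) */
    cblas_sger(CblasRowMajor, HID, IN, -lr, g, 1, x, 1, W, IN);
}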

The following two runs were on the same machine, in this case a Windows 11 Surface Laptop, but similar results were seen across many different Windows 10/11 desktop machines.

OpenBLAS 0.3.23 DYNAMIC_ARCH NO_AFFINITY Haswell MAX_THREADS=64
CPU uArch: Haswell
Cores/Threads: 8/8

Training ANN

Network shape: 784-128-10
Optimizer: Adaptive SGD
Loss function: Categorical cross-entropy
Mini-batch size: 8
Training size: 60000 rows

Epoch 1/5
[====================] - loss: 0.31 - LR: 0.05
Epoch 2/5
[====================] - loss: 0.68 - LR: 0.047
Epoch 3/5
[====================] - loss: 0.12 - LR: 0.011
Epoch 4/5
[====================] - loss: 0.38 - LR: 0.016
Epoch 5/5
[====================] - loss: 0.27 - LR: 0.0037

Training time: 140.000000 seconds, 0.466667 ms/step
Test accuracy: 86.97%

C:\dev\NN>set OPENBLAS_NUM_THREADS=1

C:\dev\NN>ann
OpenBLAS 0.3.23 DYNAMIC_ARCH NO_AFFINITY Haswell MAX_THREADS=64
CPU uArch: Haswell
Cores/Threads: 8/1

Training ANN

Network shape: 784-128-10
Optimizer: Adaptive SGD
Loss function: Categorical cross-entropy
Mini-batch size: 8
Training size: 60000 rows

Epoch 1/5
[====================] - loss: 0.31 - LR: 0.05
Epoch 2/5
[====================] - loss: 0.68 - LR: 0.047
Epoch 3/5
[====================] - loss: 0.12 - LR: 0.011
Epoch 4/5
[====================] - loss: 0.38 - LR: 0.016
Epoch 5/5
[====================] - loss: 0.27 - LR: 0.0037

Training time: 13.000000 seconds, 0.043333 ms/step
Test accuracy: 86.97%
