Skip to content

openblas: +locking#21770

Merged
alalazo merged 2 commits intospack:developfrom
ax3l:topic-OpenBLASdefThread
Feb 23, 2021
Merged

openblas: +locking#21770
alalazo merged 2 commits intospack:developfrom
ax3l:topic-OpenBLASdefThread

Conversation

@ax3l
Copy link
Copy Markdown
Member

@ax3l ax3l commented Feb 18, 2021

BlasPP by ECP SLATE will fail to install by default (spack install blaspp) because

  • the default BLAS installation in Spack is OpenBLAS
  • BlasPP conflicts with threads=none for all recent OpenBLAS releases

Luckily, for OpenBLAS itself introduced a threadsafe compile option with 0.3.7+ aka USE_LOCKING:

   61 # If you want to build a single-threaded OpenBLAS, but expect to call this
   62 # from several concurrent threads in some other program, comment this in for
   63 # thread safety. (This is done automatically for USE_THREAD=1 , and should not
   64 # be necessary when USE_OPENMP=1)
   65 # USE_LOCKING = 1

According to my tests, with spack install --test root blaspp, this exactly addresses the issues in BlasPP tests.

It also seems to be a good option to set by default for OpenBLAS and users that do not need this safety net can always disable it.

cc @G-Ragghianti @mgates3

Follow-up to #20956

@G-Ragghianti
Copy link
Copy Markdown
Contributor

Yes, I agree. This seems to currently be the only way to fix this problem when installing blaspp. Is there a compelling reason to choose pthreads over openmp?

@G-Ragghianti
Copy link
Copy Markdown
Contributor

The alternative that I had attempted is something like this:

depends_on("blas");
depends_on("openblas threads=pthreads", when="^openblas");

But this doesn't work.

@sethrj
Copy link
Copy Markdown
Contributor

sethrj commented Feb 18, 2021

No, please don't! This is an issue for downstream parallel libraries like Trilinos that assume they manage the parallelism. Our particle transport codes use MPI and OpenMPI parallelism and make BLAS calls from each process/thread; with parallel BLAS that means extra threads stepping on each other.

If downstream apps rely on upstream parallel/thread options for a speed boost, then the spack recipes should specify those options. I guess that could be argued either way -- and it's made difficult by BLAS being a virtual package, not a concrete one. The reference netlib BLAS is unthreaded though...

@ax3l ax3l mentioned this pull request Feb 19, 2021
@ax3l
Copy link
Copy Markdown
Member Author

ax3l commented Feb 19, 2021

Is there a compelling reason to choose pthreads over openmp?

I picked pthread because openmp might be less portable as a default (looking at you, AppleClang 👀 )

depends_on("openblas threads=pthreads", when="^openblas");

That's right, Spack unfortunately does not support this syntax.

This is an issue for downstream parallel libraries like Trilinos

Ah dang it, thanks for the note. What do we do now? 😱
I need a somehow default-buildable BLAS++ for a specific geometry we use in WarpX #21787

Does someone know if USE_THREAD alone activates multi-threading in OpenBLAS? Or does it only add locks and mutexes?
I see conda-forge even activatates pthread & OpenMP at the same time.

I also found notes like these

   61 # If you want to build a single-threaded OpenBLAS, but expect to call this
   62 # from several concurrent threads in some other program, comment this in for
   63 # thread safety. (This is done automatically for USE_THREAD=1 , and should not
   64 # be necessary when USE_OPENMP=1)
   65 # USE_LOCKING = 1

(We don't set USE_LOCKING yet.)

And for Power8:

   67 # If you're going to use this library with OpenMP, please comment it in.
   68 # This flag is always set for POWER8. Don't set USE_OPENMP = 0 if you're targeting POWER8.
   69 # USE_OPENMP = 1

There is more:

   79 # You can define the maximum number of threads. Basically it should be less
   80 # than or equal to the number of CPU threads. If you don't specify one, it's
   81 # automatically detected by the build system.
   82 # If SMT (aka. HT) is enabled on the system, it may or may not be beneficial to 
   83 # restrict NUM_THREADS to the number of physical cores. By default, the automatic 
   84 # detection includes logical CPUs, thus allowing the use of SMT.
   85 # Users may opt at runtime to use less than NUM_THREADS threads.
   86 #
   87 # Note for package maintainers: you can build OpenBLAS with a large NUM_THREADS
   88 # value (eg. 32-256) if you expect your users to use that many threads. Due to the way
   89 # some internal structures are allocated, using a large NUM_THREADS value has a RAM
   90 # footprint penalty, even if users reduce the actual number of threads at runtime.
   91 # NUM_THREADS = 24
   92 
   93 # If you have enabled USE_OPENMP and your application would call
   94 # OpenBLAS's calculation API from multiple threads, please comment this in.
   95 # This flag defines how many instances of OpenBLAS's calculation API can actually
   96 # run in parallel. If more than NUM_PARALLEL threads call OpenBLAS's calculation API,
   97 # they need to wait for the preceding API calls to finish or risk data corruption.
   98 # NUM_PARALLEL = 2

NUM_THREADS: Users may opt at runtime to use less than NUM_THREADS threads. <-- hey Trilinos? :)

@alalazo
Copy link
Copy Markdown
Member

alalazo commented Feb 19, 2021

For this one:

depends_on("blas");
depends_on("openblas threads=pthreads", when="^openblas");

did you try with the ASP based solver too (if that doesn't work I would consider it a bug)?

@ax3l
Copy link
Copy Markdown
Member Author

ax3l commented Feb 19, 2021

Random question: does the ASP solver also recognize conflicts, so he does not need to specify threads=pthreads or threads=openmp?

@alalazo
Copy link
Copy Markdown
Member

alalazo commented Feb 19, 2021

@ax3l Yep

@ax3l
Copy link
Copy Markdown
Member Author

ax3l commented Feb 19, 2021

w/o this PR, a general spack solve blaspp works.

The result is might be odd:

[email protected]%[email protected]+cuda~ipo+openmp+shared build_type=RelWithDebInfo cuda_arch=none arch=linux-ubuntu20.04-skylake
    ^[email protected]%[email protected]~doc+ncurses+openssl+ownlibs~qt arch=linux-ubuntu20.04-skylake
        ^[email protected]%[email protected]~symlinks+termlib arch=linux-ubuntu20.04-skylake
            ^[email protected]%[email protected] arch=linux-ubuntu20.04-skylake
        ^[email protected]%[email protected]+systemcerts arch=linux-ubuntu20.04-skylake
            ^[email protected]%[email protected]+cpanm+shared+threads arch=linux-ubuntu20.04-skylake
                ^[email protected]%[email protected] arch=linux-ubuntu20.04-skylake
                ^[email protected]%[email protected] arch=linux-ubuntu20.04-skylake
                    ^[email protected]%[email protected] arch=linux-ubuntu20.04-skylake
            ^[email protected]%[email protected]+optimize+pic+shared arch=linux-ubuntu20.04-skylake
    ^[email protected]%[email protected] arch=linux-ubuntu20.04-skylake
    ^[email protected]%[email protected]~consistent_fpcsr~ilp64+pic+shared threads=none arch=linux-ubuntu20.04-skylake

It picks an old openblas instead of switching the thread variant.

Probably needs to be seen what is picked in context, e.g. with py-warpx ^warpx +psatd dims=rz #21787 (==> Error: 'python')

@G-Ragghianti
Copy link
Copy Markdown
Contributor

For this one:

depends_on("blas");
depends_on("openblas threads=pthreads", when="^openblas");

did you try with the ASP based solver too (if that doesn't work I would consider it a bug)?

I haven't tried the ASP solver (I assume it is only available for testing through "spack solve"?). What do I need to do to use it? I see that clingo is required, but when I installed and loaded it within spack, I still get some error "Error: module 'clingo' has no attribute 'Control'".

@sethrj
Copy link
Copy Markdown
Contributor

sethrj commented Feb 19, 2021

Out of curiosity -- I hadn't heard of blaspp before this but had been hoping that a better API for BLAS existed outside of Trilinos' kind of sketchy Teuchos wrappers. (There are numerous issues with BLAS and LAPACK in mixed-compiler environments, including e.g. the Apple-distributed VecLib, and windows/intel toolchains.) Is Blaspp intended to solve those problems?

For the reasons I mentioned above (working with apps that manage parallelism independently from the underlying linear algebra libraries) we wouldn't be able to use BLASpp if it requires BLAS-level multithreading. What is the underlying reason for the conflict with openblas threads=none? The package recipe just ays "tests will fail", but why would that be the case? Wouldn't the better solution to be a flag that informs BLASpp of thread support, add a +multithreaded (or whatever) variant to BLASpp, and update the BLASpp tests to support unthreaded BLAS installations?

@ax3l
Copy link
Copy Markdown
Member Author

ax3l commented Feb 19, 2021

@G-Ragghianti: I haven't tried the ASP solver (I assume it is only available for testing through "spack solve"?). What do I need to do to use it? I see that clingo is required, but when I installed and loaded it within spack, I still get some error "Error: module 'clingo' has no attribute 'Control'".

@becker33 just told me the way again to try the new solver (config ref):

spack install clingo@master,spack
spack load --first clingo@master,spack

# default value  config:concretizer:original
spack config add config:concretizer:clingo

BLAS++

@G-Ragghianti and @mgates3 are the perfect people to ask for your questions! It's part of ECP SLATE and early adopters that we are we use the nice C++ API in ECP WarpX :)

References for OpenBLAS issues: #20956 and (resolved/hidden) comments in #19311

@G-Ragghianti
Copy link
Copy Markdown
Contributor

Out of curiosity -- I hadn't heard of blaspp before this but had been hoping that a better API for BLAS existed outside of Trilinos' kind of sketchy Teuchos wrappers. (There are numerous issues with BLAS and LAPACK in mixed-compiler environments, including e.g. the Apple-distributed VecLib, and windows/intel toolchains.) Is Blaspp intended to solve those problems?

For the reasons I mentioned above (working with apps that manage parallelism independently from the underlying linear algebra libraries) we wouldn't be able to use BLASpp if it requires BLAS-level multithreading. What is the underlying reason for the conflict with openblas threads=none? The package recipe just ays "tests will fail", but why would that be the case? Wouldn't the better solution to be a flag that informs BLASpp of thread support, add a +multithreaded (or whatever) variant to BLASpp, and update the BLASpp tests to support unthreaded BLAS installations?

Hi Seth. @mgates3 is the lead on blaspp (and SLATE), so he can give you more details if needed. I believe that a recent openblas update introduced something that isn't thread safe unless you configure it with pthreads or openmp support. The reason blaspp can't automatically request thread support in openblas is that blaspp needs to be able to support other blas implementations though the "blas" virtual dependency. This causes a problem with the current default spack concretizer (see my "depends_on" example above). It may work with the new concretizer. I'm about to test that.

@sethrj
Copy link
Copy Markdown
Contributor

sethrj commented Feb 19, 2021

The package recipe just ays "tests will fail", but why would that be the case? Wouldn't the better solution to be a flag that informs BLASpp of thread support, add a +multithreaded (or whatever) variant to BLASpp, and update the BLASpp tests to support unthreaded BLAS installations?

Hi Seth. mgates3 is the lead on blaspp (and SLATE), so he can give you more details if needed. I believe that a recent openblas update introduced something that isn't thread safe unless you configure it with pthreads or openmp support. The reason blaspp can't automatically request thread support in openblas is that blaspp needs to be able to support other blas implementations though the "blas" virtual dependency. This causes a problem with the current default spack concretizer (see my "depends_on" example above). It may work with the new concretizer. I'm about to test that.

Thanks @G-Ragghianti . I understand the spack limitations on propagating the threads variant; but I'd like to know if it's possible to solve the underlying limitation preventing non-threaded OpenBLAS from working with BLASpp. It seems like that should be a standard use case. Even if it required a CMake configure toggle to disable thread support in BLASpp, the Spack problem would be solved -- you could easily set the necessary CMake flag based on the specs in def configure.

@ax3l
Copy link
Copy Markdown
Member Author

ax3l commented Feb 19, 2021

@mgates3 documented the issues here

@mgates3: We see an issue with OpenBLAS >= 0.3.6, verified through 0.3.13. It shows in our batched BLAS routines, which are doing regular BLAS inside an OpenMP parallel section. Regular BLAS routines are okay, as long as they aren't in an OpenMP parallel section. See OpenMathLib/OpenBLAS#3057

@ax3l: Thank you for reporting this upstream, @mgates3.
Looks like we might be able to work-around with locking, let's see where you arrive at with upstream.

More notes on virtual variants, see #1712 and #6300.

@G-Ragghianti @mgates3 have you had the change to verify the OpenBLAS USE_LOCKING option solves your observed test failures? We can easily add this to openblas in Spack and make it a requirement for OpenBLAS with BLAS++.

https://github.com/xianyi/OpenBLAS/blob/1caa44bea999e196918afc095d26ad418db4ba19/Makefile.rule#L61-L65

@G-Ragghianti
Copy link
Copy Markdown
Contributor

G-Ragghianti commented Feb 19, 2021

@G-Ragghianti @mgates3 have you had the change to verify the OpenBLAS USE_LOCKING option solves your observed test failures? We can easily add this to openblas in Spack and make it a requirement for OpenBLAS with BLAS++.

I haven't tested USE_LOCKING for openblas, but I will soon.

I can verify that the clingo concretizer does seem to correctly handle the following simultaneous dependencies:

depends_on('blas')
depends_on('openblas threads=openmp', when='^openblas')

This allows me to use a different blas provider with no complication.

@ax3l ax3l changed the title openblas: default +pthreads openblas: +locking Feb 20, 2021
BlasPP by ECP SLATE will fail to install by default
(`spack install blaspp`) because:
- the default BLAS installation in Spack is OpenBLAS
- BlasPP conflicts with `threads=none` for all recent OpenBLAS releases

Luckily, for OpenBLAS itself introduced a threadsafe compile option
with 0.3.7+ aka `USE_LOCKING`:
```
   61 # If you want to build a single-threaded OpenBLAS, but expect to call this
   62 # from several concurrent threads in some other program, comment this in for
   63 # thread safety. (This is done automatically for USE_THREAD=1 , and should not
   64 # be necessary when USE_OPENMP=1)
   65 # USE_LOCKING = 1
```

According to my tests, with `spack install --test root blaspp`,
this exactly addresses the issues in BlasPP tests.

It also seems to be a good option to set by default for OpenBLAS and
users that do not need this safety net can always disable it.
@ax3l
Copy link
Copy Markdown
Member Author

ax3l commented Feb 20, 2021

I would even go more concrete to use

depends_on('openblas threads=openmp', when='+openmp ^openblas')

(Works well with clingo, does fail with the original dependency solver)

I haven't tested USE_LOCKING for openblas, but I will soon.

Wuhu, thanks to the great test/check implementation in the blaspp package, I gave it a try already: spack install --test root blaspp. This is the issue I saw:

217 tests FAILED for batch-gemm
...
BLAS : Bad memory unallocation!

Indeed, enabling USE_LOCKING=1 introduced with OpenBLAS 0.3.7+ solves the test issues! 🎉

I updated this PR to reflect the changes in OpenBLAS and BLAS++ and will wait for your and @mgates3's confirmation that this indeed is the solution.

I already did spack install --test root blaspp tests with OpenBLAS (Ubuntu 20.04/x86_64, GCC 10.2, original concretizer):

0.3.6

     30       BLAS library not found.
  >> 31    CMake Error at CMakeLists.txt:295 (message):
     32      BLAS++ requires a BLAS library and none was found.  Ensure that it is
     33      accessible in environment variables $CPATH, $LIBRARY_PATH, and
     34      $LD_LIBRARY_PATH.

0.3.7

  • spack install --test root blaspp ^[email protected]
    same CMake error as with OpenBLAS 0.3.6

Otherwise this should work because of the +locking default that came in this OpenBLAS release.

  • spack install --test root blaspp ^[email protected] threads=openmp
    same CMake error

  • Do we want to simply add a conflict for selected old OpenBLAS releases in BLAS++'s package.py? Maybe that's also solved with newer BLAS++ releases that reworked the CMake, but the nwest release is still 2020.10.02 (latest tag) @mgates3

0.3.8

0.3.9

0.3.13 (latest)

Probably the most important test, because this is the default user experience with Spack:

  • spack install --test root blaspp
    works

Just checking, because yolo:

  • spack install --test root blaspp ^openblas threads=openmp
    works

Any without locking

  • spack install --test root blaspp ^openblas ~locking
    conflict added for all OpenBLAS releases >=0.3.7 (see 0.3.6 note above)

@ax3l ax3l force-pushed the topic-OpenBLASdefThread branch from ef4955f to fe3ccfe Compare February 20, 2021 07:38
@alalazo
Copy link
Copy Markdown
Member

alalazo commented Feb 20, 2021

Lot of things happening overnight... 🙂 FYI, to bootstrap clingo there's also #21446 which "just" requires to have a compiler supporting C++14. It currently installs clingo in $HOME/.spack/bootstrap for the interpreter being used to run Spack and imports the module on the fly.

@G-Ragghianti clingo can be used beyond spack solve. You just have to set in your configuration:

config:
  concretizer: clingo

and ensure Spack can import the clingo module. There are still a few limitations to be lifted, and it's still slower than the original concretizer (see #21289 for possible improvements) but it can already work around conflicts and take dependency into account (as you noted above) in depends_on directives.

@ax3l
Copy link
Copy Markdown
Member Author

ax3l commented Feb 21, 2021

Lot of things happening overnight... 🙂

Friday evenings aren't as they used to be 😅
Fixed a little pep8 issue now.

Solve issues with newer OpenBLAS by requiring
`+locking` over none-default threading options.
@ax3l ax3l force-pushed the topic-OpenBLASdefThread branch from fe3ccfe to 9c4be71 Compare February 21, 2021 07:00
Copy link
Copy Markdown
Member

@alalazo alalazo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically LGTM, just a minor question.

conflicts('+consistent_fpcsr', when='threads=none',
msg='FPCSR consistency only applies to multithreading')

conflicts('threads=pthreads', when='~locking', msg='Pthread support requires +locking')
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we add also:

conflicts('threads=openmp', when='~locking', msg='OpenMP support requires +locking')

Copy link
Copy Markdown
Member Author

@ax3l ax3l Feb 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought so for a moment, but the OpenBLAS docs on USE_LOCKING read:

This is done automatically for USE_THREAD=1, and should not be necessary when USE_OPENMP=1.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, not talking about the recipe, just the spec. I think openblas threads=openmp ~locking won't make sense (but we can work on that in a later PR).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if I follow the docs that this is advertised as working, but I would not default to it reading the wording :D

@mgates3
Copy link
Copy Markdown

mgates3 commented Feb 22, 2021

No, please don't! This is an issue for downstream parallel libraries like Trilinos that assume they manage the parallelism. Our particle transport codes use MPI and OpenMPI parallelism and make BLAS calls from each process/thread; with parallel BLAS that means extra threads stepping on each other.

Oddly, this is exactly what we also want in BLAS++: that BLAS++ does the OpenMP parallelism for a batched-BLAS call, and BLAS itself is single threaded. But it seems (at least in some environments) that non-threaded OpenBLAS is not thread safe, so we get failures. Hence the conflict with openblas threads=none. Adding locking without threading would be sufficient for BLAS++. Same applies to SLATE, which is the parent project that BLAS++ was built within, that SLATE handles the parallelism and BLAS should generally be single threaded.

@mgates3
Copy link
Copy Markdown

mgates3 commented Feb 22, 2021

* [x]  `spack install --test root blaspp ^[email protected] threads=openmp`
  same CMake error

* [ ]  Do we want to simply add a conflict for selected old OpenBLAS releases in BLAS++'s `package.py`? Maybe that's also solved with newer BLAS++ releases that reworked the CMake, but the nwest release is still 2020.10.02 (latest tag) @mgates3

We can investigate these issues. I thought with threads=openmp it would be working. There have been some fixes in BLAS++ CMake, but not major CMake work since 2020.10.02, to my recollection. Most newer work has been adding ROCm support.

Mark

@sethrj
Copy link
Copy Markdown
Contributor

sethrj commented Feb 22, 2021

Interesting, I was assuming that openmp and pthreads both meant "use this threading implementation to maximize CPU utilization" rather than "be aware of downstream library's utilization of these threading implementations". For the OpenMP case, it sounds like it's the latter?

@ax3l ax3l requested a review from alalazo February 23, 2021 02:43
@alalazo alalazo merged commit bcc3701 into spack:develop Feb 23, 2021
@alalazo
Copy link
Copy Markdown
Member

alalazo commented Feb 23, 2021

Thanks @ax3l !

@ax3l ax3l deleted the topic-OpenBLASdefThread branch February 23, 2021 08:17
@ax3l
Copy link
Copy Markdown
Member Author

ax3l commented Feb 23, 2021

Thank you all, too! Epic that we got this solved! :)

@G-Ragghianti
Copy link
Copy Markdown
Contributor

I would even go more concrete to use

depends_on('openblas threads=openmp', when='+openmp ^openblas')

(Works well with clingo, does fail with the original dependency solver)

I haven't tested USE_LOCKING for openblas, but I will soon.

Wuhu, thanks to the great test/check implementation in the blaspp package, I gave it a try already: spack install --test root blaspp. This is the issue I saw:

217 tests FAILED for batch-gemm
...
BLAS : Bad memory unallocation!

Indeed, enabling USE_LOCKING=1 introduced with OpenBLAS 0.3.7+ solves the test issues! 🎉

I updated this PR to reflect the changes in OpenBLAS and BLAS++ and will wait for your and @mgates3's confirmation that this indeed is the solution.

I already did spack install --test root blaspp tests with OpenBLAS (Ubuntu 20.04/x86_64, GCC 10.2, original concretizer):

0.3.6

* [x]  `spack install --test root blaspp ^[email protected]`
  Spack abort by default with conflict (original concretizer).
  `USE_LOCKING` came in [OpenBLAS 0.3.7 according to their docs](https://github.com/xianyi/OpenBLAS/blob/v0.3.7/Makefile.rule#L61-L65), so I added a new specific conflict

* [x]  `spack install --test root blaspp ^[email protected] threads=openmp`
  BLAS++ 2020.10.02 (latest tag) CMake issue:
     30       BLAS library not found.
  >> 31    CMake Error at CMakeLists.txt:295 (message):
     32      BLAS++ requires a BLAS library and none was found.  Ensure that it is
     33      accessible in environment variables $CPATH, $LIBRARY_PATH, and
     34      $LD_LIBRARY_PATH.

0.3.7

* [x]  `spack install --test root blaspp ^[email protected]`
  same CMake error as with OpenBLAS 0.3.6

Otherwise this should work because of the +locking default that came in this OpenBLAS release.

* [x]  `spack install --test root blaspp ^[email protected] threads=openmp`
  same CMake error

* [ ]  Do we want to simply add a conflict for selected old OpenBLAS releases in BLAS++'s `package.py`? Maybe that's also solved with newer BLAS++ releases that reworked the CMake, but the nwest release is still 2020.10.02 (latest tag) @mgates3

0.3.8

* [x]  `spack install --test root blaspp ^[email protected]`
  works

0.3.9

* [x]  `spack install --test root blaspp ^[email protected]`
  works

0.3.13 (latest)

Probably the most important test, because this is the default user experience with Spack:

* [x]  `spack install --test root blaspp`
  works

Just checking, because yolo:

* [x]  `spack install --test root blaspp ^openblas threads=openmp`
  works

Any without locking

* [x]  `spack install --test root blaspp ^openblas ~locking`
  conflict added for all OpenBLAS releases >=0.3.7 (see 0.3.6 note above)

HI Axel. I've found that these cmake failures (unable to detect blas) only occur when using gcc >=10. We will be looking into it. Thanks for finding it.

@ax3l
Copy link
Copy Markdown
Member Author

ax3l commented Mar 4, 2021

Thanks @G-Ragghianti! :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants