
BLD: use BUFFERSIZE=20 in OpenBLAS#17759

Merged
charris merged 1 commit into numpy:master from mattip:openblas-buffersize
Nov 14, 2020

Conversation

@mattip
Member

@mattip mattip commented Nov 12, 2020

xref OpenMathLib/OpenBLAS#2970 where it was suggested to compile OpenBLAS with BUFFERSIZE=20 to revert the memory footprint to what it was in OpenBLAS 0.3.9 (we now use 0.3.12). This was done in MacPython/openblas-libs#46, and this PR uses it in NumPy.
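For context, BUFFERSIZE is an OpenBLAS make-time variable; a minimal build sketch follows (the clone URL and surrounding steps are illustrative, the authoritative invocation lives in the MacPython/openblas-libs build scripts):

```shell
# Build OpenBLAS with the smaller per-thread GEMM buffer.
# BUFFERSIZE=20 restores the 0.3.9-era buffer size; later releases
# default to a larger value, inflating the per-thread memory footprint.
git clone https://github.com/xianyi/OpenBLAS.git
cd OpenBLAS
make BUFFERSIZE=20               # pass the buffer-size override to the build
make PREFIX=/opt/openblas install
```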

xref issue gh-17674, gh-17684 which triggered the discussion. Once we have wheels that use this, we should ask the reporters on those issues @moylop260 and @MarkBel to try it out.

@charris charris merged commit c231355 into numpy:master Nov 14, 2020
@charris
Member

charris commented Nov 14, 2020

Thanks Matti, let's see how it goes. The relevant wheels should get built tonight.

@charris charris added the 03 - Maintenance, 36 - Build and 09 - Backport-Candidate labels and removed the 03 - Maintenance label Nov 14, 2020
@mattip
Member Author

mattip commented Nov 15, 2020

@moylop260

moylop260 commented Nov 15, 2020

FYI, our CI is reproducing an error.

I have not checked the output yet, but I think it is related.

If you have a script to build the package (e.g. the wheel), I can run it before the release if you want.

Or if you want SSH access, just write me.

Regards!

Better, let me check whether the error is related, since the installed numpy version is:

numpy-1.19.4-cp36-cp36m-manylinux2010_x86_64.whl

@charris
Member

charris commented Nov 16, 2020

@moylop260 You can install the nightly wheels like so:

python3 -mpip install -i https://pypi.anaconda.org/scipy-wheels-nightly/simple numpy

@moylop260

python3 -mpip install -i https://pypi.anaconda.org/scipy-wheels-nightly/simple numpy

  • numpy==1.20.0.dev0+a645106

Result:

Starting program: /.repo_requirements/virtualenv/python3.6/bin/python3 /home/odoo/odoo-12.0/odoo-bin -d openerp_template -i benandfrank --xmlrpc-port=18069 --logfile=out.txt --workers=0 --max-cron-threads=0
warning: Error disabling address space randomization: Operation not permitted
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7f5d94966700 (LWP 26925)]
[New Thread 0x7f5d8cb35700 (LWP 26927)]
[New Thread 0x7f5d8c334700 (LWP 26928)]
[New Thread 0x7f5d83b33700 (LWP 26929)]
[New Thread 0x7f5d7b332700 (LWP 26930)]
[New Thread 0x7f5d72b31700 (LWP 26931)]
[New Thread 0x7f5d6a330700 (LWP 26932)]
[New Thread 0x7f5d61b2f700 (LWP 26933)]
[New Thread 0x7f5d5932e700 (LWP 26934)]
[New Thread 0x7f5d50b2d700 (LWP 26935)]
[New Thread 0x7f5d4832c700 (LWP 26936)]
[New Thread 0x7f5d3fb2b700 (LWP 26937)]
[New Thread 0x7f5d3732a700 (LWP 26938)]
[New Thread 0x7f5d2eb29700 (LWP 26939)]
[New Thread 0x7f5d26328700 (LWP 26940)]
[New Thread 0x7f5d1db27700 (LWP 26941)]
[New Thread 0x7f5d15326700 (LWP 26942)]
[New Thread 0x7f5d0cb25700 (LWP 26943)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f5d15326700 (LWP 26942)]
0x0000000000000000 in ?? ()
(gdb)

@moylop260

moylop260 commented Nov 17, 2020

Reverting to numpy==1.19.4, it works well again.

@mattip
Member Author

mattip commented Nov 17, 2020

@moylop260 when it segfaults, is the Docker container using all the memory allocated to it?

@mattip
Member Author

mattip commented Nov 17, 2020

Also, could you try `export OPENBLAS_CORETYPE=Haswell` or `export OPENBLAS_CORETYPE=Prescott` to reduce the CPU features used? Maybe the CPU detection code is not working correctly.
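A note for anyone trying this: OpenBLAS reads these variables when the shared library is loaded, i.e. at the first import numpy, so they must be in the environment before Python starts. A minimal smoke-test sketch (the dot product is only there to exercise the loaded library):

```shell
# Must be exported before the Python process starts; setting it after
# numpy is imported has no effect on the already-loaded library.
export OPENBLAS_CORETYPE=Prescott   # or Haswell, to limit kernel selection
python3 -c "import numpy; print(numpy.dot([1.0, 2.0], [3.0, 4.0]))"
```

If the crash is in the CPU-detection or kernel-dispatch path, pinning an older core type such as Prescott should make it disappear.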

@moylop260

moylop260 commented Nov 18, 2020

Without environment variables:

python3 -m pip install memory_profiler

mprof run gdb --args python3 ~/odoo-12.0/odoo-bin

The result, after `Program received signal SIGSEGV, Segmentation fault.`, was:

[Thread 0x7fcfbcba8700 (LWP 28452) exited]
[Thread 0x7fcf78ba0700 (LWP 28460) exited]
[New Thread 0x7fced738d700 (LWP 28484)]
Unhandled exception in thread started by <bound method Thread._bootstrap of <Thread(odoo.service.httpd, initial daemon)>>
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 884, in _bootstrap
MemoryError
libgcc_s.so.1 must be installed for pthread_cancel to work

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7fced738d700 (LWP 28484)]
0x00007fcfd9cf6c37 in raise () from /lib/x86_64-linux-gnu/libc.so.6

With `export OPENBLAS_CORETYPE=Haswell`:

[Thread 0x7f6f656c9700 (LWP 28393) exited]
[Thread 0x7f6f6e6cb700 (LWP 28391) exited]
[New Thread 0x7f6e776ad700 (LWP 28426)]
Unhandled exception in thread started by <bound method Thread._bootstrap of <Thread(odoo.service.httpd, initial daemon)>>
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 884, in _bootstrap
MemoryError
libgcc_s.so.1 must be installed for pthread_cancel to work

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7f6e776ad700 (LWP 28426)]
0x00007f6f7a016c37 in raise () from /lib/x86_64-linux-gnu/libc.so.6


With `export OPENBLAS_CORETYPE=Prescott`:

[Thread 0x7f72e8d88700 (LWP 28736) exited]
[Thread 0x7f72e8587700 (LWP 28737) exited]
[New Thread 0x7f71f1d6a700 (LWP 28771)]
Unhandled exception in thread started by <bound method Thread._bootstrap of <Thread(odoo.service.httpd, initial daemon)>>
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 884, in _bootstrap
MemoryError
libgcc_s.so.1 must be installed for pthread_cancel to work

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7f71f1d6a700 (LWP 28771)]
0x00007f72f46d3c37 in raise () from /lib/x86_64-linux-gnu/libc.so.6


@moylop260

moylop260 commented Nov 18, 2020

Using numpy==1.19.4

No error (I just forced an exit(1)):

[Thread 0x7f78a4292700 (LWP 29320) exited]
[Thread 0x7f789ca8f700 (LWP 29323) exited]
[New Thread 0x7f7860a77700 (LWP 29352)]
[Thread 0x7f7860a77700 (LWP 29352) exited]
[Inferior 1 (process 29312) exited with code 01]
(gdb) q


@moylop260

@charris @mattip

You have SSH access with your GitHub public keys.

You can use the following command to connect:

To reproduce the error, just run the following command:

  • gdb --args python3 ~/odoo-13.0/odoo-bin -i account_loan

After about 3 seconds you will see the error.

NOTE:
You can install new packages using pip install..., since it is a virtualenv (sudo is not required).
You can uninstall or re-install whatever you want, since the docker image is backed up.

You can use the following image:

I can't reproduce it using other kinds of processors, even if they use the same Docker version and so on.

If you are lucky and can reproduce it, you can use:
docker pull vauxoo/numpy_memerror
docker run -it --name=numpy_memerror --entrypoint=bash vauxoo/numpy_memerror

Inside the container:
/etc/init.d/postgresql start
gdb --args python3 ~/odoo-13.0/odoo-bin -i account_loan

@mattip
Member Author

mattip commented Nov 18, 2020

Thanks. I tried it out. The machine seems to have only ~8.5GB of memory free. With that little memory available, you should limit the number of threads: `OMP_NUM_THREADS=8 python3 ~/odoo-13.0/odoo-bin -i account_loan`, which allows the program to run for more than 10 secs.

$ free -h
             total       used       free     shared    buffers     cached
Mem:          251G       243G       8.5G       7.5G        29G       141G
-/+ buffers/cache:        72G       179G
Swap:         2.2G       1.2G       1.0G

@moylop260

@mattip

Running with the environment variable OMP_NUM_THREADS=8, it runs fine.

Thank you!

The weird part here is that we are not using numpy directly; it is just imported (import numpy), and at that point the memory is already overloaded.

Is this expected behaviour?

@moylop260

moylop260 commented Nov 18, 2020

We have another server where all processors were used at 100% just by importing numpy.

What environment variables should I set to entirely avoid the processor and memory overload (considering that we only use import numpy)?

Is it possible to set the lowest possible values by default?

I mean, OMP_NUM_THREADS=1 by default, to be compatible with all devices; if you want to use more resources, a customization of environment variables would be required. (This is how most database managers work, e.g. postgresql ships with the lowest values by default.)

I don't know; I'm just trying to work around our production errors.

We have auto-scaling server deploys, but with import numpy consuming a lot of memory and processors, many servers will be deployed even when numpy is not actually used.

Thanks in advance!
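There is no numpy-side switch for this, but the effect can be approximated in the deployment environment itself; a sketch with conservative, illustrative values (tune per host):

```shell
# Export in the service environment (init script, Docker ENV, etc.)
# before Odoo/Python starts, so OpenBLAS picks the values up on import:
export OPENBLAS_NUM_THREADS=1   # cap OpenBLAS's own thread pool
export OMP_NUM_THREADS=1        # cap OpenMP-built BLAS variants too
python3 ~/odoo-12.0/odoo-bin    # numpy now initializes with a minimal pool
```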

@charris
Member

charris commented Nov 18, 2020

we are not using numpy directly; it is just imported (import numpy), and at that point the memory is overloaded

Sounds like there is some pre-allocation going on.

@bashtage
Contributor

I mean, OMP_NUM_THREADS=1 by default, to be compatible with all devices; if you want to use more resources, a customization of environment variables would be required. (This is how most database managers work, e.g. postgresql ships with the lowest values by default.)

That would be a real loss, IMO, for most users, who expect BLAS in NumPy to be multithreaded by default. It might be possible to consider something more modest, like 8 threads, similar to what NumExpr does.

@bashtage
Contributor

We have auto-scaling server deploys, but with import numpy consuming a lot of memory and processors, many servers will be deployed even when numpy is not actually used.

One solution to your problem is to build NumPy from source without BLAS. This will have minimal memory usage and will be just as fast as the wheels if you don't use BLAS, which it sounds like you may not.
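A sketch of one long-documented way to do that; the BLAS=None/LAPACK=None/ATLAS=None switches come from numpy's build documentation of that era, and --no-binary forces a source build instead of the manylinux wheel:

```shell
# Build numpy from source with no external BLAS/LAPACK backend; numpy
# falls back to its bundled routines (slower for large linear algebra,
# but with no OpenBLAS thread pool or per-thread buffers at all).
BLAS=None LAPACK=None ATLAS=None \
    python3 -m pip install --no-binary numpy numpy
```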

fernandahf added a commit to vauxoo-dev/docker-ubuntu-base that referenced this pull request Feb 3, 2021
Currently, if numpy is available in the modules, Odoo tries to load it even if you are not using it, and the system goes down, but only on one type of processor.

Currently we know of 2 servers reproducing the error:

    B&F-production
    Runbot

More info:

numpy/numpy#17674
numpy/numpy#17759

It is reproducible in the following MR:

https://git.vauxoo.com/vauxoo/lasec/-/merge_requests/197

Check the following discussion: https://odoo-community.org/groups/contributors-15/contributors-186006?mode=thread&date_begin=&date_end=

OpenBLAS creates a number of threads equal to the number of core threads available (56 in my case, on the production server), so it quickly reached limit_memory_hard and the process was killed (SIGSEGV). Forcing OPENBLAS_NUM_THREADS=1 fixed the issue.
@mattip mattip deleted the openblas-buffersize branch April 8, 2021 11:13
fernandahf added a commit to vauxoo-dev/docker-ubuntu-base that referenced this pull request Apr 15, 2021
fernandahf added a commit to vauxoo-dev/maintainer-quality-tools that referenced this pull request Apr 19, 2021
The reason websocket-client was deactivated is:

numpy has the following issue:
 - numpy/numpy#13059

It is a corner case involving a particular kind of processor, Docker, and Python 3.

More info:

 - numpy/numpy#17674
 - numpy/numpy#17759

But who is using numpy?

Several projects use libraries that depend on numpy:
./web/requirements.txt:2:bokeh==1.1.0
./reporting-engine/requirements.txt:1:altair
./icm/requirements.txt:1:pandas
./maintainer-quality-tools/requirements.txt:7:websocket-client

So we ran odoo-bin with loglevel=debug to find the last line before the crash.

This was the path:
 - Last log line: https://github.com/odoo/odoo/blob/92ef3b2dd4655913198d10d06598b799fdcae6d0/odoo/modules/loading.py#L152
 - Using pdb, I traced it to the following line: https://github.com/odoo/odoo/blob/92ef3b2dd4655913198d10d06598b799fdcae6d0/odoo/modules/module.py#L368
 - The last module imported was `resource`: https://github.com/odoo/odoo/tree/92ef3b2dd4655913198d10d06598b799fdcae6d0/addons/resource
 - I removed imports one by one and reproduced the error again and again, until one change fixed it: commenting out the following import: https://github.com/odoo/odoo/blob/92ef3b2dd4655913198d10d06598b799fdcae6d0/addons/resource/tests/common.py#L4
 - I then debugged that file line by line, and finally found the problematic import: https://github.com/odoo/odoo/blob/92ef3b2dd4655913198d10d06598b799fdcae6d0/odoo/tests/common.py#L50
 - So why does this raise the error? Because the following line: https://github.com/websocket-client/websocket-client/blob/29c15714ac9f5272e1adefc9c99b83420b409f63/websocket/_abnf.py#L34
   imports numpy when you are using Python 3. numpy is installed because of the requirements.txt files above, and that is where the disaster happened.

We could have removed all the numpy requirements, but there are a lot of them. We decided the better option was to avoid the websocket line that imports numpy (the faster solution) by not installing websocket-client.

However, after researching, we found that:

OpenBLAS creates a number of threads equal to the number of core threads available, so it quickly reached limit_memory_hard and the process was killed (SIGSEGV). Forcing OPENBLAS_NUM_THREADS=1 fixed the issue.

After building a test image to reproduce the error and setting that environment variable, it was fixed.

That change was applied in the following PRs:

 - Vauxoo/docker-ubuntu-base#89
 - Vauxoo/docker-ubuntu-base#90

With the change applied in docker-ubuntu-base, it is no longer necessary to avoid importing websocket-client (JS tests work again); we are covered by the env var OPENBLAS_NUM_THREADS.
moylop260 pushed a commit to Vauxoo/maintainer-quality-tools that referenced this pull request Apr 19, 2021
fernandahf added a commit to vauxoo-dev/docker-odoo-image that referenced this pull request Apr 19, 2021
fernandahf added a commit to vauxoo-dev/docker-odoo-image that referenced this pull request Apr 19, 2021
moylop260 pushed a commit to Vauxoo/docker-odoo-image that referenced this pull request Apr 19, 2021