Skip to content

Conversation

@seberg
Copy link
Member

@seberg seberg commented Oct 30, 2025

If an arm system uses SME some BLAS versions just set FPEs spuriously. The culprit is really Accelerate so we might limit this to Mac OS as well (and then with an SME check also -- see changes prior to this commit).

However, some OpenBLAS versions also caused this, although OpenBLAS is likely to clear the FPEs on their side.

Closes gh-29820

If an arm system uses SME some BLAS versions just set FPEs spuriously.
The culprit is really Accelerate so we might limit this to Mac OS
as well (and then with an SME check also -- see changes prior to this
commit).

However, some OpenBLAS versions also caused this, although OpenBLAS
is likely to clear the FPEs on their side.

Closes numpygh-29820
@seberg seberg added the 09 - Backport-Candidate PRs tagged should be backported label Oct 30, 2025
* This checks always on all ARM (it is a small check overall).
*/
#if defined(__APPLE__) && defined(__aarch64__) && defined(ACCELERATE_NEW_LAPACK)
#if defined(__aarch64__)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't sure what to put here, so went with the broad __aarch64__. We could add __APPLE__ if we think this is largely Accelerate only (i.e. OpenBLAS considered this a bug on their side, but for now I am not sure...).

It's not like it is utterly terrible if we have the runtime check for the global...

@mattip can you test it on an M4, my M1 hits the checks, but wouldn't actually trigger the spurious FPEs.

#endif
CBLAS_FUNC(cblas_dgemm)(
CblasRowMajor, CblasNoTrans, CblasNoTrans, 20, 20, 20, 1.,
x, 20, y, 20, 0., res, 20);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the jobs segfaulted (previous run), did I get these parameters wrong?

@mattip
Copy link
Member

mattip commented Oct 30, 2025

Why do the check in C when we can do it in python in __init__.py::_sanity_check?

@seberg
Copy link
Member Author

seberg commented Oct 30, 2025

No reason, I think I just ended there because it seemed like the smaller change to keep some of the logic in-place to skip it on unaffected machines (and a single call is simple enough, although something is off on windows!?). Happy to just move it to Python fully as well.

@mattip
Copy link
Member

mattip commented Oct 30, 2025

I think it would be better in python, no need to worry about calling conventions and allocations. You could even limit the check to np._core._multiarray_umath.__cpu_features__['SVE'] is True. We don't currently have a SME check.

@seberg
Copy link
Member Author

seberg commented Oct 30, 2025

OK, let me just move it to Python.

You could even limit the check to np._core._multiarray_umath.__cpu_features__['SVE'] is True

I first had that, but then read this:

It supports the Scalable Matrix Extension (SME) but not the Scalable Vector Extension (SVE). Because of the lack of SVE support, the LLVM compiler officially flags the M4 as supporting ARMv8.7a.[7]

@mattip
Copy link
Member

mattip commented Oct 30, 2025

Ahh, oops, right. Here is what the M4 reports:

>>> [k for k,v in np._core._multiarray_umath.__cpu_features__.items() if v is True]
['NEON', 'NEON_FP16', 'NEON_VFPV4', 'ASIMD', 'FPHP', 'ASIMDHP', 'ASIMDDP']

@seberg
Copy link
Member Author

seberg commented Nov 13, 2025

Ahh, oops, right. Here is what the M4 reports:

Which of these do you actually want to add? Right now it's hardcoded to just ARM on the C side, although because of that I am also not sure I really prefer moving this to Python. Of course now it would be easier to add this, but I am not sure what (or if it even matters if we limit to ARM already).

@seberg seberg force-pushed the runtime-fpe-blas-check branch from 7fccc18 to 4c6f597 Compare November 13, 2025 19:28
@seberg seberg force-pushed the runtime-fpe-blas-check branch from 4c6f597 to 7c8625e Compare November 13, 2025 19:29
@charris
Copy link
Member

charris commented Nov 14, 2025

The circleci failure is unrelated.

Copy link
Member

@mattip mattip left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could _blas_supports_fpe() (called with no argument) return the current state? Then

  • I would prefer we not blanket-disable FPEs on ARM64, and use the import test to disable it if needed.
  • add the current state of _blas_supports_fpe() to show_runtime().

x @ x
except FloatingPointError:
res = _core._multiarray_umath._blas_supports_fpe(False)
if res: # res was not modified (hardcoded to True for now)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

res is not modified?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_blas_supports_fpe

I made it _blas_supports_fpe(None) just to keep the C code a very bit simpler, didn't seem useful enough to care?

res is the state after changing it. If we are not on an ARM machine, it will still be True.

add the current state of _blas_supports_fpe() to show_runtime().

Hmmm, yeah that makes sense.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the show runtime

I would prefer we not blanket-disable FPEs on ARM64, and use the import test to disable it if needed.

Sorry, but what do you mean? On non ARM64, it's blanket enabled. On ARM64 it's disabled based on this test. I have not bothered to skip this test on non-ARM64, instead it might give that warning when things seem off.

@charris
Copy link
Member

charris commented Nov 14, 2025

Is this ready? I need to make a decision about backporting it.

@seberg
Copy link
Member Author

seberg commented Nov 14, 2025

I suspect so, unless you/Matti have thoughts on it. Unfortunately, I can't confirm for sure that it works :(.

@charris
Copy link
Member

charris commented Nov 15, 2025

I'll probably put this in, but wait for @mattip tomorrow. I do wonder why arm did things that way for sme, too hard?

@mattip
Copy link
Member

mattip commented Nov 15, 2025

Did we have reports of aarch64 (linux) emitting FPEs? I thought the problem was only on macOS.

@seberg
Copy link
Member Author

seberg commented Nov 16, 2025

Did we have reports of aarch64 (linux) emitting FPEs? I thought the problem was only on macOS.

Not yet at least, I thought openblas may well be broken on others also (But I guess openblas may work around it).

@mattip mattip changed the title ENH: Make FPE blas check a runtime check for all arm systems ENH: Make FPE blas check a runtime check for all apple arm systems Nov 16, 2025
@mattip mattip merged commit 804e2cd into numpy:main Nov 16, 2025
77 checks passed
@mattip
Copy link
Member

mattip commented Nov 16, 2025

Thanks @seberg

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

BUG: "divide by zero encountered in matmul" on MacOS M4 with numpy v2.3.3

3 participants