BUG: array negation (i.e. minus sign) is wrong on 32 bit OS (likely related to SIMD on non memory-aligned array) #26775

@lesteve

Describe the issue:

This was originally noticed in scikit-learn and reported in scikit-learn/scikit-learn#27506.

According to scikit-learn/scikit-learn#27506 (comment), this appeared when Debian started shipping numpy 1.26, which introduced SIMD, combined with the fact that the array is not on a 64-bit boundary.

I don't know too much about the SIMD intricacies, but in an ideal world it would be nice to understand whether the scikit-learn Cython code is to blame for not creating the array on a 64-bit boundary (should it be 64-bit on a 32-bit OS, by the way?) or whether numpy should be able to deal with unaligned arrays better ...
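To make the alignment question concrete, here is a minimal sketch of how one might inspect where an array's data actually sits in memory. The helper name `describe_alignment` is hypothetical (it is not part of numpy); it only uses `ndarray.ctypes.data`, `dtype.alignment` and `ndarray.flags.aligned`. On a 32-bit x86 build I would expect `dtype_alignment` for float64 to be smaller than 8, which would mean numpy can consider an array "aligned" even when it is not on a 64-bit boundary, but that is an assumption to be checked on the affected platform.

```python
import numpy as np


def describe_alignment(arr):
    """Hypothetical helper: report how the data pointer of `arr` is aligned."""
    address = arr.ctypes.data  # integer address of the first element
    return {
        "address_mod_8": address % 8,    # 0 means on a 64-bit boundary
        "address_mod_16": address % 16,  # 0 means on a 128-bit (SSE) boundary
        "dtype_alignment": arr.dtype.alignment,  # alignment numpy requires for this dtype
        "flags_aligned": bool(arr.flags.aligned),  # numpy's own verdict
    }


info = describe_alignment(np.zeros(5, dtype=np.float64))
print(info)
```

Running the same helper on `clf.tree_.impurity` inside the i386 container should show whether that array is off the 64-bit boundary while numpy still reports it as aligned.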

Here is a way to reproduce, taken from scikit-learn/scikit-learn#27506 (comment). I am guessing there is a way to reproduce without scikit-learn, for example with a memmap and an offset (we had issues with unaligned memory arrays in joblib a while ago, where I learned this the hard way; see joblib/joblib#563 if you are really curious).
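As a starting point for a numpy-only reproducer, the sketch below builds a float64 array whose data pointer sits 4 bytes past a 64-bit boundary by taking a view into an over-allocated byte buffer (rather than a memmap), then compares `-arr` against a reference computed element by element in pure Python. The helper name `make_offset_float64` is hypothetical, and this has not been confirmed to trigger the bug on i386; on a 64-bit machine the final check is expected to print `True`.

```python
import numpy as np


def make_offset_float64(values, offset=4):
    """Hypothetical helper: float64 array whose address is `offset` mod 8."""
    values = np.asarray(values, dtype=np.float64)
    # Over-allocate so there is room to shift the start of the data.
    raw = np.zeros(values.nbytes + offset + 8, dtype=np.uint8)
    # Skip to the next 8-byte boundary, then add the requested offset.
    start = (-raw.ctypes.data) % 8 + offset
    view = raw[start:start + values.nbytes].view(np.float64)
    view[:] = values
    return view


impurity = make_offset_float64([0.4691358, 0.0, 0.22222222, 0.0, 0.0])
# Reference negation done in pure Python, bypassing numpy's negative loop.
expected = np.array([-v for v in impurity.tolist()])

print("address % 8  :", impurity.ctypes.data % 8)
print("flags.aligned:", impurity.flags.aligned)
print("-impurity    :", -impurity)
print("matches reference:", np.array_equal(-impurity, expected))
```

If the last line prints `False` (or the negated array contains NaN) in the i386 container, that would point at numpy's handling of the offset array rather than at anything specific to scikit-learn.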

Reproduce the code example:

docker command

docker build --progress plain --platform i386 .

Dockerfile

FROM docker.io/debian:sid-slim

RUN apt-get update && apt-get install -y --no-install-recommends python3-sklearn
RUN python3 -c '\
from sklearn.tree import DecisionTreeClassifier; \
\
X = [[-2, -1], [-1, -1], [-1, -2], [1, 1], [1, 2], [2, 1]]; \
y2 = [[-1, 1], [-1, 1], [-1, 1], [1, 2], [1, 2], [1, 3]]; \
w = [1, 1, 1, 0.5, 0.5, 0.5]; \
\
clf = DecisionTreeClassifier(max_depth=2, min_samples_split=2, criterion="gini", random_state=2); \
clf = clf.fit(X, y2, sample_weight=w); \
impurity = clf.tree_.impurity; \
print("impurity :", impurity); \
print("-impurity:", -impurity); \
'

Output

impurity : [0.4691358  0.         0.22222222 0.         0.        ]
-impurity: [-4.69135802e-001 -1.59149684e-314 -1.50000000e+000 -2.12199579e-314 nan]

You can tell from the second line that the result is garbage; in particular, it contains a NaN (which I think is what actually allowed us to notice this in scikit-learn).

Python and NumPy Versions:

1.26.4
3.11.9 (main, Apr 10 2024, 13:16:36) [GCC 13.2.0]

Runtime Environment:

[{'numpy_version': '1.26.4',
  'python': '3.11.9 (main, Apr 10 2024, 13:16:36) [GCC 13.2.0]',
  'uname': uname_result(system='Linux', node='buildkitsandbox', release='6.9.5-arch1-1', version='#1 SMP PREEMPT_DYNAMIC Sun, 16 Jun 2024 19:06:37 +0000', machine='x86_64')},
{'simd_extensions': {'baseline': ['SSE', 'SSE2'],
                     'found': ['SSE3',
                                'SSSE3',
                                'SSE41',
                                'POPCNT',
                                'SSE42',
                                'AVX',
                                'F16C',
                                'FMA3',
                                'AVX2'],
                      'not_found': ['AVX512F',
                                    'AVX512CD',
                                    'AVX512_KNL',
                                    'AVX512_KNM',
                                    'AVX512_SKX',
                                    'AVX512_CLX',
                                    'AVX512_CNL',
                                    'AVX512_ICL',
                                    'AVX512_SPR']}}]

Context for the issue:

I haven't taken the time to put together a stand-alone snippet using only numpy; let me know if that would help and I can try to do this. I am also guessing that this reproduces on NumPy main, but I haven't tried ...

Labels

00 - Bug
component: SIMD (Issues in SIMD (fast instruction sets) code or machinery)
