Description
Describe the issue:
This was originally noticed in scikit-learn and reported in scikit-learn/scikit-learn#27506.
According to scikit-learn/scikit-learn#27506 (comment), this was noticed when Debian moved to numpy 1.26, which introduced SIMD, and the array is also not on a 64-bit boundary.
I don't know much about the SIMD intricacies, but ideally it would be good to understand whether the scikit-learn Cython code is to blame for not creating the array on a 64-bit boundary (should it be 64-bit on a 32-bit OS, by the way?) or whether numpy should be able to deal with unaligned arrays better.
scikit-learn/scikit-learn#27506 (comment) gives the way to reproduce shown here. I am guessing there is a way to reproduce without scikit-learn, for example with a memmap and an offset (we had unaligned memory array issues in joblib a while ago, where I learned this the hard way; see joblib/joblib#563 if you are really curious).
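A numpy-only attempt could start from something like the sketch below, which places float64 data at an address that is 4 bytes past an 8-byte boundary. This is untested on i386 and it is only an assumption that such an array triggers the same behaviour; the helper name `make_offset_float64` is made up for illustration.

```python
import numpy as np


def make_offset_float64(values, byte_offset=4):
    """Return a float64 array whose data address is `byte_offset` mod 8.

    Hypothetical helper: it over-allocates a byte buffer and takes a
    view starting at the first address with the requested remainder.
    """
    values = np.asarray(values, dtype=np.float64)
    # 8 spare bytes are enough to shift the start by 0..7 bytes.
    raw = np.zeros(values.nbytes + 8, dtype=np.uint8)
    start = (byte_offset - raw.ctypes.data) % 8
    view = raw[start:start + values.nbytes].view(np.float64)
    view[:] = values  # copy the data into the offset view
    return view


arr = make_offset_float64([0.4691358, 0.0, 0.22222222, 0.0, 0.0])
print("address % 8:", arr.ctypes.data % 8)
print("arr :", arr)
print("-arr:", -arr)
```

On a 64-bit build I would expect `-arr` to be correct; the interesting case is running this inside the i386 container.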
Reproduce the code example:
docker command
docker build --progress plain --platform i386 .
Dockerfile
FROM docker.io/debian:sid-slim
RUN apt-get update && apt-get install -y --no-install-recommends python3-sklearn
RUN python3 -c '\
from sklearn.tree import DecisionTreeClassifier; \
\
X = [[-2, -1], [-1, -1], [-1, -2], [1, 1], [1, 2], [2, 1]]; \
y2 = [[-1, 1], [-1, 1], [-1, 1], [1, 2], [1, 2], [1, 3]]; \
w = [1, 1, 1, 0.5, 0.5, 0.5]; \
\
clf = DecisionTreeClassifier(max_depth=2, min_samples_split=2, criterion="gini", random_state=2); \
clf = clf.fit(X, y2, sample_weight=w); \
impurity = clf.tree_.impurity; \
print("impurity :", impurity); \
print("-impurity:", -impurity); \
'
Output
impurity : [0.4691358 0. 0.22222222 0. 0. ]
-impurity: [-4.69135802e-001 -1.59149684e-314 -1.50000000e+000 -2.12199579e-314 nan]
You can tell that the result on the second line is garbage; in particular, it contains a NaN (which, I think, is what actually allowed us to notice this in scikit-learn).
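To check whether the garbage correlates with where the array sits in memory, a small diagnostic along these lines might help. It is only a sketch: `alignment_report` is a hypothetical helper, and it assumes that a fresh copy made by numpy lands on a suitably aligned address.

```python
import numpy as np


def alignment_report(arr):
    """Summarise the memory placement of `arr` and compare its negation
    against the negation of a freshly allocated copy (hypothetical helper)."""
    addr = arr.ctypes.data
    fresh = np.array(arr, copy=True)  # new allocation, assumed well aligned
    return {
        "address_mod_8": addr % 8,
        "address_mod_16": addr % 16,
        "aligned_flag": bool(arr.flags.aligned),
        "c_contiguous": bool(arr.flags.c_contiguous),
        "negation_matches_copy": bool(
            np.array_equal(-arr, -fresh, equal_nan=True)
        ),
    }


# Example with an ordinary array; in the reproducer the argument would be
# clf.tree_.impurity instead.
print(alignment_report(np.arange(5, dtype=np.float64)))
```

If `negation_matches_copy` comes back false for `clf.tree_.impurity` in the i386 container, the address remainders in the report would show which boundary the buffer is on.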
Python and NumPy Versions:
1.26.4
3.11.9 (main, Apr 10 2024, 13:16:36) [GCC 13.2.0]
Runtime Environment:
[{'numpy_version': '1.26.4',
'python': '3.11.9 (main, Apr 10 2024, 13:16:36) [GCC 13.2.0]',
'uname': uname_result(system='Linux', node='buildkitsandbox', release='6.9.5-arch1-1', version='#1 SMP PREEMPT_DYNAMIC Sun, 16 Jun 2024 19:06:37 +0000', machine='x86_64')},
{'simd_extensions': {'baseline': ['SSE', 'SSE2'],
'found': ['SSE3',
'SSSE3',
'SSE41',
'POPCNT',
'SSE42',
'AVX',
'F16C',
'FMA3',
'AVX2'],
'not_found': ['AVX512F',
'AVX512CD',
'AVX512_KNL',
'AVX512_KNM',
'AVX512_SKX',
'AVX512_CLX',
'AVX512_CNL',
'AVX512_ICL',
'AVX512_SPR']}}]
Context for the issue:
I haven't taken the time to put together a stand-alone snippet that uses only numpy; let me know if that would help and I can try to do this. I am also guessing that this reproduces on numpy main, but I haven't tried.