Skip to content

CI Fix scipy-dev build#28326

Merged
glemaitre merged 9 commits intoscikit-learn:mainfrom
lesteve:scipy-dev
Feb 1, 2024
Merged

CI Fix scipy-dev build#28326
glemaitre merged 9 commits intoscikit-learn:mainfrom
lesteve:scipy-dev

Conversation

@lesteve
Copy link
Member

@lesteve lesteve commented Jan 31, 2024

This fixes a DeprecationWarning in Numpy 2.0 a -> S. There may be other errors since scipy-dev has not been run without errors for a while ...

Close #28194.

@github-actions
Copy link

github-actions bot commented Jan 31, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 02cadbf. Link to the linter CI: here

@lesteve
Copy link
Member Author

lesteve commented Feb 1, 2024

So the hdbscan one seems recent since it was not in my previous commit, this seems like a numpy change that somehow changes the HDBSCan labels, to be investigated:

numpy dev

n_clusters=3
labels={np.int64(0), np.int64(1), np.int64(2), np.int64(-1)}
centers=[(0.0, 0.0), (3.0, 3.0)]
hdb.centroids_=array([[ 3.00974728,  2.99780391],
       [-0.05690994,  0.00368199],
       [ 1.0671027 , -0.61185795]])

numpy release

n_clusters=2
labels={0, 1}
centers=[(0.0, 0.0), (3.0, 3.0)]
hdb.centroids_=array([[-0.04358032, -0.0044484 ],
       [ 3.00974728,  2.99780391]])

@lesteve
Copy link
Member Author

lesteve commented Feb 1, 2024

So It looks like np.argsort has slightly changed in numpy dev (likely the optimization in numpy/numpy#25610). This changes the centroids in HDBSCAN. Adding more samples makes the test more robust.

@lesteve
Copy link
Member Author

lesteve commented Feb 1, 2024

A snippet that shows the numpy dev change:

import hashlib
import numpy as np
hashlib.sha256(np.argsort(np.random.default_rng(0).integers(100, size=10_000)).tobytes()).hexdigest()

numpy 1.26.3

97d7de848fa6a7df48d2d0ec7608f0a772c64f8a4d9e33c2c754c5be6030beaf

numpy dev

dcc8d89375bc7463d539315c342a6a9b40493451116a531b1e97949cc6f44ce8

np.argsort use quicksort by default which is not stable and we use np.argsort it with default parameter in HDBSCAN so I guess there was no guarantee. The np.argsort optimization in numpy-dev is changing the output enough that HDBScan finds 3 centroids instead of 2.

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CI seems happy. Thanks @lesteve

"""
centers = [(0.0, 0.0), (3.0, 3.0)]
H, _ = make_blobs(n_samples=1000, random_state=0, centers=centers, cluster_std=0.5)
H, _ = make_blobs(n_samples=2000, random_state=0, centers=centers, cluster_std=0.5)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh wow, I thought this is more like, increasing from 10 samples to 100. But with 1000 samples and two dimensions, this should be quite stable, having it so unstable is concerning. I understand why this is though, but still.

cc @Micky774

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, this seems to be the most brittle piece. I'm not sure why, but it should really only be used for illustrative purposes anyways (with exceptions if you really know the shape of your data). Even then it still is too fickle for full comfort. I'd love to investigate in detail if I get the time for it 😭

@glemaitre glemaitre added this to the 1.4.1 milestone Feb 1, 2024
@glemaitre glemaitre added the To backport PR merged in master that need a backport to a release branch defined based on the milestone. label Feb 1, 2024
Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @lesteve for the comments. Merging since the CIs already passed before.

@glemaitre glemaitre enabled auto-merge (squash) February 1, 2024 15:45
@glemaitre glemaitre merged commit 2fa7575 into scikit-learn:main Feb 1, 2024
@lesteve lesteve deleted the scipy-dev branch February 1, 2024 16:29
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Feb 10, 2024
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Feb 13, 2024
glemaitre pushed a commit that referenced this pull request Feb 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Build / CI module:datasets No Changelog Needed To backport PR merged in master that need a backport to a release branch defined based on the milestone.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

⚠️ CI failed on Linux_Nightly.pylatest_pip_scipy_dev ⚠️

4 participants