Description
A fairly common pattern in the scikit-learn code base is to create intermediate arrays with dtype=X.dtype. That works well as long as X = check_array(X) was run with a float dtype requirement, in other words as long as X.dtype cannot be int.
When X.dtype is int, check_array(X) with default parameters will pass it through unchanged, and all the intermediate objects will then be of dtype int as well.
For instance, the following snippet (taken from the MiniBatchKMeans docstring) will happily run entirely in integer space, so that both sample_weight and cluster_centers_ end up with an int dtype:

import numpy as np
from sklearn.cluster import MiniBatchKMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 0], [4, 4],
              [4, 5], [0, 1], [2, 2],
              [3, 2], [5, 5], [1, -1]])

# manually fit on batches
kmeans = MiniBatchKMeans(n_clusters=2,
                         random_state=0,
                         batch_size=6)
kmeans.partial_fit(X[0:6, :])

Discovered as part of #14307.
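As a quick illustration of the check_array behaviour described above (a minimal sketch, not part of the original report; exact printed dtypes may vary by platform):

import numpy as np
from sklearn.utils import check_array

X_int = np.array([[1, 2], [3, 4]])

# default parameters (dtype="numeric") pass the int array through unchanged
print(check_array(X_int).dtype)                                   # int64

# requesting a float dtype forces the conversion up front
print(check_array(X_int, dtype=[np.float64, np.float32]).dtype)   # float64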
Another point: in linear models, for example, check_array(..., dtype=[np.float64, np.float32]) is only run when check_input=True (the default). This means that with check_input=False the estimator may end up building a linear model whose coefficients are an array of integers, and unless something fails due to a dtype mismatch the user will never know. I think we should always check that X.dtype is float, even when check_input=False: the point of that flag is to avoid expensive validation, and a dtype check costs essentially nothing.
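A minimal sketch of the kind of cheap guard this suggests (the helper name is hypothetical, not existing scikit-learn API):

import numpy as np

def _ensure_float_dtype(X):
    # Hypothetical helper: the dtype check itself is O(1), so it could run
    # even when check_input=False; only the conversion (when actually
    # needed) incurs a copy, and it prevents silently fitting on ints.
    if X.dtype not in (np.float64, np.float32):
        X = X.astype(np.float64)
    return X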
In general, any code that uses X.dtype to create intermediate arrays should be reviewed to make sure the dtype is guaranteed to be float (or that ints would be acceptable); see the illustration below.
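For illustration, a minimal example (not taken from the code base) of how the dtype=X.dtype pattern silently truncates intermediate results when X is int:

import numpy as np

X = np.array([[1, 2], [2, 3]])                        # int64 input
centers = np.zeros((1, X.shape[1]), dtype=X.dtype)    # inherits int64
centers[0] = X.mean(axis=0)                           # [1.5, 2.5] truncated to [1, 2]
print(centers)                                        # [[1 2]]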
Might be a good sprint issue, not sure.