ENH Enables array_api for LinearDiscriminantAnalysis #102
Conversation
thomasjpfan left a comment:
I left some comments highlighting decision points where this PR differs from #99.
```python
if is_array_api:
    for i in range(classes.shape[0]):
        means[i, :] = np.mean(X[y == i], axis=0)
else:
    cnt = np.bincount(y)
    np.add.at(means, y, X)
    means /= cnt[:, None]
```
Since the array API does not have np.add.at, we use a for loop to do the same computation.
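For context, np.add.at exists because plain fancy-indexed assignment is buffered and drops repeated indices, so the scatter-add cannot simply be spelled as means[y] += X. A minimal demonstration:

```python
import numpy as np

y = np.array([0, 0, 1])
X = np.array([[1.0], [2.0], [3.0]])

means = np.zeros((2, 1))
np.add.at(means, y, X)  # unbuffered: repeated index 0 accumulates -> [[3.], [3.]]

means2 = np.zeros((2, 1))
means2[y] += X          # buffered: only the last write to index 0 survives -> [[2.], [3.]]
```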
```python
if is_array_api:
    svd = np.linalg.svd
else:
    svd = scipy.linalg.svd
```
This one is a bit interesting. Since I already have a wrapper for NumPy (_NumPyApiWrapper), it could make sense to make np.scipy == scipy, i.e. expose SciPy on the wrapper, to avoid this conditional.
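A minimal sketch of that idea; the scipy attribute here is hypothetical, not something the PR actually adds:

```python
import numpy
import scipy.linalg


class _NumPyApiWrapper:
    """Sketch: wrap NumPy and expose SciPy through the same namespace,
    so call sites can spell np.scipy.linalg.svd without a branch."""

    scipy = scipy  # hypothetical attribute: np.scipy is real scipy here

    def __getattr__(self, name):
        return getattr(numpy, name)
```

The array API wrapper would then need a matching scipy attribute (pointing at its own linalg, say), which is exactly the trade-off being weighed against the explicit conditional.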
sklearn/utils/_array_api.py (outdated)
```python
def astype(self, x, dtype, *args, **kwargs):
    # astype is not defined in the top-level numpy namespace
    return x.astype(dtype, *args, **kwargs)
```
I wrap NumPy because I want np.astype to exist. In a sense, this makes NumPy more like the array API. The alternative is to add astype to the array_api.Array object, but that involves patching the object.
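With that choice, the call site is uniform across backends; a sketch using the PR's get_namespace helper, assuming both wrappers expose astype:

```python
from sklearn.utils._array_api import get_namespace  # the PR's helper

xp, is_array_api = get_namespace(X)
X = xp.astype(X, xp.float64)  # same spelling for wrapped NumPy and array API inputs
```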
```python
# array API path:
unique_ys = np.concatenate([_unique_labels(y) for y in ys])
return np.unique(unique_ys)

# existing set-based path:
ys_labels = set(chain.from_iterable((i for i in _unique_labels(y)) for y in ys))
```
The array API does not go down this code path, because its arrays are not hashable. Still, using set + arrays feels like an anti-pattern.
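For illustration, with the experimental numpy.array_api namespace (NumPy 1.22+), where Array defines == elementwise and is therefore unhashable:

```python
import numpy.array_api as xp

a = xp.asarray([0, 1])
{a}  # raises TypeError: unhashable type: 'Array'
```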
sklearn/utils/_array_api.py (outdated)
```python
def concatenate(self, arrays, *, axis=0, **kwargs):
    # ignore parameters that are not supported by the array API
    f = self._array_namespace.concat
    return f(arrays, axis=axis)
```
It's either this or what we see in https://github.com/scipy/scipy/pull/15395/files:

```python
def _concatenate(arrays, axis):
    xp = _get_namespace(*arrays)
    if xp is np:
        return xp.concatenate(arrays, axis=axis)
    else:
        return xp.concat(arrays, axis=axis)
```

and importing _concatenate where needed.
sklearn/utils/_array_api.py (outdated)
```python
@property
def VisibleDeprecationWarning(self):
    return DeprecationWarning
```
I'm still unsure how I feel about this workaround.
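For context, a sketch of the call-site pattern this property enables; the surrounding code is illustrative, not the PR's actual call site:

```python
import warnings

from sklearn.utils._array_api import get_namespace  # the PR's helper

xp, is_array_api = get_namespace(y)
with warnings.catch_warnings():
    # NumPy emits VisibleDeprecationWarning for ragged inputs; other
    # namespaces have no such class, so the wrapper substitutes
    # DeprecationWarning and this line stays backend-agnostic.
    warnings.simplefilter("error", xp.VisibleDeprecationWarning)
    y = xp.asarray(y)
```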
```python
warnings.warn("The priors do not sum to 1. Renormalizing", UserWarning)
self.priors_ = self.priors_ / self.priors_.sum()

# TODO: implement isclose in wrapper?
```
We need to implement our own isclose.
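A minimal sketch of what that could look like, using only operations in the standard (no NaN/inf special-casing, unlike numpy.isclose):

```python
def _isclose(xp, a, b, rtol=1e-05, atol=1e-08):
    # Mirrors numpy.isclose's formula |a - b| <= atol + rtol * |b|
    # with array API operations only.
    return xp.abs(a - b) <= (atol + rtol * xp.abs(b))
```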
sklearn/discriminant_analysis.py (outdated)
```python
    Class means.
    """
    np, is_array_api = get_namespace(X)
    classes, y = np.unique(y, return_inverse=True)
```
unique is not in the array API (this would be unique_inverse).
Oh I see, you are using a wrapper below. IMO it would be better to follow the array API interface in the wrapper and adapt non-compatible NumPy conventions to match it, not the other way around.
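A sketch of that direction: give the NumPy wrapper the array API's unique_* interface rather than giving the array API wrapper a NumPy-style unique (the method body here is illustrative):

```python
from collections import namedtuple

import numpy

# The spec's unique_inverse returns a (values, inverse_indices) pair.
UniqueInverseResult = namedtuple("UniqueInverseResult", ["values", "inverse_indices"])


class _NumPyApiWrapper:
    def unique_inverse(self, x):
        # Map the array API spelling onto NumPy's np.unique.
        values, inverse = numpy.unique(x, return_inverse=True)
        return UniqueInverseResult(values, inverse)
```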
Yeah, going the other direction also works. I'll try it the other way around and see how it compares.
My guess is that it's fine.
sklearn/discriminant_analysis.py (outdated)
```python
if is_array_api:
    for i in range(classes.shape[0]):
        means[i, :] = np.mean(X[y == i], axis=0)
```
Should this be +=?
Using mean here avoids needing to divide by the count. (In the end, I think it's the same computation. It's basically a groupby + mean aggregation.)
Oh I see, you moved the aggregation from add.at into the mean calculation.

I think you could get rid of the loop with something like

```python
np.sum(np.where(y[None] == np.arange(classes.shape[0])[:, None], X, np.asarray(0.)), axis=1) / cnt
```

except None indexing isn't in the spec yet (data-apis/array-api#360), so you'd have to use expand_dims. That's still not as "efficient" as the original, because you are adding a lot of redundant zeros in the sum; depending on how many classes there typically are, the loop version might be better anyway (at least in the sense that it's more readable). Another thing to note is that not all array API modules are guaranteed to have boolean indexing (https://data-apis.org/array-api/latest/API_specification/indexing.html#boolean-array-indexing).

Also, I think cnt can be obtained from unique_counts or unique_all in the array API (return_counts in NumPy).
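Putting those pieces together, a hedged sketch of the loop-free version with expand_dims in place of None-indexing; note the trailing axes, which the one-liner above glosses over but which are needed for the broadcasts to line up:

```python
def _class_means_vectorized(xp, X, y, n_classes):
    # mask[c, i] is True when sample i belongs to class c -> shape (k, n)
    mask = xp.expand_dims(y, axis=0) == xp.expand_dims(xp.arange(n_classes), axis=1)
    # Broadcast (k, n, 1) against (n, d) -> (k, n, d), zeroing other classes' rows.
    masked = xp.where(xp.expand_dims(mask, axis=-1), X, xp.asarray(0.0))
    # Per-class counts from the mask itself, avoiding unique_counts and its
    # data-dependent output shape.
    cnt = xp.sum(xp.astype(mask, X.dtype), axis=1)  # shape (k,)
    return xp.sum(masked, axis=1) / xp.expand_dims(cnt, axis=-1)  # shape (k, d)
```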
> Another thing to note is that not all array API modules are guaranteed to have boolean indexing
Thanks for the information! Some functionality would be hard to reproduce without boolean indexing. Looking into "Data-dependent output shape" more, I see that unique_all may also not work. This is a significant barrier for scikit-learn. Pragmatically, I would restrict the first version of array API support in scikit-learn to array modules that support "Data-dependent output shape".

(I wanted to use nonzero to go from boolean index -> integer index -> take, but that also looks to be under "Data-dependent output shape".)
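One way such a restriction could eventually be expressed: a later revision of the standard (2023.12) added an inspection API for exactly this question. A hedged sketch, since nothing like it was available to this PR:

```python
def _supports_data_dependent_shapes(xp):
    # The 2023.12 array API revision added __array_namespace_info__();
    # namespaces predating it won't have the dunder, hence the
    # conservative fallback. Helper name is hypothetical.
    try:
        return xp.__array_namespace_info__().capabilities()["data-dependent shapes"]
    except AttributeError:
        return False
```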
Oh yeah, of course unique wouldn't work in such APIs either, so there's no point in worrying about it here.
Another example of using array_api with scikit-learn's LinearDiscriminantAnalysis: there is a good 14x speed-up when using CuPy compared to NumPy, as seen in this gist.
Most workarounds are similar to the ones in #99.
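For reference, a sketch of the shape of such a comparison; the dataset sizes are illustrative and not taken from the gist, and the CuPy half is left commented since it needs a GPU and this PR's dispatch:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.standard_normal((100_000, 100), dtype=np.float32)
y = rng.integers(0, 5, size=100_000)

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)  # NumPy path

# With this PR, CuPy inputs would route through the array API wrapper
# and run on the GPU (the 14x figure above comes from the gist):
# import cupy as cp
# lda.fit(cp.asarray(X), cp.asarray(y))
```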