ENH Enables array_api for LinearDiscriminantAnalysis #102
Conversation
thomasjpfan left a comment:
I left some comments highlighting decision points where this PR differs from #99.
```python
if is_array_api:
    for i in range(classes.shape[0]):
        means[i, :] = np.mean(X[y == i], axis=0)
else:
    cnt = np.bincount(y)
    np.add.at(means, y, X)
    means /= cnt[:, None]
```
Since the array API does not have np.add.at, we use a for loop to do the same computation.
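For context, np.add.at exists because plain fancy-indexed assignment is buffered and drops repeated indices, so the scatter-add cannot simply be spelled as means[y] += X. A minimal demonstration:

```python
import numpy as np

y = np.array([0, 0, 1])
X = np.array([[1.0], [2.0], [3.0]])

means = np.zeros((2, 1))
np.add.at(means, y, X)  # unbuffered: repeated index 0 accumulates -> [[3.], [3.]]

means2 = np.zeros((2, 1))
means2[y] += X          # buffered: only the last write to index 0 survives -> [[2.], [3.]]
```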
```python
if is_array_api:
    svd = np.linalg.svd
else:
    svd = scipy.linalg.svd
```
This one is a bit interesting. Since I already have a wrapper for NumPy (_NumPyApiWrapper), it could make sense to make np.scipy == scipy, i.e. expose SciPy on the wrapper, to avoid this conditional.
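A minimal sketch of that idea; the scipy attribute here is hypothetical, not something the PR actually adds:

```python
import numpy
import scipy.linalg


class _NumPyApiWrapper:
    """Sketch: wrap NumPy and expose SciPy through the same namespace,
    so call sites can spell np.scipy.linalg.svd without a branch."""

    scipy = scipy  # hypothetical attribute: np.scipy is real scipy here

    def __getattr__(self, name):
        return getattr(numpy, name)
```

The array API wrapper would then need a matching scipy attribute (pointing at its own linalg, say), which is exactly the trade-off being weighed against the explicit conditional.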
sklearn/utils/_array_api.py (outdated)
```python
def astype(self, x, dtype, *args, **kwargs):
    # astype is not defined in the top-level numpy namespace
    return x.astype(dtype, *args, **kwargs)
```
I wrap NumPy because I want np.astype to exist. In a sense, this makes NumPy more like the array API. The alternative is to add astype to the array_api.Array object, but that involves patching the object.
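With that choice, the call site is uniform across backends; a sketch using the PR's get_namespace helper, assuming both wrappers expose astype:

```python
from sklearn.utils._array_api import get_namespace  # the PR's helper

xp, is_array_api = get_namespace(X)
X = xp.astype(X, xp.float64)  # same spelling for wrapped NumPy and array API inputs
```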
```python
# array API path:
unique_ys = np.concatenate([_unique_labels(y) for y in ys])
return np.unique(unique_ys)

# existing set-based path:
ys_labels = set(chain.from_iterable((i for i in _unique_labels(y)) for y in ys))
```
The array API does not go down this code path, because its arrays are not hashable. Still, using set + arrays feels like an anti-pattern.
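For illustration, with the experimental numpy.array_api namespace (NumPy 1.22+), where Array defines == elementwise and is therefore unhashable:

```python
import numpy.array_api as xp

a = xp.asarray([0, 1])
{a}  # raises TypeError: unhashable type: 'Array'
```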
sklearn/utils/_array_api.py (outdated)
```python
def concatenate(self, arrays, *, axis=0, **kwargs):
    # ignore parameters that are not supported by the array API
    f = self._array_namespace.concat
    return f(arrays, axis=axis)
```
It's either this or what we see in https://github.com/scipy/scipy/pull/15395/files:

```python
def _concatenate(arrays, axis):
    xp = _get_namespace(*arrays)
    if xp is np:
        return xp.concatenate(arrays, axis=axis)
    else:
        return xp.concat(arrays, axis=axis)
```

and importing _concatenate where needed.
sklearn/utils/_array_api.py (outdated)
```python
@property
def VisibleDeprecationWarning(self):
    return DeprecationWarning
```
I'm still unsure how I feel about this workaround.
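For context, a sketch of the call-site pattern this property enables; the surrounding code is illustrative, not the PR's actual call site:

```python
import warnings

from sklearn.utils._array_api import get_namespace  # the PR's helper

xp, is_array_api = get_namespace(y)
with warnings.catch_warnings():
    # NumPy emits VisibleDeprecationWarning for ragged inputs; other
    # namespaces have no such class, so the wrapper substitutes
    # DeprecationWarning and this line stays backend-agnostic.
    warnings.simplefilter("error", xp.VisibleDeprecationWarning)
    y = xp.asarray(y)
```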
```python
warnings.warn("The priors do not sum to 1. Renormalizing", UserWarning)
self.priors_ = self.priors_ / self.priors_.sum()

# TODO: implement isclose in wrapper?
```
We need to implement our own isclose.
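A minimal sketch of what that could look like, using only operations in the standard (no NaN/inf special-casing, unlike numpy.isclose):

```python
def _isclose(xp, a, b, rtol=1e-05, atol=1e-08):
    # Mirrors numpy.isclose's formula |a - b| <= atol + rtol * |b|
    # with array API operations only.
    return xp.abs(a - b) <= (atol + rtol * xp.abs(b))
```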
sklearn/discriminant_analysis.py (outdated)
```python
    Class means.
    """
    np, is_array_api = get_namespace(X)
    classes, y = np.unique(y, return_inverse=True)
```
unique is not in the array API (this would be unique_inverse).
Oh I see, you are using a wrapper below. IMO it would be better to follow the array API interface in the wrapper and adapt non-compatible NumPy conventions to match it, not the other way around.
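A sketch of that direction: give the NumPy wrapper the array API's unique_* interface rather than giving the array API wrapper a NumPy-style unique (the method body here is illustrative):

```python
from collections import namedtuple

import numpy

# The spec's unique_inverse returns a (values, inverse_indices) pair.
UniqueInverseResult = namedtuple("UniqueInverseResult", ["values", "inverse_indices"])


class _NumPyApiWrapper:
    def unique_inverse(self, x):
        # Map the array API spelling onto NumPy's np.unique.
        values, inverse = numpy.unique(x, return_inverse=True)
        return UniqueInverseResult(values, inverse)
```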
Yeah, going the other direction also works. I'll try it the other way around and see how it compares.
My guess is that it's fine.
sklearn/discriminant_analysis.py (outdated)
```python
if is_array_api:
    for i in range(classes.shape[0]):
        means[i, :] = np.mean(X[y == i], axis=0)
```
Should this be +=?
Using mean here avoids needing to divide by the count. (In the end, I think it's the same computation. It's basically a groupby + mean aggregation.)
Oh I see, you moved the aggregation from add.at into the mean calculation.

I think you could get rid of the loop with something like

```python
np.sum(np.where(y[None] == np.arange(classes.shape[0])[:, None], X, np.asarray(0.)), axis=1) / cnt
```

except None indexing isn't in the spec yet (data-apis/array-api#360), so you'd have to use expand_dims. That's still not as "efficient" as the original, because you are adding a lot of redundant zeros in the sum; depending on how many classes there typically are, the loop version might be better anyway (at least in the sense that it's more readable). Another thing to note is that not all array API modules are guaranteed to have boolean indexing (https://data-apis.org/array-api/latest/API_specification/indexing.html#boolean-array-indexing).

Also, I think cnt can be obtained from unique_counts or unique_all in the array API (return_counts in NumPy).
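Putting those pieces together, a hedged sketch of the loop-free version with expand_dims in place of None-indexing; note the trailing axes, which the one-liner above glosses over but which are needed for the broadcasts to line up:

```python
def _class_means_vectorized(xp, X, y, n_classes):
    # mask[c, i] is True when sample i belongs to class c -> shape (k, n)
    mask = xp.expand_dims(y, axis=0) == xp.expand_dims(xp.arange(n_classes), axis=1)
    # Broadcast (k, n, 1) against (n, d) -> (k, n, d), zeroing other classes' rows.
    masked = xp.where(xp.expand_dims(mask, axis=-1), X, xp.asarray(0.0))
    # Per-class counts from the mask itself, avoiding unique_counts and its
    # data-dependent output shape.
    cnt = xp.sum(xp.astype(mask, X.dtype), axis=1)  # shape (k,)
    return xp.sum(masked, axis=1) / xp.expand_dims(cnt, axis=-1)  # shape (k, d)
```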
> Another thing to note is that not all array API modules are guaranteed to have boolean indexing
Thanks for the information! Some functionality would be hard to reproduce without boolean indexing. Looking into "Data-dependent output shape" more, I see that unique_all may also not work. This is a significant barrier for scikit-learn. Pragmatically, I would restrict the first version of array API support in scikit-learn to array modules that support "Data-dependent output shape".

(I wanted to use nonzero to go from boolean index -> integer index -> take, but that also looks to be under "Data-dependent output shape".)
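One way such a restriction could eventually be expressed: a later revision of the standard (2023.12) added an inspection API for exactly this question. A hedged sketch, since nothing like it was available to this PR:

```python
def _supports_data_dependent_shapes(xp):
    # The 2023.12 array API revision added __array_namespace_info__();
    # namespaces predating it won't have the dunder, hence the
    # conservative fallback. Helper name is hypothetical.
    try:
        return xp.__array_namespace_info__().capabilities()["data-dependent shapes"]
    except AttributeError:
        return False
```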
Oh yeah, of course unique wouldn't work in such APIs either, so there's no point in worrying about it here.
Another example of using array_api with scikit-learn's LinearDiscriminantAnalysis: there is a good 14x speed-up when using CuPy compared to NumPy, as seen in this gist.
Most workarounds are similar to the ones in #99.
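For reference, a sketch of the shape of such a comparison; the dataset sizes are illustrative and not taken from the gist, and the CuPy half is left commented since it needs a GPU and this PR's dispatch:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.standard_normal((100_000, 100), dtype=np.float32)
y = rng.integers(0, 5, size=100_000)

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)  # NumPy path

# With this PR, CuPy inputs would route through the array API wrapper
# and run on the GPU (the 14x figure above comes from the gist):
# import cupy as cp
# lda.fit(cp.asarray(X), cp.asarray(y))
```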