
Conversation

@dalmia (Contributor) commented Dec 21, 2016

Reference Issue

Fixes #8057

What does this implement/fix? Explain your changes.

Added support for sparse multilabel y for nearest neighbor classifiers. First, fit checks whether the input y is sparse and multilabel and, if so, converts it to a dense array for storage. It also stores a flag indicating whether the original input was sparse and multilabel. Then, in predict, if this flag is set, y_pred is converted to sparse CSC.
Tests covering this behavior are also added.

@jnothman (Member) left a comment:

The point of a sparse data structure is that, if the data is truly sparse, it takes much less memory than the dense equivalent. Imagine we have very many outputs in a multilabel problem. Storing the n_outputs x n_samples matrix densely is wasteful. Here you use dense versions as well as sparse versions, so this solution is not helpful at all.
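The memory argument above is easy to see with a toy example (the sizes and density here are illustrative only):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical multilabel target: 1000 samples, 500 outputs, ~1% positive labels.
rng = np.random.RandomState(0)
dense_y = (rng.rand(1000, 500) < 0.01).astype(np.int64)
sparse_y = csr_matrix(dense_y)

dense_bytes = dense_y.nbytes
sparse_bytes = (sparse_y.data.nbytes
                + sparse_y.indices.nbytes
                + sparse_y.indptr.nbytes)
print(dense_bytes, sparse_bytes)  # the sparse form is dramatically smaller
```

Keeping a dense copy alongside the sparse one, as the first commit did, forfeits exactly this saving.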


self._issparse = issparse(y)
if(issparse(y) and self.outputs_2d_):
y = y.toarray()
Member:

we should not make a sparse array dense unnecessarily.

for (pl, w)
in zip(pred_labels[inliers], weights[inliers])],
in zip(pred_labels[inliers],
weights[inliers])],
Member:

bad indentation

@dalmia (Contributor Author) commented Dec 21, 2016

@jnothman Thank you for giving me such a clear explanation. I'll think of another way to implement it.

@dalmia (Contributor Author) commented Dec 21, 2016

When a sparse multilabel y is passed to fit:

classes, self._y[:, k] = np.unique(y[:, k], return_inverse=True)

The above line creates a lot of problems:

>>> np.unique(y_train[:, 3], return_inverse=True)
(array([ <75x1 sparse matrix of type '<type 'numpy.int64'>'
 	with 36 stored elements in Compressed Sparse Row format>], dtype=object),
 array([0]))

That line assigns classes the sparse column object itself, so the values of y never get stored properly in self._y. For the sparse multilabel case, I am therefore trying to handle things separately.
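A minimal reproduction of the behavior described above (the toy y_train here is illustrative):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Illustrative sparse multilabel target.
y_train = csr_matrix(np.array([[0, 1], [1, 0], [1, 1]]))

col = y_train[:, 0]   # slicing yields a 3x1 sparse matrix, not a 1-D ndarray,
print(col.shape)      # (3, 1) -- which is why np.unique misbehaves on it

# np.unique behaves as intended only on a dense 1-D array:
classes, inverse = np.unique(col.toarray().ravel(), return_inverse=True)
print(classes, inverse)  # [0 1] [0 1 1]
```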

@dalmia (Contributor Author) commented Dec 21, 2016

Thought of something. It was only while implementing it that I clearly understood what you had mentioned when introducing the issue. Learned something new today :)

@dalmia (Contributor Author) commented Dec 21, 2016

Travis is showing an unexpected error; the tests pass locally.

aman@aman:/media/aman/BE66ECBA66EC7515/Open Source/scikit-learn$ nosetests sklearn/neighbors/tests/test_neighbors.py 
...............................................
----------------------------------------------------------------------
Ran 47 tests in 2.668s

OK

@jnothman (Member) commented Dec 21, 2016 via email

@jnothman (Member) commented Dec 21, 2016 via email

@dalmia (Contributor Author) commented Dec 22, 2016

@jnothman Yes, I had spent the whole of yesterday figuring out what was happening behind the scenes. It might have escaped your notice, but I have indeed made another commit fixing the issue. I am no longer converting to dense at any point, and I think it might just work. Please review.

@jnothman (Member):

Travis says you're constructing a matrix with data, indices, or indptr that is not 1-dimensional.

@jnothman (Member) left a comment:

What about RadiusNeighborsClassifier?

self.classes_.append(classes)
if self._issparse and self.outputs_2d_:
self._y = y
for k in range(self._y.shape[1]):
Member:

just self.classes_ = [np.array([0, 1], dtype=np.int)] * range(y.shape[1])?

Contributor Author:

I don't think range should be used here: multiplying a list by range raises a TypeError, so it should be multiplied by the integer y.shape[1] instead. Other than that, this is much better. Thanks.
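For illustration, the corrected one-liner multiplies the list by the integer column count rather than by range (n_outputs here is a stand-in for y.shape[1]):

```python
import numpy as np

n_outputs = 4  # hypothetical number of label columns, i.e. y.shape[1]
# Multiply the list by the integer itself, not by range(...):
classes_ = [np.array([0, 1], dtype=int)] * n_outputs
print(len(classes_))  # 4
```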

if self._issparse:
y_pred_sparse = []

for k, classes_k in enumerate(classes_):
Member:

I think you want two separate code-paths for this routine. Putting the condition inside the loop makes the code pretty messy.


if self._issparse and self.outputs_2d_:
if weights is None:
mode, _ = stats.mode(_y[neigh_ind, k].toarray(), axis=1)
Member:

It might be interesting to implement a sparse (weighted) mode, but this is fine for now. This requires O(n_samples x n_neighbors) memory, where sparsity is being used to avoid O(n_samples x n_outputs) memory.

@dalmia (Contributor Author) commented Dec 29, 2016

I thought that once we have a layout for KNeighborsClassifier set, we can extend it to RadiusNeighborsClassifier more gracefully.


check_classification_targets(y)

self._issparsemultilabel = issparse(y) and self.outputs_2d_
Member:

I don't get why this should be stored

Member:

Or perhaps we should store outputs_2d_ = 'sparse'

Contributor Author:

Yes, we could store it that way. It does save the need for another variable.
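A rough sketch of the flag idea under discussion; the class and method names here are hypothetical, not scikit-learn's actual code:

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse

class SketchClassifier:
    """Illustrative fragment only, showing the outputs_2d_ = 'sparse' idea."""

    def record_output_format(self, y, outputs_2d):
        # 'sparse' stays truthy, so existing `if self.outputs_2d_:` checks
        # keep working while the attribute also records the input format.
        self.outputs_2d_ = 'sparse' if (issparse(y) and outputs_2d) else outputs_2d

clf = SketchClassifier()
clf.record_output_format(csr_matrix(np.eye(3)), True)
print(clf.outputs_2d_)  # 'sparse'
```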

y_pred = y_pred.ravel()

if self._issparsemultilabel:
y_pred = hstack(y_pred_sparse_multilabel)
Member:

do this in the if/else above

mode = np.asarray(mode.ravel(), dtype=np.intp)
y_pred[:, k] = classes_k.take(mode)

if not self.outputs_2d_:
Member:

put this in the else clause

n_samples = X.shape[0]
weights = _get_weights(neigh_dist, self.weights)

y_pred = np.empty((n_samples, n_outputs), dtype=classes_[0].dtype)
Member:

you shouldn't be constructing this dense output array. this defies the purpose of using a sparse (low-memory) structure.

@jnothman jnothman changed the title Added support for sparse multilabel y for Nearest neighbor classifiers [WIP] Added support for sparse multilabel y for Nearest neighbor classifiers Dec 29, 2016
@dalmia (Contributor Author) commented Jan 4, 2017

The nosetests run fine locally. Could you please tell me why they differ here?

@jnothman (Member) commented Jan 4, 2017

From the look of which tests are failing (I haven't checked the log, as I'm on a slow connection), could it be that your code is not Python 2-friendly?

@dalmia (Contributor Author) commented Jan 4, 2017

That shouldn't be the case, as I ran my tests on Python 2.7.

@dalmia (Contributor Author) commented Jan 7, 2017

The problem here is related to indexing into sparse matrices. But I've been unable to reproduce the error, so it's a bit hard to debug.

if not self.outputs_2d_:
y_pred = y_pred.ravel()
if weights is None:
mode, _ = stats.mode(_y[neigh_ind, k].toarray(), axis=1)
Member:

could you please confirm what neigh_ind and k are exactly so that we can debug what's not working?

Contributor Author:

neigh_ind here is the list of indices of the kneighbors and k signifies the kth column of the y passed to fit.

@jnothman (Member) commented Jan 7, 2017 via email

@dalmia (Contributor Author) commented Jan 8, 2017

Yes, that is indeed what is creating these errors. I logged the values of neigh_ind appearing in the tests and found quite a few of them to be empty.
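A toy 1-D illustration of how a radius query can yield an empty neigh_ind, which the prediction code then has to tolerate (the data here is made up):

```python
import numpy as np

# Two training points, and a query with no neighbor within the radius.
X_train = np.array([0.0, 10.0])
query, radius = 5.0, 1.0

neigh_ind = np.where(np.abs(X_train - query) <= radius)[0]
print(neigh_ind)  # [] -- an empty index array
```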

@dalmia (Contributor Author) commented Jan 8, 2017

Is there any workaround you might suggest for getting it to work?

@dalmia (Contributor Author) commented Jan 9, 2017

Added an explicit failure for scipy < 0.13. Please review.

@jnothman (Member) left a comment:

I think there should be a way to avoid so much repetition in the code but still have it elegant... I'm just too tired to suggest something specific. :|

mode, _ = weighted_mode(_y[neigh_ind, k], weights, axis=1)
if self.outputs_2d_ == 'sparse':
if StrictVersion(version.version) < StrictVersion('0.13.0'):
raise EnvironmentError('Sparse multilabel y passed in fit. '
Member:

Do this in fit.

clf.fit(X_train, y_train)

if (name == 'KNeighborsClassifier' and
StrictVersion(version.version) < StrictVersion('0.13.0')):
Member:

StrictVersion(scipy.__version__) would be clearer

@dalmia (Contributor Author) commented Jan 9, 2017

I'll try to come up with something then.

@dalmia (Contributor Author) commented Jan 9, 2017

Moved the mode calculation into a helper function, which greatly reduces the repetitive code. Please have a look.

@jnothman (Member) left a comment:

yes, making _mode a helper is clearer.


if not self.outputs_2d_:
y_pred = y_pred.ravel()
y_pred = hstack(y_pred_sparse_multilabel)
Member:

I'd rather see this as sparse.hstack; thus, from scipy import sparse.
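For illustration, stacking per-output sparse prediction columns with sparse.hstack might look like this (the three identical columns are made up):

```python
import numpy as np
from scipy import sparse

# Three hypothetical per-output prediction columns, stacked into one matrix.
cols = [sparse.csc_matrix(np.array([0, 1, 0]).reshape(-1, 1)) for _ in range(3)]
y_pred = sparse.hstack(cols, format='csc')
print(y_pred.shape)  # (3, 3)
```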

y_pred_k[inliers] = classes_k.take(mode)

y_pred[inliers, k] = classes_k.take(mode)
if outliers:
Member:

Make this if outliers and self.outlier_label != 0

pred_labels = np.array([_y[ind, k].toarray()
for ind in neigh_ind[inliers]],
dtype=object)
y_pred_k = np.zeros(n_samples)
Member:

Use an integer dtype, please

dtype=object)
y_pred_k = np.zeros(n_samples)
mode = self._mode(pred_labels, weights, inliers)
y_pred_k[inliers] = classes_k.take(mode)
Member:

I think in the multilabel case, classes should be [0, 1]... I might be wrong.


if outliers:
y_pred[outliers, :] = self.outlier_label
y_pred_sparse_multilabel.append(csc_matrix(y_pred_k).T)
Member:

this doesn't make sense in terms of efficient data structures. I think you mean csr_matrix(y_pred_k).T, or you mean csc_matrix(y_pred_k.reshape(-1, 1))
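The two constructions suggested above produce the same column matrix; a quick check (y_pred_k here is an illustrative per-output prediction vector):

```python
import numpy as np
from scipy.sparse import csc_matrix, csr_matrix

y_pred_k = np.array([0, 1, 0, 1])  # hypothetical per-output predictions

a = csr_matrix(y_pred_k).T               # transpose of a CSR row vector
b = csc_matrix(y_pred_k.reshape(-1, 1))  # CSC built directly from a column

print(a.shape, b.shape)  # (4, 1) (4, 1)
print((a != b).nnz)      # 0 -- identical values either way
```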

for k, classes_k in enumerate(classes_):
mode = self._mode(_y[neigh_ind, k].toarray(), weights)
y_pred_sparse_multilabel.append(
csc_matrix(classes_k.take(mode)).T)
Member:

I think in the multilabel case, classes should be [0, 1]... I might be wrong.

Member:

this doesn't make sense in terms of efficient data structures. I think you mean csr_matrix(mode).T, or you mean csc_matrix(mode.reshape(-1, 1))

Contributor Author:

Oh yes, I missed that.

in zip(pred_labels, weights[inliers])],
dtype=np.int)

mode = mode.ravel()
Member:

why is this necessary?

@dalmia (Contributor Author), Jan 10, 2017:

The mode returned by the operation above has shape (len(inliers), 1).
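For illustration (keepdims is passed explicitly here because recent scipy versions changed stats.mode's default; older scipy always returned a 2-D result):

```python
import numpy as np
from scipy import stats

# Hypothetical neighbor labels: one row per query sample.
neigh_labels = np.array([[0, 1, 1],
                         [1, 1, 0]])

# With keepdims=True the mode comes back with shape (n_samples, 1),
# hence the ravel() in the code under review.
mode, _ = stats.mode(neigh_labels, axis=1, keepdims=True)
print(mode.shape)  # (2, 1)
mode = mode.ravel()
print(mode)        # [1 1]
```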

@dalmia (Contributor Author) commented Feb 6, 2017

ping @jnothman

@dalmia dalmia changed the title [WIP] Added support for sparse multilabel y for Nearest neighbor classifiers [MRG] Added support for sparse multilabel y for Nearest neighbor classifiers Feb 20, 2017
@niedakh (Contributor) commented May 20, 2017

I'd like to upvote merging this; I could use more of this support, as scikit-multilearn can feed sparse matrices to scikit-learn.

@jnothman (Member) left a comment:

Some very minor changes.

I suppose I'll see if someone else wants to wrap this up.

y_pred[:, k] = classes_k.take(mode)
for k, classes_k in enumerate(classes_):
mode = self._mode(_y[neigh_ind, k].toarray(), weights)
y_pred_sparse_multilabel.append(csr_matrix(mode).T)
Member:

I think it makes more sense to construct a CSC directly from the reshaped array.

self._y = self._y.ravel()

if issparse(y) and self.outputs_2d_:
if StrictVersion(scipy.__version__) < StrictVersion('0.13.0'):
Member:

We no longer support the broken scipy

'scipy < 0.13')
self.outputs_2d_ = 'sparse'
self._y = y
self.classes_ = [np.array([0, 1], dtype=np.int)] * y.shape[1]
Member:

Perhaps we should check that the data really doesn't contain other values.
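A cheap way to sketch that check on a sparse indicator matrix: only stored entries need inspecting, so it suffices to look at y.data (the toy y here is illustrative):

```python
import numpy as np
from scipy.sparse import csr_matrix

y = csr_matrix(np.array([[0, 1], [1, 0]]))
# A sparse multilabel indicator should hold no stored value other than 0 or 1
# (zeros are usually implicit); np.isin makes the check cheap on y.data alone.
is_indicator = np.isin(y.data, [0, 1]).all()
print(is_indicator)  # True
```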


if outliers:
y_pred[outliers, :] = self.outlier_label
y_pred_sparse_multilabel.append(csr_matrix(y_pred_k).T)
Member:

same here: construct CSC from reshaped array

y_sparse = csc_matrix(y)
X_train, X_test, y_train, y_test = train_test_split(X, y_sparse,
random_state=0)
if StrictVersion(scipy.__version__) < StrictVersion('0.13.0'):
Member:

Remove now

@jnothman jnothman mentioned this pull request Jun 12, 2017
@jnothman jnothman changed the title [MRG] Added support for sparse multilabel y for Nearest neighbor classifiers [MRG+1] Added support for sparse multilabel y for Nearest neighbor classifiers Jan 16, 2018
@dorcoh (Contributor) commented Feb 17, 2018

Hi @jnothman, can I finish this? I'd be glad if you could briefly describe the changes needed.

@jnothman (Member):

Thanks Dor

It could do with those small changes I requested above, but then it will also need a second review before merge. The code changes aren't supremely readable, so any improvements to code quality would help with a second review.

@dorcoh (Contributor) commented Feb 17, 2018

Hi, I just noticed there's another PR for that issue: #9059. Should this one be closed, then?

@jnothman jnothman closed this Feb 17, 2018
@jnothman (Member) commented Feb 18, 2018 via email

Successfully merging this pull request may close these issues.

Nearest neighbors classifiers should support sparse multilabel Y

5 participants