
Conversation


@massich massich commented Jun 8, 2017

Reference Issue

Fixes #8057

What does this implement/fix? Explain your changes.

Added support for sparse multilabel y for nearest neighbor classifiers. First, fit checks whether its input is sparse and multilabel, and converts it to a dense array for storage. It also stores a flag indicating whether the original input was sparse and multilabel. Then, in predict, if this stored flag is true, y_pred is converted to sparse CSC.
Also added tests for the same.

Any other comments?

This PR wraps up @dalmia's work in #8096

if issparse(y) and self.outputs_2d_:
    self.outputs_2d_ = 'sparse'
    self._y = y
    self.classes_ = [np.array([0, 1], dtype=np.int)] * y.shape[1]
Contributor Author

@jnothman were you asking for something like self.classes_ = set(y)?

Member

I just mean that, even if (for no particularly good reason) we only support y being sparse when it is multilabel (not general multioutput) someone might get a surprise by this assumption that it's all binary.

You can't just do set(y). If y is CSC or CSR, checking that np.all(np.union1d(y.data, [0, 1]) == [0, 1]) holds, and raising an error if it fails, might help someone unsuspecting.
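A minimal sketch of the suggested check, assuming a hypothetical helper name check_binary_sparse_y and an illustrative error message (not scikit-learn's actual text):

```python
import numpy as np
from scipy.sparse import csr_matrix


def check_binary_sparse_y(y):
    # Hypothetical helper: a CSR/CSC matrix only stores non-zero entries,
    # so y.data holds every value that could differ from 0.
    values = np.union1d(y.data, [0, 1])
    if not np.array_equal(values, [0, 1]):
        raise ValueError("Sparse multilabel y must contain only 0 and 1")


y_ok = csr_matrix(np.array([[0, 1, 1], [1, 0, 0]]))
check_binary_sparse_y(y_ok)  # passes: all stored values are 1
```

Anything other than 0/1 in y.data (say a label of 2) would enlarge the union and trigger the error.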

Member

But it may not be necessary. If you fix it, it needs a test.

in zip(pred_labels, weights[inliers])],
dtype=np.int)

mode = mode.ravel()
Contributor Author

@jnothman is mode = mode.ravel() right?

Member

yes, it's fine.

Member

Now it's not clear to me why it's necessary, i.e. why mode is not already 1d

Contributor Author

ravel() changes the shape of mode from (_, 1) to (_,). Removing it breaks lots of tests. Shall we remove it and fix the tests?

Member

I would be careful here. We should document that the input should always be 2D (to use axis=0 from the mode function), and ravel will then always return a 1D array.
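For illustration, the 2D-input convention could look like this (keepdims=True, available in scipy >= 1.9, makes the (1, n_samples) intermediate shape explicit; the data is made up):

```python
import numpy as np
from scipy import stats

# Labels of the k neighbours for each query point, one row per sample.
neigh_labels = np.array([[0, 1, 1],
                         [2, 2, 0]])

# With a 2-D input and axis=0, stats.mode returns one mode per column of
# the transposed array, i.e. one per sample, keeping a singleton axis.
mode, _ = stats.mode(neigh_labels.T, axis=0, keepdims=True)
mode = mode.ravel()  # (1, n_samples) -> (n_samples,)
```

The final ravel() is what normalizes the result to the 1D shape the callers expect.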

for k, classes_k in enumerate(classes_):
    mode = self._mode(_y[neigh_ind, k].toarray(), weights)
    y_pred_sparse_multilabel.append(csr_matrix(mode).T)
    y_pred_sparse_multilabel.append(csc_matrix(mode).T)
Member

This isn't right. You don't want to transpose the CSC. You want to build a CSC on the reshaped mode. Transposing on a numpy array is very cheap. Transposing a scipy.sparse matrix is more expensive (except for COO format).
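This can be seen directly: transposing a scipy CSC matrix hands back a CSR result, while transposing the dense mode array first is cheap and lets the CSC be built in its intended format (the mode row here is illustrative):

```python
import numpy as np
from scipy.sparse import csc_matrix

# Hypothetical (1, n_samples) mode row, as produced by the mode computation.
mode = np.array([[0, 1, 1, 0]])

a = csc_matrix(mode).T   # transposing the sparse matrix yields a CSR result
b = csc_matrix(mode.T)   # transpose the cheap ndarray, then build the CSC

print(a.format, b.format)  # csr csc
```

Both hold the same values; only the storage format differs.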

Member

I'd hope that tests don't pass...

Contributor Author

Cool: y_pred_sparse_multilabel.append(csc_matrix(mode.T))

Which test in sklearn/cluster/tests/ should break when using csc_matrix(mode).T? I ran nosetests sklearn/cluster/ and all tests pass. Shall we add a regression test somewhere?


for name in CLASSIFIERS:
    yield check_classifier_sparse_multilabel_y, name

def test_sparse_multilabel_y():
Contributor Author

@jnothman is this the test you meant?

Member

Remind me where I mentioned a test

Contributor Author

@jnothman
Member

jnothman commented Jun 8, 2017 via email

@jnothman jnothman mentioned this pull request Jun 12, 2017
@massich massich changed the title [WIP] Added support for sparse multilabel y for Nearest neighbor classifiers [MRG] Added support for sparse multilabel y for Nearest neighbor classifiers Jun 14, 2017

clf = neighbors.KNeighborsClassifier(n_neighbors=3)
assert_raises_regex(ValueError,
                    "Sparse y is only supported for multilabel")
Member

You need to call the function here, i.e.

assert_raises_regex(ValueError,
                    "Sparse y is only supported for multilabel",
                    clf.fit, csr_matrix(X), y)

def test_sparse_multilabel_y():
    rng = check_random_state(0)
    n_features = 2
    n_samples = 100
Member

It doesn't need to be this many.

    n_output = 3

    X = rng.rand(n_samples, n_features)
    y = rng.randint(0, 5, (n_samples, n_output))
Member

It doesn't need to be random. If it is random, we'd rather have a fixed random_state; on occasion this could generate a binary array.

Contributor Author

I thought that this was taken care of by rng = check_random_state(0). I moved the line closer for clarity.

Member

Indeed it was, thanks.

@jnothman jnothman added this to the 0.19 milestone Jun 14, 2017
Member

@glemaitre glemaitre left a comment

without checking the tests for the moment


- Add support for sparse multilabel ``y`` in :class:`NeighborsBase`
  :issue:`8057` by :user:`Aman Dalmia <dalmia>`, :user:`Joan Massich <massich>`.
- Added :class:`naive_bayes.ComplementNB`, which implements the Complement
Member

Add a blank line, and you probably have to move it to 0.20.

Contributor Author

it is under 0.20

check_classification_targets(y)
try:
    check_classification_targets(y)
except ValueError as e:
Member

Probably add a comment to explain that this is a specific case

if not self.outputs_2d_:
    y_pred = y_pred.ravel()

# Old versions of scipy hstack returns COO formatted matrix
y_pred = sparse.hstack(y_pred_sparse_multilabel).tocsc()
Member

sparse.hstack takes a format parameter:

sparse.hstack(y_pred_sparse_multilabel, format='csc')

Member

Note that this is for readability; the performance will be the same.
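A small self-contained sketch of the format= spelling, on made-up single-column predictions:

```python
import numpy as np
from scipy import sparse

# Three hypothetical single-column prediction matrices to stack side by side.
cols = [sparse.csc_matrix(np.array([[0], [1]])) for _ in range(3)]

# Equivalent to sparse.hstack(cols).tocsc(), but states the target format
# directly instead of converting the intermediate COO result.
stacked = sparse.hstack(cols, format='csc')

print(stacked.format, stacked.shape)  # csc (2, 3)
```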

if not self.outputs_2d_:
    y_pred = y_pred.ravel()

# Old versions of scipy hstack returns COO formatted matrix
y_pred = sparse.hstack(y_pred_sparse_multilabel).tocsc()
Member

use format


mode = mode.ravel()
for k, classes_k in enumerate(classes_):
    pred_labels = np.array([_y[ind, k].toarray()
Member

Why is this converted to object? If we have a sparse matrix, we have numerical values for sure. Then we don't even have to construct a dense matrix; we can just pass .data from the sparse matrix.

Member

In this case we also need to slice the weights.

Member

@jnothman jnothman Sep 14, 2017

So this is extracting a single column for all of the neighbors of each query. We are making it dense here, at a cost of (n_samples * avg_neighbors) memory. The sparse representation is to avoid a dense matrix of (n_samples * n_classes) where n_classes is presumed to be large. Yes, we can avoid making this dense, but I don't think it's especially problematic as long as avg_neighbors << n_classes.

We should be careful about constructing arrays of arrays/lists though. When the lengths of the arrays are all the same, this becomes a 2d array whose elements are ints, but still of dtype object due to this constructor. The safe way to do such construction is with np.empty(n, dtype='O') and then fill the array. But I don't see why pred_labels should be an array here at all.

Now that I've clarified that to myself, I'll go see what massich#5 has to say.
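The object-array pitfall described above can be shown in a few lines (illustrative data):

```python
import numpy as np

# When every row has the same length, np.array(..., dtype=object) silently
# produces a 2-D object array rather than a 1-D array of lists:
same = np.array([[1, 2], [3, 4]], dtype=object)

# The safe construction: allocate a 1-D object array, then fill it,
# which works whether or not the rows happen to have equal lengths.
ragged = np.empty(2, dtype=object)
ragged[0] = [1, 2, 3]
ragged[1] = [4]

print(same.shape, ragged.shape)  # (2, 2) (2,)
```

So code that assumes a 1-D object array breaks exactly in the equal-lengths corner case unless it uses the np.empty construction.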

Contributor Author

@massich massich Sep 15, 2017

Conclusion? We materialize the matrix here, and later we can modify the mode computation to propagate the sparse matrix if needed.
If so, we could use mode = 1 if (np.product(x.shape) - x.nnz) < x.nnz else 0

And regarding dtype=object, shall we put an assert that breaks when all the elements in pred_labels have the same length, just to flag this unlikely corner case? And open an issue to fix it at some point?
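The proposed shortcut can be sketched as a small function (hypothetical name binary_mode; np.prod stands in for the deprecated np.product, and it assumes the column is strictly binary):

```python
import numpy as np
from scipy.sparse import csr_matrix


def binary_mode(x):
    # For a binary sparse column the mode is 1 iff the stored (non-zero)
    # entries outnumber the implicit zeros.
    n_zeros = np.prod(x.shape) - x.nnz
    return 1 if n_zeros < x.nnz else 0


col = csr_matrix(np.array([[1], [1], [0]]))
print(binary_mode(col))  # 1: two ones vs one zero
```

This avoids densifying the column entirely, at the cost of only working for 0/1 data.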

Member

We should be careful about constructing arrays of arrays/lists though.

That is correct, so in the unlikely corner case that each sample has the same number of neighbours, the current code is probably broken.

But I don't see why pred_labels should be an array here at all.

The reason is that further down, we use fancy numpy indexing on this array (in pred_labels[inliers]), which does not work for a plain list. That can of course be converted to a list comprehension, but it might be a performance bottleneck (not sure this is the case for an object array).

Member

Actually, zip(pred_labels[inliers], weights[inliers]) works fine as well if it is a 2D array instead of a 1D object array, so I think we can leave this as is.

else:
    y_pred = np.empty((n_samples, n_outputs), dtype=classes_[0].dtype)
    for k, classes_k in enumerate(classes_):
        pred_labels = np.array([_y[ind, k] for ind in neigh_ind],
Member

I have the same concern as earlier regarding the sparse matrix.

Contributor Author

this one is not sparse.


def _mode(self, pred_labels, weights, inliers):
    if weights is None:
        mode = np.array([stats.mode(pl)[0]
Member

Why use a list comprehension instead of the axis option of stats.mode or weighted_mode?
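The two spellings give the same result; a sketch on illustrative neighbour labels (one row per query point):

```python
import numpy as np
from scipy import stats

# Illustrative neighbour labels, one row per query point.
pred_labels = np.array([[1, 1, 2],
                        [3, 3, 3],
                        [2, 4, 4]])

# Per-row mode via a list comprehension, as in the code under review:
by_rows = np.array([stats.mode(row)[0] for row in pred_labels]).ravel()

# The same result in a single vectorized call using the axis option:
vectorized = np.asarray(stats.mode(pred_labels, axis=1)[0]).ravel()

print(by_rows.tolist())  # [1, 3, 4]
```

The ravel() calls normalize the shapes, which differ across scipy versions depending on the keepdims default.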


    return y_pred

def _mode(self, pred_labels, weights, inliers):
Member

Move the function above where it is used.

Member

I would not pass inliers, because we take inliers from weights but not from pred_labels; it seems inconsistent.

We should have either pred_labels[inliers], weights[inliers] or pred_inliers, weight_inliers passed as arguments. I am more in favor of the second because it will be easier with sparse matrices.

@massich massich mentioned this pull request Sep 14, 2017

massich commented Sep 14, 2017

When addressing this comment by @glemaitre, I stumbled into some problems. I don't really know how to post a concrete question, so I collected my thoughts here as a PR to this PR, and I would love to get some feedback. Thx.

cc: @jnothman, @lesteve

@massich massich force-pushed the 8057 branch 2 times, most recently from 2192eb7 to c4a47e3 Compare September 15, 2017 13:19
The initial idea with @ogrisel was to change the ValueError's message to
something more descriptive like this:

```py
  raise ValueError("Unknown classification label type. Got: %r" % y)
```

The main problem is that such message breaks this:

```sh
  $ pytest sklearn/tests/test_common.py::test_non_meta_estimators
```
" supported). Got: %r" % y)
else:
raise
raise ValueError("Unknown label type: %r" % y)
Contributor Author


The initial idea with @ogrisel was to change the ValueError's message to something more descriptive like this:

  raise ValueError("Unknown classification label type. Got: %r" % y)

The main problem is that such message breaks this:

  $pytest sklearn/tests/test_common.py::test_non_meta_estimators

If needed I'll change the message in a separate PR, so that no common files are changed here.

@@ -4,6 +4,5758 @@
===============
Release History
===============

Version 0.20 (under development)
Member

You want to add your entry to v0.20.0.rst (or something like this) rather than change whats_new.rst.

@glemaitre
Member

@massich Could you solve the conflict and address the remaining comments?


massich commented Mar 15, 2018

@dorcoh feel free to take over this one! let me know if you need anything

mode, _ = weighted_mode(neigh, weights, axis=1)

mode = np.asarray(mode.ravel(), dtype=np.intp)
return mode
Member


Style:

return np.asarray(mode.ravel(), dtype=np.intp)
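The snippet above can be run on made-up data to see the weighted voting at work (labels and weights are illustrative):

```python
import numpy as np
from sklearn.utils.extmath import weighted_mode

# Illustrative neighbour labels and weights, one row per query point.
neigh = np.array([[0, 1, 1],
                  [2, 0, 2]])
weights = np.array([[0.3, 0.4, 0.4],
                    [0.5, 2.0, 0.5]])

# Row 0: label 1 has total weight 0.8 vs 0.3 for label 0 -> mode is 1.
# Row 1: label 0 has total weight 2.0 vs 1.0 for label 2 -> mode is 0.
mode, _ = weighted_mode(neigh, weights, axis=1)
mode = np.asarray(mode.ravel(), dtype=np.intp)
print(mode.tolist())  # [1, 0]
```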


Successfully merging this pull request may close these issues.

Nearest neighbors classifiers should support sparse multilabel Y

10 participants