[MRG+1] Added support for sparse multilabel y for Nearest neighbor classifiers #9059
base: main
Conversation
sklearn/neighbors/base.py
Outdated
```py
if issparse(y) and self.outputs_2d_:
    self.outputs_2d_ = 'sparse'
    self._y = y
    self.classes_ = [np.array([0, 1], dtype=np.int)] * y.shape[1]
```
@jnothman were you asking for something like self.classes_ = set(y)?
I just mean that, even if (for no particularly good reason) we only support y being sparse when it is multilabel (not general multioutput) someone might get a surprise by this assumption that it's all binary.
You can't just do set(y). If y is CSC or CSR, checking that np.all(np.union1d(y.data, [0, 1]) == [0, 1]) holds, and raising an error if it fails, might help someone unsuspecting.
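A minimal sketch of the suggested check. The helper name check_binary_sparse is hypothetical (not part of scikit-learn), and np.isin is used as a shape-safe equivalent of the union1d comparison above:

```python
import numpy as np
from scipy.sparse import csr_matrix

def check_binary_sparse(y):
    # Hypothetical helper following the suggestion above: for a CSR/CSC y,
    # y.data holds the explicitly stored values, so any value outside
    # {0, 1} means y is not binary multilabel data.
    if not np.all(np.isin(y.data, [0, 1])):
        raise ValueError("Sparse y is only supported for binary multilabel data")

check_binary_sparse(csr_matrix(np.array([[0, 1, 0], [1, 0, 1]])))  # passes

try:
    check_binary_sparse(csr_matrix(np.array([[0, 2, 0], [1, 0, 3]])))
except ValueError as exc:
    print(exc)  # flags the non-binary values 2 and 3
```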
But it may not be necessary. If you fix it, it needs a test.
sklearn/neighbors/classification.py
Outdated
```py
                 in zip(pred_labels, weights[inliers])],
                dtype=np.int)

mode = mode.ravel()
```
@jnothman is mode = mode.ravel() right?
yes, it's fine.
Now it's not clear to me why it's necessary, i.e. why mode is not already 1d
ravel() changes the shape of mode from (_, 1) to (_,). Removing it breaks lots of tests. Shall we remove it and fix the tests?
I would be careful here. We should document that the input should always be 2D (to use axis=0 in the mode function), and ravel will then always return a 1D array.
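For context, a small illustration of the shapes involved (assuming SciPy >= 1.9, where stats.mode accepts a keepdims argument):

```python
import numpy as np
from scipy import stats

# One row of neighbor labels per query point.
votes = np.array([[0, 1, 1],
                  [2, 2, 0]])

# keepdims=True keeps the reduced axis, giving an (n_queries, 1) column.
mode, _ = stats.mode(votes, axis=1, keepdims=True)
print(mode.shape)          # (2, 1)
print(mode.ravel().shape)  # (2,) -- ravel flattens to the 1D result
```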
sklearn/neighbors/classification.py
Outdated
```diff
 for k, classes_k in enumerate(classes_):
     mode = self._mode(_y[neigh_ind, k].toarray(), weights)
-    y_pred_sparse_multilabel.append(csr_matrix(mode).T)
+    y_pred_sparse_multilabel.append(csc_matrix(mode).T)
```
This isn't right. You don't want to transpose the CSC. You want to build a CSC on the reshaped mode. Transposing on a numpy array is very cheap. Transposing a scipy.sparse matrix is more expensive (except for COO format).
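A small sketch of the two constructions being compared (illustrative values, not the PR's code):

```python
import numpy as np
from scipy.sparse import csc_matrix

mode = np.array([1, 0, 1])  # 1-d predictions for one label column

# Suggested: reshape the dense array (cheap, just a view), then build CSC.
col_good = csc_matrix(mode.reshape(-1, 1))

# Flagged pattern: build a sparse 1 x n matrix, then transpose it;
# transposing a CSC matrix yields CSR (a format change, not just a view).
col_bad = csc_matrix(mode).T

print(col_good.format, col_bad.format)  # csc csr
```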
I'd hope that tests don't pass...
Cool:

```py
y_pred_sparse_multilabel.append(csc_matrix(mode.T))
```

Which test in sklearn/cluster/tests/ should break when using csc_matrix(mode).T? I ran nosetests sklearn/cluster/ and all tests pass. Shall we add a regression test somewhere?
```py
for name in CLASSIFIERS:
    yield check_classifier_sparse_multilabel_y, name


def test_sparse_multilabel_y():
```
@jnothman is this the test you meant?
Remind me where I mentioned a test.
I don't think mode.T will actually work, because mode is 1d.
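A two-line demonstration of why:

```python
import numpy as np

mode = np.array([1, 0, 1])         # 1-d, e.g. after mode.ravel()
print(mode.T.shape)                # (3,) -- .T on a 1-d array is a no-op
print(mode.reshape(-1, 1).shape)   # (3, 1) -- an explicit reshape is needed
```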
```py
clf = neighbors.KNeighborsClassifier(n_neighbors=3)
assert_raises_regex(ValueError,
                    "Sparse y is only supported for multilabel")
```
You need to call the function here, i.e.

```py
assert_raises_regex(ValueError,
                    "Sparse y is only supported for multilabel",
                    clf.fit, csr_matrix(X), y)
```

```py
def test_sparse_multilabel_y():
    rng = check_random_state(0)
    n_features = 2
    n_samples = 100
```
It doesn't need to be this many.
```py
n_output = 3

X = rng.rand(n_samples, n_features)
y = rng.randint(0, 5, (n_samples, n_output))
```
It doesn't need to be random. If it is random, we'd rather have a fixed random_state: on occasion this could generate a binary array.
I thought that this was taken care of by rng = check_random_state(0). I moved the line closer for clarity.
Indeed it was, thanks.
glemaitre left a comment
without checking the tests for the moment
```rst
- Add support for sparse multilabel ``y`` in :class:`NeighborsBase`
  :issue:`8057` by :user:`Aman Dalmia <dalmia>`, :user:`Joan Massich <massich>`.
- Added :class:`naive_bayes.ComplementNB`, which implements the Complement
```
Add a line, and you probably have to move it to 0.20.
it is under 0.20
sklearn/neighbors/base.py
Outdated
```diff
-check_classification_targets(y)
+try:
+    check_classification_targets(y)
+except ValueError as e:
```
Probably add a comment to explain that this is a specific case
sklearn/neighbors/classification.py
Outdated
```py
if not self.outputs_2d_:
    y_pred = y_pred.ravel()
# Old versions of scipy hstack returns COO formatted matrix
y_pred = sparse.hstack(y_pred_sparse_multilabel).tocsc()
```
sparse.hstack takes a format parameter:

```py
sparse.hstack(y_pred_sparse_multilabel, format='csc')
```
note that this is for readability, the performance will be the same
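A quick sketch of the format parameter in action (illustrative single-column blocks):

```python
import numpy as np
from scipy import sparse

# Two single-column predictions, as built per label in the loop above.
cols = [sparse.csc_matrix(np.array([[1], [0]])),
        sparse.csc_matrix(np.array([[0], [1]]))]

# format='csc' returns CSC directly instead of COO followed by .tocsc().
stacked = sparse.hstack(cols, format='csc')
print(stacked.format)    # 'csc'
print(stacked.toarray())
```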
sklearn/neighbors/classification.py
Outdated
```py
if not self.outputs_2d_:
    y_pred = y_pred.ravel()
# Old versions of scipy hstack returns COO formatted matrix
y_pred = sparse.hstack(y_pred_sparse_multilabel).tocsc()
```
use format
```py
mode = mode.ravel()
for k, classes_k in enumerate(classes_):
    pred_labels = np.array([_y[ind, k].toarray()
```
Why is this converted to object? If we have a sparse matrix, we have numerical values for sure.
Then we don't even have to construct a dense matrix; we can just pass *.data from the sparse matrix.
In this case we also need to slice the weights.
So this is extracting a single column for all of the neighbors of each query. We are making it dense here, at a cost of (n_samples * avg_neighbors) memory. The sparse representation is to avoid a dense matrix of (n_samples * n_classes) where n_classes is presumed to be large. Yes, we can avoid making this dense, but I don't think it's especially problematic as long as avg_neighbours << n_classes.
We should be careful about constructing arrays of arrays/lists though. When the lengths of the arrays are all the same, this becomes a 2d array whose elements are ints, but still of dtype object due to this constructor. The safe way to do such construction is with np.empty(n, dtype='O') and then fill the array. But I don't see why pred_labels should be an array here at all.
Now that I've clarified that to myself, I'll go see what massich#5 has to say.
Conclusion? We materialize the matrix here, and later we can modify the mode computation to propagate the sparse matrix if needed.
If so, we could use mode = 1 if (np.prod(x.shape) - x.nnz) < x.nnz else 0.
And regarding dtype=object, shall we put an assert that breaks when all the elements in pred_labels have the same length? Just to flag this unlikely corner case. And we open an issue to fix it at some point?
We should be careful about constructing arrays of arrays/lists though.
That is correct, so in the unlikely corner case that each sample has the same number of neighbours, the current code is probably broken.
But I don't see why pred_labels should be an array here at all.
The reason is that further down we use fancy numpy indexing with a list on this array (in pred_labels[inliers]), which does not work for a list. That can of course be converted to a list comprehension, but it might be a performance bottleneck (not sure this is the case for an object array).
Actually, zip(pred_labels[inliers], weights[inliers]) works fine as well if it is a 2D array instead of a 1D object array. So I think we can leave this as is.
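A sketch of the construction corner case discussed above (illustrative arrays, not the PR's data):

```python
import numpy as np

same_len = [np.array([0, 1]), np.array([2, 3])]

# Naive construction: with equal-length rows this becomes a 2-d array of
# scalar objects, not a 1-d array of arrays -- the corner case noted above.
naive = np.array(same_len, dtype=object)
print(naive.shape)  # (2, 2)

# Safe construction: preallocate a 1-d object array and fill it.
safe = np.empty(len(same_len), dtype=object)
for i, row in enumerate(same_len):
    safe[i] = row
print(safe.shape)   # (2,)
print(safe[0])      # the first row, kept as an array
```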
```py
else:
    y_pred = np.empty((n_samples, n_outputs), dtype=classes_[0].dtype)
    for k, classes_k in enumerate(classes_):
        pred_labels = np.array([_y[ind, k] for ind in neigh_ind],
```
I have the same concern as earlier regarding the sparse matrix.
this one is not sparse.
sklearn/neighbors/classification.py
Outdated
```py
def _mode(self, pred_labels, weights, inliers):
    if weights is None:
        mode = np.array([stats.mode(pl)[0]
```
Why use a list comprehension instead of the axis option of stats.mode or weighted_mode?
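A sketch of the vectorized alternative (illustrative data; assumes SciPy >= 1.9 for keepdims, and sklearn.utils.extmath.weighted_mode):

```python
import numpy as np
from scipy import stats
from sklearn.utils.extmath import weighted_mode

pred_labels = np.array([[0, 1, 1],
                        [2, 2, 0]])
weights = np.array([[1.0, 1.0, 1.0],
                    [1.0, 1.0, 5.0]])

# Vectorized across rows, instead of a Python-level list comprehension.
mode, _ = stats.mode(pred_labels, axis=1, keepdims=True)
wmode, _ = weighted_mode(pred_labels, weights, axis=1)

print(mode.ravel())              # unweighted row modes
print(np.asarray(wmode).ravel()) # the heavy weight flips the second row to 0
```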
sklearn/neighbors/classification.py
Outdated
```py
    return y_pred


def _mode(self, pred_labels, weights, inliers):
```
Move the function above where it is used.
I would not pass inliers, because we take inliers from weights but not from pred_labels; it seems inconsistent.
We should pass either pred_labels[inliers], weights[inliers] or pred_inliers, weight_inliers as arguments.
I am more in favor of the second, because it will be easier with sparse matrices.
When addressing this comment by @glemaitre, I stumbled into some problems. I don't really know how to post a concrete question, so I collected my thoughts as a PR to this PR, and I would love to get some feedback. Thx.
Force-pushed 2192eb7 to c4a47e3 (compare)
```py
                     " supported). Got: %r" % y)
    else:
        raise
raise ValueError("Unknown label type: %r" % y)
```
The initial idea with @ogrisel was to change the ValueError's message to something more descriptive, like this:

```py
raise ValueError("Unknown classification label type. Got: %r" % y)
```

The main problem is that such a message breaks this:

```sh
$ pytest sklearn/tests/test_common.py::test_non_meta_estimators
```

If needed I'll change the message in a separate PR, so that no common files are changed here.
```
@@ -4,6 +4,5758 @@
===============
Release History
===============


Version 0.20 (under development)
```
You want to add your entry to v0.20.0.rst (or something like this) rather than change whats_new.rst.
@massich Could you solve the conflict and address the remaining comments?
@dorcoh feel free to take over this one! Let me know if you need anything.
```py
mode, _ = weighted_mode(neigh, weights, axis=1)

mode = np.asarray(mode.ravel(), dtype=np.intp)
return mode
```
Style:

```py
return np.asarray(mode.ravel(), dtype=np.intp)
```
Reference Issue
Fixes #8057
What does this implement/fix? Explain your changes.
Added support for sparse multilabel y for nearest neighbor classifiers. First, fit checks whether its input is sparse and multilabel, and converts it to a dense array for storage; it also stores a flag indicating whether the original input was sparse multilabel. Then, in predict, if this flag is set, y_pred is converted to sparse CSC.
Also, added tests for the same.
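An illustrative sketch of that flow. This is a simplified stand-in, not the actual scikit-learn implementation: the class name and the echo-style prediction are made up for the demo; only the sparse-in/sparse-out bookkeeping mirrors the description above.

```python
import numpy as np
from scipy.sparse import csc_matrix, issparse

class SparseMultilabelDemo:
    """Hypothetical estimator shell: densify a sparse multilabel y in fit,
    remember that it was sparse, and re-sparsify predictions in predict."""

    def fit(self, X, y):
        self.y_was_sparse_ = issparse(y)
        # Store a dense copy for the actual neighbor computations.
        self._y = y.toarray() if self.y_was_sparse_ else np.asarray(y)
        return self

    def predict(self, X):
        y_pred = self._y  # stand-in: echo the stored training labels
        return csc_matrix(y_pred) if self.y_was_sparse_ else y_pred

y = csc_matrix(np.array([[0, 1], [1, 0]]))
est = SparseMultilabelDemo().fit(None, y)
out = est.predict(None)
print(out.format)  # 'csc' -- sparse in, sparse out
```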
Any other comments?
This PR wraps up @dalmia's work in #8096