[MRG+1] Added support for sparse multilabel y for Nearest neighbor classifiers #8096
Conversation
jnothman left a comment
The point of a sparse data structure is that, if the data are truly sparse, it takes much less memory than the dense equivalent. Imagine we have very many outputs in a multilabel problem: storing the n_outputs x n_samples matrix is wasteful. Here you use dense versions as well as sparse versions, so this solution is not helpful at all.
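To make the memory argument concrete, here is a rough, hypothetical illustration (not from the PR): a binary multilabel target with roughly 0.1% positive labels stored dense versus as CSC. The shapes and density are made up for the example.

```python
import numpy as np
from scipy import sparse

n_samples, n_outputs, n_positive = 10_000, 5_000, 50_000
rng = np.random.RandomState(0)
rows = rng.randint(0, n_samples, size=n_positive)
cols = rng.randint(0, n_outputs, size=n_positive)
y_sparse = sparse.csc_matrix((np.ones(n_positive, dtype=np.int8), (rows, cols)),
                             shape=(n_samples, n_outputs))
y_sparse.data[:] = 1  # duplicate coordinates were summed; clip back to binary

dense_bytes = n_samples * n_outputs * y_sparse.dtype.itemsize        # ~50 MB if densified
sparse_bytes = (y_sparse.data.nbytes + y_sparse.indices.nbytes
                + y_sparse.indptr.nbytes)                            # well under 1 MB
print(dense_bytes, sparse_bytes)
```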
sklearn/neighbors/base.py (Outdated)

    self._issparse = issparse(y)
    if(issparse(y) and self.outputs_2d_):
        y = y.toarray()
we should not make a sparse array dense unnecessarily.
sklearn/neighbors/classification.py (Outdated)

    for (pl, w)
     in zip(pred_labels[inliers], weights[inliers])],
     in zip(pred_labels[inliers],
            weights[inliers])],
bad indentation
@jnothman Thank you for giving me such a clear explanation. I'll think of some other way to implement it.
When a sparse multilabel y is passed to fit:

    classes, self._y[:, k] = np.unique(y[:, k], return_inverse=True)

The above line creates a lot of problems:

    >>> np.unique(y_train[:, 3], return_inverse=True)
    (array([ <75x1 sparse matrix of type '<type 'numpy.int64'>'
        with 36 stored elements in Compressed Sparse Row format>], dtype=object),
     array([0]))

This assigns classes the value of the whole column of y_train, and thus the value of y is not stored as self._y. For the sparse multilabel case, I am therefore trying to do something else separately.
Thought of something. It was only while implementing it that I clearly understood what you had mentioned when introducing the issue. Learnt something new today :)
Travis is showing some unexpected error. Tests pass locally.

    aman@aman:/media/aman/BE66ECBA66EC7515/Open Source/scikit-learn$ nosetests sklearn/neighbors/tests/test_neighbors.py
    ...............................................
    ----------------------------------------------------------------------
    Ran 47 tests in 2.668s

    OK
Yes, that code won't work neatly for both sparse and dense.
But you can use np.unique() on subsets of data in a CSC. Go read how CSC works internally and see if you can puzzle it out.
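For context, a sketch (not the PR's code) of what this hint points at: CSC stores `data`, `indices` and `indptr` arrays, so the stored values of column k can be sliced out directly and passed to np.unique without densifying anything; implicit zeros just need to be accounted for separately.

```python
import numpy as np
from scipy import sparse

y = sparse.csc_matrix(np.array([[0, 2],
                                [1, 2],
                                [0, 3]]))

k = 0
start, end = y.indptr[k], y.indptr[k + 1]
explicit = y.data[start:end]            # stored (nonzero) entries of column k
classes_k = np.unique(explicit)
if end - start < y.shape[0]:            # the column also has implicit zeros
    classes_k = np.union1d(classes_k, [0])
print(classes_k)                        # [0 1] for column 0
```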
@jnothman Yes, I had spent the whole of yesterday figuring out what was happening behind the scenes. It might have escaped your notice, but I have indeed made another commit fixing the issue. I am no longer converting to dense at any point, and I think it might just work. Please review.
Travis says you're constructing a matrix with …
jnothman left a comment
What about RadiusNeighborsClassifier?
sklearn/neighbors/base.py (Outdated)

    self.classes_.append(classes)
    if self._issparse and self.outputs_2d_:
        self._y = y
        for k in range(self._y.shape[1]):
just self.classes_ = [np.array([0, 1], dtype=np.int)] * range(y.shape[1])?
I don't think range should be used here. Other than that, this is much better. Thanks.
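For clarity, a quick check of the corrected expression (multiplying the one-element list by an integer, not by a range):

```python
import numpy as np

n_outputs = 3
classes_ = [np.array([0, 1], dtype=np.intp)] * n_outputs
# -> [array([0, 1]), array([0, 1]), array([0, 1])]
# Multiplying the list by range(n_outputs) instead would raise a TypeError.
```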
sklearn/neighbors/classification.py (Outdated)

    if self._issparse:
        y_pred_sparse = []

    for k, classes_k in enumerate(classes_):
I think you want two separate code-paths for this routine. Putting the condition inside the loop makes the code pretty messy.
sklearn/neighbors/classification.py (Outdated)

    if self._issparse and self.outputs_2d_:
        if weights is None:
            mode, _ = stats.mode(_y[neigh_ind, k].toarray(), axis=1)
It might be interesting to implement a sparse (weighted) mode, but this is fine for now. It requires O(n_samples x n_neighbors) memory, where sparse is being used to avoid O(n_samples x n_outputs) memory.
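As a rough illustration of what a sparse-aware weighted mode could look like for one binary output column (a sketch only; the names y_col, neigh_ind and weights are illustrative, not the PR's API):

```python
import numpy as np

def binary_weighted_mode(y_col, neigh_ind, weights=None):
    """Weighted mode for one binary output column stored as a sparse (n_samples, 1) matrix.

    Only this single column is densified, so memory stays O(n_samples),
    never O(n_samples x n_outputs).
    """
    labels = np.asarray(y_col.todense()).ravel()[neigh_ind]  # (n_queries, n_neighbors) in {0, 1}
    if weights is None:
        weights = np.ones_like(labels, dtype=float)
    ones_weight = (weights * labels).sum(axis=1)   # total weight voting for label 1
    zero_weight = weights.sum(axis=1) - ones_weight
    return (ones_weight > zero_weight).astype(np.intp)
```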
I thought that once we have a layout for …
sklearn/neighbors/base.py (Outdated)

    check_classification_targets(y)

    self._issparsemultilabel = issparse(y) and self.outputs_2d_
I don't get why this should be stored
Or perhaps we should store outputs_2d_ = 'sparse'
Yes, we could store it that way. It does save the need for another variable.
sklearn/neighbors/classification.py (Outdated)

    y_pred = y_pred.ravel()

    if self._issparsemultilabel:
        y_pred = hstack(y_pred_sparse_multilabel)
do this in the if/else above
sklearn/neighbors/classification.py (Outdated)

    mode = np.asarray(mode.ravel(), dtype=np.intp)
    y_pred[:, k] = classes_k.take(mode)

    if not self.outputs_2d_:
put this in the else clause
sklearn/neighbors/classification.py (Outdated)

    n_samples = X.shape[0]
    weights = _get_weights(neigh_dist, self.weights)

    y_pred = np.empty((n_samples, n_outputs), dtype=classes_[0].dtype)
You shouldn't be constructing this dense output array; it defeats the purpose of using a sparse (low-memory) structure.
The nosetests run fine locally. Could you please tell me why they differ here?
From the look of which tests are failing (I haven't checked the log as I'm on a slow connection), could it be that your code is not Python 2-friendly?
That shouldn't be the case, as I ran my tests on Python 2.7.
The problem here is related to indexing into a sparse matrix, but I've been unable to reproduce the error, so it's a bit hard to debug.
sklearn/neighbors/classification.py (Outdated)

    if not self.outputs_2d_:
        y_pred = y_pred.ravel()
    if weights is None:
        mode, _ = stats.mode(_y[neigh_ind, k].toarray(), axis=1)
could you please confirm what neigh_ind and k are exactly so that we can debug what's not working?
neigh_ind here is the array of indices of the k nearest neighbors, and k signifies the k-th column of the y passed to fit.
One of the issues may derive from `neigh_ind` having 0 elements, and that case not being handled in scipy (indeed, we may still support a version of scipy where sparse matrices with 0 elements on one axis were not supported at all).
Yes, that is what is indeed creating these errors. I logged the value of …
Any workaround that you may have for getting it to work?
Added a failure for scipy < 0.13. Please review.
jnothman left a comment
I think there should be a way to avoid so much repetition in the code but still have it elegant... I'm just too tired to suggest something specific. :|
sklearn/neighbors/classification.py (Outdated)

    mode, _ = weighted_mode(_y[neigh_ind, k], weights, axis=1)
    if self.outputs_2d_ == 'sparse':
        if StrictVersion(version.version) < StrictVersion('0.13.0'):
            raise EnvironmentError('Sparse multilabel y passed in fit. '
Do this in fit.
    clf.fit(X_train, y_train)

    if (name == 'KNeighborsClassifier' and
            StrictVersion(version.version) < StrictVersion('0.13.0')):
StrictVersion(scipy.__version__) would be clearer
I'll try to come up with something then.
Added the mode calculation in a helper function, which greatly reduces the repetitive code. Please have a look.
jnothman left a comment
yes, making _mode a helper is clearer.
sklearn/neighbors/classification.py (Outdated)

    if not self.outputs_2d_:
        y_pred = y_pred.ravel()
    y_pred = hstack(y_pred_sparse_multilabel)
I'd rather see this as sparse.hstack. Thus from scipy import sparse
sklearn/neighbors/classification.py (Outdated)

    y_pred_k[inliers] = classes_k.take(mode)

    y_pred[inliers, k] = classes_k.take(mode)
    if outliers:
Make this if outliers and self.outlier_label != 0
sklearn/neighbors/classification.py (Outdated)

    pred_labels = np.array([_y[ind, k].toarray()
                            for ind in neigh_ind[inliers]],
                           dtype=object)
    y_pred_k = np.zeros(n_samples)
Use an integer dtype, please
sklearn/neighbors/classification.py (Outdated)

                           dtype=object)
    y_pred_k = np.zeros(n_samples)
    mode = self._mode(pred_labels, weights, inliers)
    y_pred_k[inliers] = classes_k.take(mode)
I think in the multilabel case, classes should be [0, 1]... I might be wrong.
sklearn/neighbors/classification.py (Outdated)

    if outliers:
        y_pred[outliers, :] = self.outlier_label
    y_pred_sparse_multilabel.append(csc_matrix(y_pred_k).T)
this doesn't make sense in terms of efficient data structures. I think you mean csr_matrix(y_pred_k).T, or you mean csc_matrix(y_pred_k.reshape(-1, 1))
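A small, standalone illustration of the reviewer's point (the name y_pred_k is reused only for illustration): both spellings produce the same (n, 1) column, but building the CSC directly from the reshaped array avoids the row-vector-then-transpose detour.

```python
import numpy as np
from scipy import sparse

y_pred_k = np.array([0, 1, 1, 0, 1])

col_a = sparse.csr_matrix(y_pred_k).T                # (1, 5) row vector, then transposed
col_b = sparse.csc_matrix(y_pred_k.reshape(-1, 1))   # (5, 1) column built directly

assert col_a.shape == col_b.shape == (5, 1)
assert (col_a != col_b).nnz == 0                     # same stored values
```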
sklearn/neighbors/classification.py (Outdated)

    for k, classes_k in enumerate(classes_):
        mode = self._mode(_y[neigh_ind, k].toarray(), weights)
        y_pred_sparse_multilabel.append(
            csc_matrix(classes_k.take(mode)).T)
I think in the multilabel case, classes should be [0, 1]... I might be wrong.
this doesn't make sense in terms of efficient data structures. I think you mean csr_matrix(mode).T, or you mean csc_matrix(mode.reshape(-1, 1))
Oh yes, I missed that.
                            in zip(pred_labels, weights[inliers])],
                           dtype=np.int)

    mode = mode.ravel()
why is this necessary?
The mode returned by the operation above it is of shape (len(inliers),1).
ping @jnothman
I'd like to upvote merging this; I could use more of this support, as scikit-multilearn can feed sparse matrices to scikit-learn.
jnothman left a comment
Some very minor changes.
I suppose I'll see if someone else wants to wrap this up.
    y_pred[:, k] = classes_k.take(mode)
    for k, classes_k in enumerate(classes_):
        mode = self._mode(_y[neigh_ind, k].toarray(), weights)
        y_pred_sparse_multilabel.append(csr_matrix(mode).T)
I think it makes more sense to construct a CSC directly from the reshaped array.
    self._y = self._y.ravel()

    if issparse(y) and self.outputs_2d_:
        if StrictVersion(scipy.__version__) < StrictVersion('0.13.0'):
We no longer support the broken scipy
                                   'scipy < 0.13')
        self.outputs_2d_ = 'sparse'
        self._y = y
        self.classes_ = [np.array([0, 1], dtype=np.int)] * y.shape[1]
Perhaps we should check that the data really doesn't contain other values.
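A possible shape for that check (a sketch, not part of the PR): since CSR/CSC matrices may store explicit zeros, inspecting the data array directly is enough to confirm the target is binary before hard-coding classes_ to [0, 1].

```python
import numpy as np

def _check_binary_sparse_y(y):
    # y is a scipy CSR/CSC matrix; its stored values live in y.data.
    if y.data.size and not np.isin(y.data, [0, 1]).all():
        raise ValueError("Sparse multilabel y must contain only 0 and 1.")
```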
    if outliers:
        y_pred[outliers, :] = self.outlier_label
    y_pred_sparse_multilabel.append(csr_matrix(y_pred_k).T)
same here: construct CSC from reshaped array
    y_sparse = csc_matrix(y)
    X_train, X_test, y_train, y_test = train_test_split(X, y_sparse,
                                                         random_state=0)
    if StrictVersion(scipy.__version__) < StrictVersion('0.13.0'):
Remove now
Hi @jnothman, can I finish this?
Thanks Dor. It could do with those small changes I requested above, but then it will also need a second review before merge. The code changes aren't supremely readable, so any improvements to code quality would help with a second review.
Hi, I just noticed there's another PR for that issue: #9059
indeed our record keeping is pretty terrible
Reference Issue
Fixes #8057
What does this implement/fix? Explain your changes.
Added support for sparse multilabel y for nearest neighbor classifiers. Firstly, it checks whether the input to fit is sparse + multilabel and converts it to a dense one for storing. It also stores another parameter indicating whether the original input was sparse + multilabel or not. Then, in predict, if this stored value is true, it converts y_pred to sparse CSC. Also, added tests for the same.
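For reference, a usage sketch of the behaviour this PR aims to enable (dataset, shapes and parameters are illustrative only, and this works as described only with the changes from this PR, or equivalent support, applied):

```python
from scipy.sparse import csc_matrix, issparse
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_multilabel_classification(n_samples=100, n_classes=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, csc_matrix(y), random_state=0)

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)        # sparse multilabel y accepted with this PR
y_pred = clf.predict(X_test)     # returned as sparse CSC with this PR
print(issparse(y_pred))
```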