
Conversation

@dalmia (Contributor) commented Dec 21, 2016

Reference Issue

Fixes #8057

What does this implement/fix? Explain your changes.

Added support for sparse multilabel y for nearest neighbor classifiers. First, fit checks whether the input y is sparse and multilabel and, if so, converts it to a dense array for storage. It also stores a flag indicating whether the original input was sparse and multilabel. Then, in predict, if this flag is set, y_pred is converted to sparse CSC.
Tests covering this behavior are also added.

@jnothman (Member) left a comment:

The point of a sparse data structure is that, if the data is truly sparse, it takes much less memory than the dense equivalent. Imagine we have very many outputs in a multilabel problem. Storing the n_outputs x n_samples matrix densely is wasteful. Here you use dense versions as well as sparse versions, so this solution is not helpful at all.
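The memory argument above is easy to see with a toy example (the sizes and density here are illustrative only):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical multilabel target: 1000 samples, 500 outputs, ~1% positive labels.
rng = np.random.RandomState(0)
dense_y = (rng.rand(1000, 500) < 0.01).astype(np.int64)
sparse_y = csr_matrix(dense_y)

dense_bytes = dense_y.nbytes
sparse_bytes = (sparse_y.data.nbytes
                + sparse_y.indices.nbytes
                + sparse_y.indptr.nbytes)
print(dense_bytes, sparse_bytes)  # the sparse form is dramatically smaller
```

Keeping a dense copy alongside the sparse one, as the first commit did, forfeits exactly this saving.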


self._issparse = issparse(y)
if(issparse(y) and self.outputs_2d_):
y = y.toarray()
Member:

we should not make a sparse array dense unnecessarily.

for (pl, w)
in zip(pred_labels[inliers], weights[inliers])],
in zip(pred_labels[inliers],
weights[inliers])],
Member:

bad indentation

@dalmia (Contributor Author) commented Dec 21, 2016

@jnothman Thank you for giving me such a clear explanation. I'll think of another way to implement it.

@dalmia (Contributor Author) commented Dec 21, 2016

When a sparse multilabel y is passed to fit:

classes, self._y[:, k] = np.unique(y[:, k], return_inverse=True)

The above line creates a lot of problems:

>>> np.unique(y_train[:, 3], return_inverse=True)
(array([ <75x1 sparse matrix of type '<type 'numpy.int64'>'
 	with 36 stored elements in Compressed Sparse Row format>], dtype=object),
 array([0]))

That line assigns classes the sparse column object itself, so the values of y never get stored properly in self._y. For the sparse multilabel case, I am therefore trying to handle things separately.
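A minimal reproduction of the behavior described above (the toy y_train here is illustrative):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Illustrative sparse multilabel target.
y_train = csr_matrix(np.array([[0, 1], [1, 0], [1, 1]]))

col = y_train[:, 0]   # slicing yields a 3x1 sparse matrix, not a 1-D ndarray,
print(col.shape)      # (3, 1) -- which is why np.unique misbehaves on it

# np.unique behaves as intended only on a dense 1-D array:
classes, inverse = np.unique(col.toarray().ravel(), return_inverse=True)
print(classes, inverse)  # [0 1] [0 1 1]
```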

@dalmia (Contributor Author) commented Dec 21, 2016

Thought of something. It was only while implementing it that I clearly understood what you had mentioned when introducing the issue. Learned something new today :)

@dalmia (Contributor Author) commented Dec 21, 2016

Travis is showing an unexpected error; the tests pass locally.

aman@aman:/media/aman/BE66ECBA66EC7515/Open Source/scikit-learn$ nosetests sklearn/neighbors/tests/test_neighbors.py 
...............................................
----------------------------------------------------------------------
Ran 47 tests in 2.668s

OK

@jnothman (Member) commented Dec 21, 2016 via email

@jnothman (Member) commented Dec 21, 2016 via email

@dalmia (Contributor Author) commented Dec 22, 2016

@jnothman Yes, I had spent the whole of yesterday figuring out what was happening behind the scenes. It might have escaped your notice, but I have indeed made another commit fixing the issue. I am no longer converting to dense at any point, and I think it might just work. Please review.

@jnothman (Member):

Travis says you're constructing a matrix with data, indices, or indptr that is not 1-dimensional.

@jnothman (Member) left a comment:

What about RadiusNeighborsClassifier?

self.classes_.append(classes)
if self._issparse and self.outputs_2d_:
self._y = y
for k in range(self._y.shape[1]):
Member:

just self.classes_ = [np.array([0, 1], dtype=np.int)] * range(y.shape[1])?

Contributor Author:

I don't think range should be used here: multiplying a list by range raises a TypeError, so it should be multiplied by the integer y.shape[1] instead. Other than that, this is much better. Thanks.
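For illustration, the corrected one-liner multiplies the list by the integer column count rather than by range (n_outputs here is a stand-in for y.shape[1]):

```python
import numpy as np

n_outputs = 4  # hypothetical number of label columns, i.e. y.shape[1]
# Multiply the list by the integer itself, not by range(...):
classes_ = [np.array([0, 1], dtype=int)] * n_outputs
print(len(classes_))  # 4
```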

if self._issparse:
y_pred_sparse = []

for k, classes_k in enumerate(classes_):
Member:

I think you want two separate code-paths for this routine. Putting the condition inside the loop makes the code pretty messy.


if self._issparse and self.outputs_2d_:
if weights is None:
mode, _ = stats.mode(_y[neigh_ind, k].toarray(), axis=1)
Member:

It might be interesting to implement a sparse (weighted) mode, but this is fine for now. This requires O(n_samples x n_neighbors) memory, where sparsity is being used to avoid O(n_samples x n_outputs) memory.

@dalmia (Contributor Author) commented Dec 29, 2016

I thought that once we have a layout for KNeighborsClassifier set, we can extend it to RadiusNeighborsClassifier more gracefully.


check_classification_targets(y)

self._issparsemultilabel = issparse(y) and self.outputs_2d_
Member:

I don't get why this should be stored

Member:

Or perhaps we should store outputs_2d_ = 'sparse'

Contributor Author:

Yes, we could store it that way. It does save the need for another variable.
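A rough sketch of the flag idea under discussion; the class and method names here are hypothetical, not scikit-learn's actual code:

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse

class SketchClassifier:
    """Illustrative fragment only, showing the outputs_2d_ = 'sparse' idea."""

    def record_output_format(self, y, outputs_2d):
        # 'sparse' stays truthy, so existing `if self.outputs_2d_:` checks
        # keep working while the attribute also records the input format.
        self.outputs_2d_ = 'sparse' if (issparse(y) and outputs_2d) else outputs_2d

clf = SketchClassifier()
clf.record_output_format(csr_matrix(np.eye(3)), True)
print(clf.outputs_2d_)  # 'sparse'
```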

y_pred = y_pred.ravel()

if self._issparsemultilabel:
y_pred = hstack(y_pred_sparse_multilabel)
Member:

do this in the if/else above

mode = np.asarray(mode.ravel(), dtype=np.intp)
y_pred[:, k] = classes_k.take(mode)

if not self.outputs_2d_:
Member:

put this in the else clause

n_samples = X.shape[0]
weights = _get_weights(neigh_dist, self.weights)

y_pred = np.empty((n_samples, n_outputs), dtype=classes_[0].dtype)
Member:

you shouldn't be constructing this dense output array. this defies the purpose of using a sparse (low-memory) structure.

@jnothman jnothman changed the title Added support for sparse multilabel y for Nearest neighbor classifiers [WIP] Added support for sparse multilabel y for Nearest neighbor classifiers Dec 29, 2016
@dalmia (Contributor Author) commented Jan 4, 2017

The nosetests run fine locally. Could you please tell me why they differ here?

@jnothman (Member) commented Jan 4, 2017

From the look of which tests are failing (I haven't checked the log, as I'm on a slow connection), could it be that your code is not Python 2-friendly?

@dalmia (Contributor Author) commented Jan 4, 2017

That shouldn't be the case, as I ran my tests on Python 2.7.

@dalmia (Contributor Author) commented Jan 7, 2017

The problem here is related to indexing into sparse matrices. But I've been unable to reproduce the error, so it's a bit hard to debug.

if not self.outputs_2d_:
y_pred = y_pred.ravel()
if weights is None:
mode, _ = stats.mode(_y[neigh_ind, k].toarray(), axis=1)
Member:

could you please confirm what neigh_ind and k are exactly so that we can debug what's not working?

Contributor Author:

neigh_ind here is the list of indices of the kneighbors and k signifies the kth column of the y passed to fit.

@jnothman (Member) commented Jan 7, 2017 via email

@dalmia (Contributor Author) commented Jan 8, 2017

Yes, that is indeed what is creating these errors. I logged the values of neigh_ind appearing in the tests and found quite a few of them to be empty.
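A toy 1-D illustration of how a radius query can yield an empty neigh_ind, which the prediction code then has to tolerate (the data here is made up):

```python
import numpy as np

# Two training points, and a query with no neighbor within the radius.
X_train = np.array([0.0, 10.0])
query, radius = 5.0, 1.0

neigh_ind = np.where(np.abs(X_train - query) <= radius)[0]
print(neigh_ind)  # [] -- an empty index array
```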

@dalmia (Contributor Author) commented Jan 8, 2017

Is there any workaround you might suggest for getting it to work?

@dalmia (Contributor Author) commented Jan 9, 2017

Added an explicit failure for scipy < 0.13. Please review.

@jnothman (Member) left a comment:

I think there should be a way to avoid so much repetition in the code but still have it elegant... I'm just too tired to suggest something specific. :|

mode, _ = weighted_mode(_y[neigh_ind, k], weights, axis=1)
if self.outputs_2d_ == 'sparse':
if StrictVersion(version.version) < StrictVersion('0.13.0'):
raise EnvironmentError('Sparse multilabel y passed in fit. '
Member:

Do this in fit.

clf.fit(X_train, y_train)

if (name == 'KNeighborsClassifier' and
StrictVersion(version.version) < StrictVersion('0.13.0')):
Member:

StrictVersion(scipy.__version__) would be clearer

@dalmia (Contributor Author) commented Jan 9, 2017

I'll try to come up with something then.

@dalmia (Contributor Author) commented Jan 9, 2017

Moved the mode calculation into a helper function, which greatly reduces the repetitive code. Please have a look.

@jnothman (Member) left a comment:

yes, making _mode a helper is clearer.


if not self.outputs_2d_:
y_pred = y_pred.ravel()
y_pred = hstack(y_pred_sparse_multilabel)
Member:

I'd rather see this as sparse.hstack; thus, from scipy import sparse.
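For illustration, stacking per-output sparse prediction columns with sparse.hstack might look like this (the three identical columns are made up):

```python
import numpy as np
from scipy import sparse

# Three hypothetical per-output prediction columns, stacked into one matrix.
cols = [sparse.csc_matrix(np.array([0, 1, 0]).reshape(-1, 1)) for _ in range(3)]
y_pred = sparse.hstack(cols, format='csc')
print(y_pred.shape)  # (3, 3)
```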

y_pred_k[inliers] = classes_k.take(mode)

y_pred[inliers, k] = classes_k.take(mode)
if outliers:
Member:

Make this if outliers and self.outlier_label != 0

pred_labels = np.array([_y[ind, k].toarray()
for ind in neigh_ind[inliers]],
dtype=object)
y_pred_k = np.zeros(n_samples)
Member:

Use an integer dtype, please

dtype=object)
y_pred_k = np.zeros(n_samples)
mode = self._mode(pred_labels, weights, inliers)
y_pred_k[inliers] = classes_k.take(mode)
Member:

I think in the multilabel case, classes should be [0, 1]... I might be wrong.


if outliers:
y_pred[outliers, :] = self.outlier_label
y_pred_sparse_multilabel.append(csc_matrix(y_pred_k).T)
Member:

this doesn't make sense in terms of efficient data structures. I think you mean csr_matrix(y_pred_k).T, or you mean csc_matrix(y_pred_k.reshape(-1, 1))
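The two constructions suggested above produce the same column matrix; a quick check (y_pred_k here is an illustrative per-output prediction vector):

```python
import numpy as np
from scipy.sparse import csc_matrix, csr_matrix

y_pred_k = np.array([0, 1, 0, 1])  # hypothetical per-output predictions

a = csr_matrix(y_pred_k).T               # transpose of a CSR row vector
b = csc_matrix(y_pred_k.reshape(-1, 1))  # CSC built directly from a column

print(a.shape, b.shape)  # (4, 1) (4, 1)
print((a != b).nnz)      # 0 -- identical values either way
```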

for k, classes_k in enumerate(classes_):
mode = self._mode(_y[neigh_ind, k].toarray(), weights)
y_pred_sparse_multilabel.append(
csc_matrix(classes_k.take(mode)).T)
Member:

I think in the multilabel case, classes should be [0, 1]... I might be wrong.

Member:

this doesn't make sense in terms of efficient data structures. I think you mean csr_matrix(mode).T, or you mean csc_matrix(mode.reshape(-1, 1))

Contributor Author:

Oh yes, I missed that.

in zip(pred_labels, weights[inliers])],
dtype=np.int)

mode = mode.ravel()
Member:

why is this necessary?

@dalmia (Contributor Author), Jan 10, 2017:

The mode returned by the operation above has shape (len(inliers), 1).
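For illustration (keepdims is passed explicitly here because recent scipy versions changed stats.mode's default; older scipy always returned a 2-D result):

```python
import numpy as np
from scipy import stats

# Hypothetical neighbor labels: one row per query sample.
neigh_labels = np.array([[0, 1, 1],
                         [1, 1, 0]])

# With keepdims=True the mode comes back with shape (n_samples, 1),
# hence the ravel() in the code under review.
mode, _ = stats.mode(neigh_labels, axis=1, keepdims=True)
print(mode.shape)  # (2, 1)
mode = mode.ravel()
print(mode)        # [1 1]
```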

@dalmia (Contributor Author) commented Feb 6, 2017

ping @jnothman

@dalmia dalmia changed the title [WIP] Added support for sparse multilabel y for Nearest neighbor classifiers [MRG] Added support for sparse multilabel y for Nearest neighbor classifiers Feb 20, 2017
@niedakh (Contributor) commented May 20, 2017

I'd like to upvote merging this; I could use more of this support, as scikit-multilearn can feed sparse matrices to scikit-learn.

@jnothman (Member) left a comment:

Some very minor changes.

I suppose I'll see if someone else wants to wrap this up.

y_pred[:, k] = classes_k.take(mode)
for k, classes_k in enumerate(classes_):
mode = self._mode(_y[neigh_ind, k].toarray(), weights)
y_pred_sparse_multilabel.append(csr_matrix(mode).T)
Member:

I think it makes more sense to construct a CSC directly from the reshaped array.

self._y = self._y.ravel()

if issparse(y) and self.outputs_2d_:
if StrictVersion(scipy.__version__) < StrictVersion('0.13.0'):
Member:

We no longer support the broken scipy

'scipy < 0.13')
self.outputs_2d_ = 'sparse'
self._y = y
self.classes_ = [np.array([0, 1], dtype=np.int)] * y.shape[1]
Member:

Perhaps we should check that the data really doesn't contain other values.
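A cheap way to sketch that check on a sparse indicator matrix: only stored entries need inspecting, so it suffices to look at y.data (the toy y here is illustrative):

```python
import numpy as np
from scipy.sparse import csr_matrix

y = csr_matrix(np.array([[0, 1], [1, 0]]))
# A sparse multilabel indicator should hold no stored value other than 0 or 1
# (zeros are usually implicit); np.isin makes the check cheap on y.data alone.
is_indicator = np.isin(y.data, [0, 1]).all()
print(is_indicator)  # True
```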


if outliers:
y_pred[outliers, :] = self.outlier_label
y_pred_sparse_multilabel.append(csr_matrix(y_pred_k).T)
Member:

same here: construct CSC from reshaped array

y_sparse = csc_matrix(y)
X_train, X_test, y_train, y_test = train_test_split(X, y_sparse,
random_state=0)
if StrictVersion(scipy.__version__) < StrictVersion('0.13.0'):
Member:

Remove now

@jnothman jnothman mentioned this pull request Jun 12, 2017
@jnothman jnothman changed the title [MRG] Added support for sparse multilabel y for Nearest neighbor classifiers [MRG+1] Added support for sparse multilabel y for Nearest neighbor classifiers Jan 16, 2018
@dorcoh (Contributor) commented Feb 17, 2018

Hi @jnothman, can I finish this? I'd be glad if you could briefly describe the changes needed.

@jnothman (Member):

Thanks Dor

It could do with those small changes I requested above, but then it will also need a second review before merge. The code changes aren't supremely readable, so any improvements to code quality would help with a second review.

@dorcoh (Contributor) commented Feb 17, 2018

Hi, I just noticed there's another PR for that issue: #9059. Should this one be closed, then?

@jnothman jnothman closed this Feb 17, 2018
@jnothman (Member) commented Feb 18, 2018 via email

Successfully merging this pull request may close these issues.

Nearest neighbors classifiers should support sparse multilabel Y

5 participants