
Conversation


@massich massich commented Jun 8, 2017

Reference Issue

Fixes #8057

What does this implement/fix? Explain your changes.

Added support for sparse multilabel y for nearest neighbor classifiers. First, fit checks whether its input is sparse and multilabel, and converts it to a dense array for storage. It also stores a flag indicating whether the original input was sparse and multilabel. Then, in predict, if this stored flag is true, y_pred is converted to sparse CSC.
Also added tests for the same.

Any other comments?

This PR wraps up @dalmia's work in #8096

if issparse(y) and self.outputs_2d_:
    self.outputs_2d_ = 'sparse'
    self._y = y
    self.classes_ = [np.array([0, 1], dtype=np.int)] * y.shape[1]
Contributor Author

@jnothman were you asking for something like self.classes_ = set(y)?

Member

I just mean that, even if (for no particularly good reason) we only support y being sparse when it is multilabel (not general multioutput) someone might get a surprise by this assumption that it's all binary.

You can't just do set(y). If y is CSC or CSR, checking that np.all(np.union1d(y.data, [0, 1]) == [0, 1]) holds, and raising an error if it fails, might help someone unsuspecting.
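A minimal sketch of the suggested check, assuming a hypothetical helper name check_binary_sparse_y and an illustrative error message (not scikit-learn's actual text):

```python
import numpy as np
from scipy.sparse import csr_matrix


def check_binary_sparse_y(y):
    # Hypothetical helper: a CSR/CSC matrix only stores non-zero entries,
    # so y.data holds every value that could differ from 0.
    values = np.union1d(y.data, [0, 1])
    if not np.array_equal(values, [0, 1]):
        raise ValueError("Sparse multilabel y must contain only 0 and 1")


y_ok = csr_matrix(np.array([[0, 1, 1], [1, 0, 0]]))
check_binary_sparse_y(y_ok)  # passes: all stored values are 1
```

Anything other than 0/1 in y.data (say a label of 2) would enlarge the union and trigger the error.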

Member

But it may not be necessary. If you fix it, it needs a test.

in zip(pred_labels, weights[inliers])],
dtype=np.int)

mode = mode.ravel()
Contributor Author

@jnothman is mode = mode.ravel() right?

Member

yes, it's fine.

Member

Now it's not clear to me why it's necessary, i.e. why mode is not already 1d

Contributor Author

ravel() changes the shape of mode from (_, 1) to (_,). Removing it breaks lots of tests. Shall we remove it and fix the tests?

Member

I would be careful here. We should document that the input should always be 2D (to use axis=0 from the mode function), and ravel will then always return a 1D array.
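For illustration, the 2D-input convention could look like this (keepdims=True, available in scipy >= 1.9, makes the (1, n_samples) intermediate shape explicit; the data is made up):

```python
import numpy as np
from scipy import stats

# Labels of the k neighbours for each query point, one row per sample.
neigh_labels = np.array([[0, 1, 1],
                         [2, 2, 0]])

# With a 2-D input and axis=0, stats.mode returns one mode per column of
# the transposed array, i.e. one per sample, keeping a singleton axis.
mode, _ = stats.mode(neigh_labels.T, axis=0, keepdims=True)
mode = mode.ravel()  # (1, n_samples) -> (n_samples,)
```

The final ravel() is what normalizes the result to the 1D shape the callers expect.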

for k, classes_k in enumerate(classes_):
    mode = self._mode(_y[neigh_ind, k].toarray(), weights)
    y_pred_sparse_multilabel.append(csr_matrix(mode).T)
    y_pred_sparse_multilabel.append(csc_matrix(mode).T)
Member

This isn't right. You don't want to transpose the CSC. You want to build a CSC on the reshaped mode. Transposing on a numpy array is very cheap. Transposing a scipy.sparse matrix is more expensive (except for COO format).
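This can be seen directly: transposing a scipy CSC matrix hands back a CSR result, while transposing the dense mode array first is cheap and lets the CSC be built in its intended format (the mode row here is illustrative):

```python
import numpy as np
from scipy.sparse import csc_matrix

# Hypothetical (1, n_samples) mode row, as produced by the mode computation.
mode = np.array([[0, 1, 1, 0]])

a = csc_matrix(mode).T   # transposing the sparse matrix yields a CSR result
b = csc_matrix(mode.T)   # transpose the cheap ndarray, then build the CSC

print(a.format, b.format)  # csr csc
```

Both hold the same values; only the storage format differs.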

Member

I'd hope that tests don't pass...

Contributor Author

Cool: y_pred_sparse_multilabel.append(csc_matrix(mode.T))

Which test in sklearn/cluster/tests/ should break when using csc_matrix(mode).T? I ran nosetests sklearn/cluster/ and all tests pass. Shall we add a regression test somewhere?


for name in CLASSIFIERS:
    yield check_classifier_sparse_multilabel_y, name

def test_sparse_multilabel_y():
Contributor Author

@jnothman is this the test you meant?

Member

Remind me where I mentioned a test

Contributor Author

@jnothman
Member

jnothman commented Jun 8, 2017 via email

@jnothman jnothman mentioned this pull request Jun 12, 2017
@massich massich changed the title [WIP] Added support for sparse multilabel y for Nearest neighbor classifiers [MRG] Added support for sparse multilabel y for Nearest neighbor classifiers Jun 14, 2017

clf = neighbors.KNeighborsClassifier(n_neighbors=3)
assert_raises_regex(ValueError,
                    "Sparse y is only supported for multilabel")
Member

You need to call the function here, i.e.

assert_raises_regex(ValueError,
                    "Sparse y is only supported for multilabel",
                    clf.fit, csr_matrix(X), y)

def test_sparse_multilabel_y():
    rng = check_random_state(0)
    n_features = 2
    n_samples = 100
Member

It doesn't need to be this many.

    n_output = 3

    X = rng.rand(n_samples, n_features)
    y = rng.randint(0, 5, (n_samples, n_output))
Member

It doesn't need to be random. If it is random, we'd rather have a fixed random_state; on occasion this could generate a binary array.

Contributor Author

I thought that this was taken care of by rng = check_random_state(0). I moved the line closer for clarity.

Member

Indeed it was, thanks.

@jnothman jnothman added this to the 0.19 milestone Jun 14, 2017
Member

@glemaitre glemaitre left a comment

without checking the tests for the moment


- Add support for sparse multilabel ``y`` in :class:`NeighborsBase`
  :issue:`8057` by :user:`Aman Dalmia <dalmia>`, :user:`Joan Massich <massich>`.
- Added :class:`naive_bayes.ComplementNB`, which implements the Complement
Member

Add a blank line, and you probably have to move it to 0.20.

Contributor Author

it is under 0.20

check_classification_targets(y)
try:
    check_classification_targets(y)
except ValueError as e:
Member

Probably add a comment to explain that this is a specific case

if not self.outputs_2d_:
    y_pred = y_pred.ravel()

# Old versions of scipy hstack returns COO formatted matrix
y_pred = sparse.hstack(y_pred_sparse_multilabel).tocsc()
Member

sparse.hstack takes a format parameter:

sparse.hstack(y_pred_sparse_multilabel, format='csc')

Member

Note that this is for readability; the performance will be the same.
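A small self-contained sketch of the format= spelling, on made-up single-column predictions:

```python
import numpy as np
from scipy import sparse

# Three hypothetical single-column prediction matrices to stack side by side.
cols = [sparse.csc_matrix(np.array([[0], [1]])) for _ in range(3)]

# Equivalent to sparse.hstack(cols).tocsc(), but states the target format
# directly instead of converting the intermediate COO result.
stacked = sparse.hstack(cols, format='csc')

print(stacked.format, stacked.shape)  # csc (2, 3)
```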

if not self.outputs_2d_:
    y_pred = y_pred.ravel()

# Old versions of scipy hstack returns COO formatted matrix
y_pred = sparse.hstack(y_pred_sparse_multilabel).tocsc()
Member

use format


mode = mode.ravel()
for k, classes_k in enumerate(classes_):
    pred_labels = np.array([_y[ind, k].toarray()
Member

Why is this converted to object? If we have a sparse matrix, we have numerical values for sure. Then we don't even have to construct a dense matrix; we can just pass .data from the sparse matrix.

Member

In this case we also need to slice the weights.

Member

@jnothman jnothman Sep 14, 2017

So this is extracting a single column for all of the neighbors of each query. We are making it dense here, at a cost of (n_samples * avg_neighbors) memory. The sparse representation is to avoid a dense matrix of (n_samples * n_classes) where n_classes is presumed to be large. Yes, we can avoid making this dense, but I don't think it's especially problematic as long as avg_neighbors << n_classes.

We should be careful about constructing arrays of arrays/lists though. When the lengths of the arrays are all the same, this becomes a 2d array whose elements are ints, but still of dtype object due to this constructor. The safe way to do such construction is with np.empty(n, dtype='O') and then fill the array. But I don't see why pred_labels should be an array here at all.

Now that I've clarified that to myself, I'll go see what massich#5 has to say.
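The object-array pitfall described above can be shown in a few lines (illustrative data):

```python
import numpy as np

# When every row has the same length, np.array(..., dtype=object) silently
# produces a 2-D object array rather than a 1-D array of lists:
same = np.array([[1, 2], [3, 4]], dtype=object)

# The safe construction: allocate a 1-D object array, then fill it,
# which works whether or not the rows happen to have equal lengths.
ragged = np.empty(2, dtype=object)
ragged[0] = [1, 2, 3]
ragged[1] = [4]

print(same.shape, ragged.shape)  # (2, 2) (2,)
```

So code that assumes a 1-D object array breaks exactly in the equal-lengths corner case unless it uses the np.empty construction.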

Contributor Author

@massich massich Sep 15, 2017

Conclusion? We materialize the matrix here, and later we can modify the mode computation to propagate the sparse matrix if needed.
If so, we could use mode = 1 if (np.product(x.shape) - x.nnz) < x.nnz else 0

And regarding dtype=object, shall we put an assert that breaks when all the elements in pred_labels have the same length, just to flag this unlikely corner case? And open an issue to fix it at some point?
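The proposed shortcut can be sketched as a small function (hypothetical name binary_mode; np.prod stands in for the deprecated np.product, and it assumes the column is strictly binary):

```python
import numpy as np
from scipy.sparse import csr_matrix


def binary_mode(x):
    # For a binary sparse column the mode is 1 iff the stored (non-zero)
    # entries outnumber the implicit zeros.
    n_zeros = np.prod(x.shape) - x.nnz
    return 1 if n_zeros < x.nnz else 0


col = csr_matrix(np.array([[1], [1], [0]]))
print(binary_mode(col))  # 1: two ones vs one zero
```

This avoids densifying the column entirely, at the cost of only working for 0/1 data.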

Member

We should be careful about constructing arrays of arrays/lists though.

That is correct, so in the unlikely corner case that each sample has the same number of neighbours, the current code is probably broken.

But I don't see why pred_labels should be an array here at all.

The reason is that further down, we use fancy numpy indexing on this array (in pred_labels[inliers]), which does not work for a plain list. That can of course be converted to a list comprehension, but it might be a performance bottleneck (not sure this is the case for an object array).

Member

Actually, zip(pred_labels[inliers], weights[inliers]) works fine as well if it is a 2D array instead of a 1D object array, so I think we can leave this as is.

else:
    y_pred = np.empty((n_samples, n_outputs), dtype=classes_[0].dtype)
    for k, classes_k in enumerate(classes_):
        pred_labels = np.array([_y[ind, k] for ind in neigh_ind],
Member

I have the same concern as earlier regarding the sparse matrix.

Contributor Author

this one is not sparse.


def _mode(self, pred_labels, weights, inliers):
    if weights is None:
        mode = np.array([stats.mode(pl)[0]
Member

Why use a list comprehension instead of the axis option of stats.mode or weighted_mode?
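The two spellings give the same result; a sketch on illustrative neighbour labels (one row per query point):

```python
import numpy as np
from scipy import stats

# Illustrative neighbour labels, one row per query point.
pred_labels = np.array([[1, 1, 2],
                        [3, 3, 3],
                        [2, 4, 4]])

# Per-row mode via a list comprehension, as in the code under review:
by_rows = np.array([stats.mode(row)[0] for row in pred_labels]).ravel()

# The same result in a single vectorized call using the axis option:
vectorized = np.asarray(stats.mode(pred_labels, axis=1)[0]).ravel()

print(by_rows.tolist())  # [1, 3, 4]
```

The ravel() calls normalize the shapes, which differ across scipy versions depending on the keepdims default.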


    return y_pred

def _mode(self, pred_labels, weights, inliers):
Member

Move the function above where it is used.

Member

I would not pass inliers, because we take inliers from weights but not from pred_labels; it seems inconsistent.

We should have either pred_labels[inliers], weights[inliers] or pred_inliers, weight_inliers passed as arguments. I am more in favor of the second because it will be easier with sparse matrices.

@massich massich mentioned this pull request Sep 14, 2017

massich commented Sep 14, 2017

When addressing this comment by @glemaitre, I stumbled into some problems. I don't really know how to post a concrete question, so I collected my thoughts here as a PR to this PR, and I would love to get some feedback. Thx.

cc: @jnothman, @lesteve

@massich massich force-pushed the 8057 branch 2 times, most recently from 2192eb7 to c4a47e3 Compare September 15, 2017 13:19
The initial idea with @ogrisel was to change the ValueError's message to
something more descriptive like this:

```py
  raise ValueError("Unknown classification label type. Got: %r" % y)
```

The main problem is that such message breaks this:

```sh
  $ pytest sklearn/tests/test_common.py::test_non_meta_estimators
```
" supported). Got: %r" % y)
else:
raise
raise ValueError("Unknown label type: %r" % y)
Contributor Author


The initial idea with @ogrisel was to change the ValueError's message to something more descriptive like this:

  raise ValueError("Unknown classification label type. Got: %r" % y)

The main problem is that such message breaks this:

  $pytest sklearn/tests/test_common.py::test_non_meta_estimators

If needed I'll change the message in a separate PR, so that no common files are changed here.

@@ -4,6 +4,5758 @@
===============
Release History
===============

Version 0.20 (under development)
Member

You want to add your entry to v0.20.0.rst (or something like this) rather than change whats_new.rst.

@glemaitre
Member

@massich Could you solve the conflict and address the remaining comments?


massich commented Mar 15, 2018

@dorcoh feel free to take over this one! let me know if you need anything

mode, _ = weighted_mode(neigh, weights, axis=1)

mode = np.asarray(mode.ravel(), dtype=np.intp)
return mode
Member


Style:

return np.asarray(mode.ravel(), dtype=np.intp)
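The snippet above can be run on made-up data to see the weighted voting at work (labels and weights are illustrative):

```python
import numpy as np
from sklearn.utils.extmath import weighted_mode

# Illustrative neighbour labels and weights, one row per query point.
neigh = np.array([[0, 1, 1],
                  [2, 0, 2]])
weights = np.array([[0.3, 0.4, 0.4],
                    [0.5, 2.0, 0.5]])

# Row 0: label 1 has total weight 0.8 vs 0.3 for label 0 -> mode is 1.
# Row 1: label 0 has total weight 2.0 vs 1.0 for label 2 -> mode is 0.
mode, _ = weighted_mode(neigh, weights, axis=1)
mode = np.asarray(mode.ravel(), dtype=np.intp)
print(mode.tolist())  # [1, 0]
```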


Successfully merging this pull request may close these issues.

Nearest neighbors classifiers should support sparse multilabel Y

10 participants