[MRG+2] Outlier detection algorithms API consistency #9015
sklearn/neighbors/lof.py
Outdated
```python
self.threshold_ = -scoreatpercentile(
    -self.negative_outlier_factor_, 100. * (1. - self.contamination))
if self.contamination is None:
    self.threshold_ = -1.1  # inliers score around 1.
```
1.5 is used in the paper in the experiments section, but it does not work on small datasets (the added tests break). I can put 1.5 and remove the added test in the case where contamination is None.
waiting for #9018 to be merged before dealing with the `EllipticEnvelope` API
sklearn/ensemble/iforest.py
Outdated
```python
        return -scores

    def score_samples(self, X):
        """Opposite of the anomaly score define in the original paper.
```
defined
sklearn/ensemble/iforest.py
Outdated
```python
Returns
-------
scores : array of shape (n_samples,)
```
array, shape (n_samples,)
```python
assert_greater(np.min(decision_func1[-2:]), np.max(decision_func1[:-2]))
assert_greater(np.min(decision_func2[-2:]), np.max(decision_func2[:-2]))
assert_array_equal(pred1, 6 * [1] + 2 * [-1])
assert_array_equal(pred2, 6 * [1] + 2 * [-1])
```
use a for loop with contamination to avoid the dupes
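A possible deduplicated form of the assertions above, sketched with plain numpy asserts; the arrays here are hypothetical stand-ins for the outputs of the two fitted estimators in the real test (six inliers followed by two outliers):

```python
import numpy as np

# Hypothetical stand-ins for the two estimators' outputs in the real test.
decision_func1 = np.array([0.10, 0.20, 0.15, 0.12, 0.18, 0.11, 0.90, 0.95])
decision_func2 = np.array([0.05, 0.08, 0.07, 0.06, 0.09, 0.04, 0.80, 0.85])
pred1 = np.array(6 * [1] + 2 * [-1])
pred2 = np.array(6 * [1] + 2 * [-1])

# One loop replaces the four duplicated assertions.
for decision_func, pred in [(decision_func1, pred1), (decision_func2, pred2)]:
    assert np.min(decision_func[-2:]) > np.max(decision_func[:-2])
    np.testing.assert_array_equal(pred, 6 * [1] + 2 * [-1])
```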
sklearn/neighbors/lof.py
Outdated
```python
self.threshold_ = -scoreatpercentile(
    -self.negative_outlier_factor_, 100. * (1. - self.contamination))
if self.contamination is None:
    self.offset_ = 1.5  # inliers score around 1.
```
did you document the offset_ attribute in the docstrings?
I forgot, as `threshold_` wasn't documented; I will do this.
```python
        else:  # we're in training
            return scores

    def _score_samples(self, X):
```
why do you need this private function?
We want a `score_samples` method for every anomaly detection algorithm (returning the "raw" decision function as defined in the original papers).
Here it has to be private for the same reason that `decision_function` and `predict` are private.
Besides what is written in the docstring, I think it would be good to add a comment in the code saying that `predict` is private because `fit_predict(X)` would differ from `fit(X).predict(X)`, and that `decision_function` and `score_samples` are private because `predict` is private. I think this is the real reason behind making these methods private; otherwise we could have extended the original paper approach (which can be done using the private methods). If someone dives into the code and wants to use the private methods, they should be aware of that, which is not mentioned in the current version of the code.
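To illustrate the `fit_predict(X)` vs `fit(X).predict(X)` distinction with a toy nearest-neighbor score (a simplified stand-in, not LOF itself): on training data each sample must be excluded from its own neighborhood, which a naive `predict` applied to the training set would not do.

```python
import numpy as np

# Toy score: distance to the nearest neighbor. On training data the
# sample itself must be excluded, otherwise every score is 0 -- which is
# why fit(X).predict(X) would disagree with fit_predict(X) if predict
# naively treated X as new data.
X = np.array([[0.], [1.], [10.]])

def nn_dist(X_train, X_query, exclude_self=False):
    d = np.abs(X_query[:, None, 0] - X_train[None, :, 0])
    if exclude_self:
        np.fill_diagonal(d, np.inf)
    return d.min(axis=1)

train_scores = nn_dist(X, X, exclude_self=True)   # what fit_predict uses
naive_scores = nn_dist(X, X, exclude_self=False)  # what a naive predict would use
```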
We also need a `score_samples` for the `OneClassSVM`. A suggestion:

For `IsolationForest`, `LocalOutlierFactor` and `EllipticEnvelope`, all the work to compute the score of normality is currently done in `decision_function`. `decision_function` is then called in `score_samples` and `offset_` has to be used in both methods.

IMO it would maybe be better to do the opposite: do all the work to compute the score in `score_samples` and define `decision_function` as `score_samples(X) - offset_`. I find it more natural (and easier to understand) to compute the score first and threshold this score to obtain `decision_function` and `predict`, instead of computing `decision_function` and un-thresholding `decision_function` to obtain the score. This would also involve the following changes:

- For `IsolationForest` you can call `score_samples` instead of `decision_function` in `fit` and compute `offset_` from `score_samples(X)`. `decision_function` would then be `score_samples(X) - offset_`. And I think that you don't need the following if statement anymore in `decision_function`:

  ```python
  if hasattr(self, 'offset_'):  # means that we're in testing
      return -scores + self.offset_
  else:  # we're in training
      return -scores
  ```

- For `LocalOutlierFactor`, `_decision_function` would be `_score_samples(X) - offset_` and, as in the current version, no need to call `score_samples` nor `decision_function` in `fit`.
- For `EllipticEnvelope`, `decision_function` would be `score_samples(X) - offset_` and, as in the current version, no need to call `score_samples` nor `decision_function` in `fit`.
- For the `OneClassSVM` we might have to stick with the current solution for `score_samples`, i.e., `score_samples(X) = decision_function(X) + offset_`, as `decision_function` uses the LIBSVM interface?
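The proposed layering can be sketched with a toy estimator (hypothetical code, not part of scikit-learn): all the scoring work lives in `score_samples`, and `decision_function` / `predict` are thin wrappers around it.

```python
import numpy as np

class ToyOutlierDetector:
    """Minimal sketch of the proposed layering (hypothetical class)."""

    def fit(self, X, contamination=0.1):
        scores = self.score_samples(X)
        # offset_ makes decision_function negative for roughly a
        # `contamination` fraction of the training samples.
        self.offset_ = np.percentile(scores, 100. * contamination)
        return self

    def score_samples(self, X):
        # Raw normality score; here simply the negative distance to the
        # origin, standing in for a paper-defined score.
        return -np.linalg.norm(np.asarray(X, dtype=float), axis=1)

    def decision_function(self, X):
        # Thresholded score: positive = inlier, negative = outlier.
        return self.score_samples(X) - self.offset_

    def predict(self, X):
        return np.where(self.decision_function(X) >= 0, 1, -1)

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2), [[6., 6.]]])  # 20 inliers + 1 far point
clf = ToyOutlierDetector().fit(X)
```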
sklearn/neighbors/tests/test_lof.py
Outdated
```python
assert_equal(clf.n_neighbors_, X.shape[0] - 1)


def test__score_samples():
```
test_score_samples
we test `_score_samples`, not `score_samples`, so I'm not sure of the convention...
I would have removed the duplicated _ as well, but it's not a big deal.
sklearn/neighbors/lof.py
Outdated
```diff
     Additional keyword arguments for the metric function.

-    contamination : float in (0., 0.5), optional (default=0.1)
+    contamination : float in (0., 0.5), optional (default=None)
```
None --> 'auto'
sklearn/neighbors/lof.py
Outdated
```diff
     The amount of contamination of the data set, i.e. the proportion
     of outliers in the data set. When fitting this is used to define the
-    threshold on the decision function.
+    threshold on the decision function. If None, the decision function
```
None --> 'auto'
sklearn/neighbors/lof.py
Outdated
```diff
     def __init__(self, n_neighbors=20, algorithm='auto', leaf_size=30,
                  metric='minkowski', p=2, metric_params=None,
-                 contamination=0.1, n_jobs=1):
+                 contamination=None, n_jobs=1):
```
None --> 'auto'
sklearn/neighbors/lof.py
Outdated
```diff
     """
-    if not (0. < self.contamination <= .5):
-        raise ValueError("contamination must be in (0, 0.5]")
+    if self.contamination is not None:
```
needs to be adapted to 'auto' instead of None
sklearn/neighbors/lof.py
Outdated
```python
self.threshold_ = -scoreatpercentile(
    -self.negative_outlier_factor_, 100. * (1. - self.contamination))
if self.contamination is None:
```
self.contamination == 'auto'
sklearn/neighbors/lof.py
Outdated
```python
if hasattr(self, 'offset_'):  # means that we're in testing
    return scores + self.offset_  # to make value 0 special
else:  # we're in training
```
I think this is useless here as _decision_function does not seem to be called during fit (there is a check_is_fitted in _decision_function).
you're right, thanks!
```python
is_inlier : array, shape (n_samples,)
    For each observation, tells whether or not (+1 or -1) it should
    be considered as an inlier according to the fitted model.
"""
```
add a check_is_fitted
done
sklearn/ensemble/iforest.py
Outdated
```python
# abnormal) and subtract self.offset_ to make 0 be the threshold
# value for being an outlier.

if hasattr(self, 'offset_'):  # means that we're in testing
```
See my general review feedback about this if-else statement.
@ngoix why is this still WIP?

It's not, I'm making the change.
albertcthomas
left a comment
A few comments. We also need to make sure that removing the `raw_values` parameter in `EllipticEnvelope`, which will require a deprecation cycle, is OK. Otherwise LGTM.
```diff
-    def decision_function(self, X, raw_values=False):
+    def decision_function(self, X):
         """Compute the decision function of the given observations.
```
if we remove the raw_values we need a deprecation warning
```diff
-    else:
-        transformed_mahal_dist = mahal_dist ** 0.33
-        decision = self.threshold_ ** 0.33 - transformed_mahal_dist
+    negative_mahal_dist = self.score_samples(X)
```
we might need to double check the examples using EllipticEnvelope but if I remember correctly they should be ok
covariance/plot_outlier_detection.py and applications/plot_outlier_detection_housing.py seem ok.
```diff
-    is_inlier[values <= self.threshold_] = 1
+    values = self.decision_function(X)
+    is_inlier[values > 0] = 1
     else:
```
do we really need this? the contamination parameter is 0.1 by default
```python
        return -self.mahalanobis(X)

    def predict(self, X):
        """Outlyingness of observations in X according to the fitted model.
```
I would also change this docstring because "outlyingness" could mean that predict returns scores whereas it returns labels.
```python
assert_raises(NotFittedError, clf.decision_function, X)
clf.fit(X)
y_pred = clf.predict(X)
decision = clf.score_samples(X)
```
scores =
```python
offset_ : float
    Offset used to define the decision function from the raw scores.
    We have the relation: decision_function = score_samples - offset_.
```
in the code it is: score_samples + offset_
corrected
sklearn/neighbors/lof.py
Outdated
```python
offset_ : float
    Offset used to define the decision function from the raw scores.
    We have the relation: decision_function = score_samples - offset_.
```
in the code it is: score_samples + offset_
corrected
sklearn/neighbors/tests/test_lof.py
Outdated
```python
assert_array_equal(clf1._score_samples([[2., 2.]]),
                   clf1._decision_function([[2., 2.]]) - clf1.offset_)
assert_array_equal(clf2._score_samples([[2., 2.]]),
                   clf2._decision_function([[2., 2.]]) - clf2.offset_)
```
maybe assert equality of clf1.score_samples and clf2.score_samples
sklearn/svm/classes.py
Outdated
```python
We have the relation: decision_function = score_samples - offset_.
The offset is equal to intercept_ and is provided for consistency
with other outlier detection algorithms such as LocalOutlierFactor,
IsolationForest and EllipticEnvelope.
```
I would remove such as LocalOutlierFactor, IsolationForest and EllipticEnvelope. Otherwise if a new outlier detection estimator is added to scikit-learn we would need to update the list.
sklearn/svm/classes.py
Outdated
```python
        return dec

    def score_samples(self, X):
        """Raw decision function of the samples.
```
I'm not very pleased with "Raw decision function". Maybe "Shifted decision function" or "scoring function"?
@ngoix as soon as you take into account my review, this PR will be ready for final reviews.
thanks @albertcthomas for the review

Ping @agramfort for another one?
jnothman
left a comment
A partial review.
Could you please add an entry to what's new and be clear on any decision_function or predict behaviour that has changed.
```diff
     X : array-like, shape (n_samples, n_features)

-    raw_values : bool
+    raw_values : bool, optional (default=None)
```
I don't think "(default=None)" means anything. Remove it.
```diff
-        return decision
+        # raw_values deprecation:
+        if raw_values is not None:
+            warnings.warn("raw_values parameter is deprecated in 0.20 and will"
```
This is not tested, apparently.
test added
```python
        return negative_mahal_dist - self.offset_

    def score_samples(self, X):
        """Compute the Mahalanobis distances.
```
say negative here, please
```diff
     Returns -1 for anomalies/outliers and +1 for inliers.
     """
-    check_is_fitted(self, 'threshold_')
+    check_is_fitted(self, 'offset_')
```
This doesn't need repeating if done in decision_function.
Should this definition of predict be provided by a mixin?
@jnothman do you mean a mixin for outlier detection estimators? I was thinking that all outlier detection estimators should have a fit_predict like clustering estimators but they can't all have a predict unless we exclude LocalOutlierFactor which has a private predict.
Fair enough. We can make the presence of predict conditional on the presence of decision_function. It's a bit ugly, but possible with a descriptor/property...
I propose to create such a mixin in a separate PR, as this one is already a bit heavy
I will open a mixin PR once this PR is merged.
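The descriptor/property idea discussed above could look like this (a hypothetical sketch, not scikit-learn code; similar in spirit to what `sklearn.utils.metaestimators.available_if` later provided): `predict` is only reachable on instances that also define `decision_function`.

```python
import numpy as np

class if_has_decision_function:
    """Descriptor (hypothetical): the wrapped method only exists on
    instances that also define `decision_function`."""

    def __init__(self, fn):
        self.fn = fn

    def __get__(self, obj, owner=None):
        if obj is None:
            return self
        if not hasattr(obj, "decision_function"):
            raise AttributeError(
                "predict is only available when decision_function is defined")
        return lambda X: self.fn(obj, X)


class OutlierMixin:
    @if_has_decision_function
    def predict(self, X):
        # +1 for inliers, -1 for outliers, thresholding at 0.
        return np.where(self.decision_function(X) >= 0, 1, -1)


class WithDecision(OutlierMixin):
    def decision_function(self, X):
        return -np.asarray(X, dtype=float).sum(axis=1)  # toy score


class NoDecision(OutlierMixin):
    pass
```

With this, `hasattr(NoDecision(), "predict")` is False, while `WithDecision().predict` works as usual.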
sklearn/neighbors/lof.py
Outdated
```diff
-        raise ValueError("contamination must be in (0, 0.5]")
+        if self.contamination != "auto":
+            if not (0. < self.contamination <= .5):
+                raise ValueError("contamination must be in (0, 0.5]")
```
Not tested
test added
Please add the actual value of `self.contamination` in the error message:

```python
ValueError("contamination must be in (0, 0.5], "
           "got: %f" % self.contamination)
```
Thanks @jnothman

@ngoix for the what's new, put it all on one line; it will render as a tiny paragraph if the line is long. Don't itemize it.
Put it all in one paragraph. We can restructure later.
Resolves #8693
Resolves #8707

Following up the discussion in issue #8693:

- Make the `IsolationForest` and `LocalOutlierFactor` decision functions use `contamination` as `EllipticEnvelope` does. It is used to define an offset such that the threshold on the decision function for being an outlier is 0 (positive value = inlier, negative value = outlier).
- Add a `score_samples` method to OCSVM, iForest, LOF and EllipticEnvelope, enabling the user to access the raw score functions from the original papers. A new `offset_` parameter links this new method to `decision_function` through `decision_function = score_samples - offset_`.
- Clean `EllipticEnvelope`, fix docs, deprecate the `raw_values` parameter.
- Document why the LOF `decision_function` is private (mentioned in [MRG] Add decision_function api to LocalOutlierFactor #8707).
- Ravel the decision function of OCSVM so that it is indeed of shape (n_samples,).
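A usage sketch of the resulting API (assuming a scikit-learn version that includes this PR; exact score values are estimator-dependent):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(100, 2), [[8., 8.]]])  # 100 inliers + 1 outlier

clf = IsolationForest(random_state=0).fit(X)

scores = clf.score_samples(X)        # raw scores, as in the original paper
decision = clf.decision_function(X)  # shifted scores: negative = outlier
pred = clf.predict(X)                # +1 for inliers, -1 for outliers

# The relation introduced by this PR:
assert np.allclose(decision, scores - clf.offset_)
```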