[MRG+2] LOF algorithm (Anomaly Detection) #5279
amueller merged 18 commits into scikit-learn:master
Conversation
Thanks for the early PR. Let me know when you need a review, i.e. when you've addressed the standard things (tests, example, some basic doc).
I'd also be interested in reviewing this when you've moved past the WIP stage.
I think it is ready for a first review @agramfort @jmschrei!
sklearn/neighbors/lof.py
Why do you need a mixin here? I would put all the methods from the mixin into the LOF class and make them private.
I agree. Mixins imply that multiple estimators will be using them.
I would like to see some more extensive unit tests, particularly in cases where the algorithm should fail (wrong dimensions or other incorrect types of data passed in). I'll be able to look more at the performance of the code once you merge the mixin with the other class, and change the API to always take in an X matrix.
I'd also like to see an example of it performing against one or more current algorithms, so that it is clear it is a valuable contribution.
If you have a dataset X and want to remove outliers from it, you don't want to do fit(X).predict(X), because in predict(X), X is considered as 'new observations' and each sample is then considered in its own neighbourhood. What the user wants is fit(X).predict(), which predicts on the training samples without counting each sample as its own neighbor. It is like looking for the k-nearest neighbors of points in a dataset X: you can do fit(X).kneighbors(), which is different from fit(X).kneighbors(X). I can make predict allow taking X=None as an argument... Is that allowed?
Implementing a fit_predict(X) method is the way to go.
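The self-neighbor issue discussed above can be sketched with plain NumPy (a toy illustration, not the PR's code; the arrays and names here are assumptions for the example):

```python
import numpy as np

# Toy 1-D training set with one far-away point.
X = np.array([[0.0], [1.0], [10.0]])

# Pairwise distance matrix of X against itself.
D = np.abs(X - X.T)

# Treating X as "new observations" (what fit(X).predict(X) would do):
# every sample is its own nearest neighbor at distance 0, which masks outliers.
self_nn = D.argmin(axis=1)

# Excluding each sample from its own neighborhood (the fit(X).predict() view):
D_no_self = D + np.diag(np.full(len(X), np.inf))
true_nn = D_no_self.argmin(axis=1)
```

With self-neighbors allowed, each point trivially finds itself; once each point is excluded from its own neighborhood, the isolated point at 10 is forced to look at a genuinely distant neighbor, which is what makes its outlyingness visible.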
Ok, thanks!
I merged the mixin with the LOF class, changed the API and added a comparison example.
Done!
@amueller want to take a final look? For me it's good enough to merge.
amueller left a comment:
I think caching the LRD on the training set would be good (and it would actually make the code easier to follow). I think predict and decision_function should either both be private or neither. I kinda tend towards both, as making something public later is easier than hiding it.
The rest is mostly minor, though how to tune n_neighbors seems pretty important.
sklearn/neighbors/lof.py
Returns
-------
lof_scores : array, shape (n_samples,)
    The Local Outlier Factor of each input samples. The lower,
This seems to contradict the title of the docstring.
Yes, it is -lof_scores.
sklearn/neighbors/lof.py
    return is_inlier

def decision_function(self, X):
    """Opposite of the Local Outlier Factor of X (as bigger is better).
I think the docstring should be more explicit. Does low mean outlier, or high?
Actually, to be consistent with the other estimators, I think negative needs to be outlier.
I don't think so, for all the decision functions, bigger is better (large values correspond to inliers). For prediction, negative values (-1) correspond to outliers though. (It's true that this is a bit odd)
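A tiny sketch of that convention (the LOF values and the threshold below are hypothetical, chosen for illustration, not taken from the PR):

```python
import numpy as np

# Hypothetical raw LOF values: values well above 1 suggest outliers.
lof = np.array([1.0, 1.1, 0.9, 5.0])

# decision_function convention: bigger is better, so inliers get the
# higher (less negative) scores once the LOF is negated.
decision = -lof

# predict convention: -1 for outliers, +1 for inliers, obtained by
# thresholding the decision scores (threshold value is an assumption here).
threshold = -1.5
labels = np.where(decision < threshold, -1, 1)
```

So the scores themselves follow "bigger is better", while only the hard predictions use the negative label for outliers, which is the slightly odd but consistent split described above.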
hence more adapted to large-dimensional settings, even if it performs
quite well in the examples below.

- using the Local Outlier Factor to measure the local deviation of a given
It's kinda odd that this example lives in this folder... but whatever..
Yes, very weird! It is the folder of the first outlier detection algorithm in scikit-learn.
sklearn/neighbors/lof.py
# Avoid re-computing X_lrd if same parameters:
if not (np.all(distances_X == self._distances_fit_X_) *
        np.all(self._neighbors_indices_fit_X_ == neighbors_indices_X)):
This == raises a deprecation warning:
/home/andy/checkout/scikit-learn/sklearn/neighbors/lof.py:279: DeprecationWarning: elementwise == comparison failed; this will raise an error in the future.
np.all(self.neighbors_indices_fit_X == neighbors_indices_X)):
This means they have different sizes, I think, so I guess you should check the shape first?
Also happens for the line above.
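A shape-guarded comparison along those lines might look like this (a sketch; `same_arrays` is a hypothetical helper name, not something from the PR):

```python
import numpy as np

def same_arrays(a, b):
    """Compare two arrays safely: check shapes first, then values.

    Comparing arrays of different shapes with == triggers a
    DeprecationWarning (and an error in newer NumPy), so the shape
    check short-circuits before the elementwise comparison runs.
    """
    a, b = np.asarray(a), np.asarray(b)
    return a.shape == b.shape and bool(np.all(a == b))
```

Because `and` short-circuits, the `==` is never evaluated on mismatched shapes, so the warning cannot fire.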
The question is not, how isolated the sample is, but how isolated it is
with respect to the surrounding neighborhood.

This strategy is illustrated below.
I don't feel that the example illustrates the point that was just made about the different densities. I'm fine to leave it as-is but I don't get a good idea of the global vs local. It would be nice to also illustrate a failure mode maybe?
No global vs local anymore!
sklearn/neighbors/lof.py
# Avoid re-computing X_lrd if same parameters:
if not (np.all(distances_X == self._distances_fit_X_) *
        np.all(self._neighbors_indices_fit_X_ == neighbors_indices_X)):
    lrd = self._local_reachability_density(
It seems that lrd is "small" compared to _distances_fit_X_ and _neighbors_indices_fit_X_. Why not compute it in fit and store it once and for all? You are currently recomputing it on every call to _local_outlier_factor.
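A minimal sketch of computing the LRD once at fit time (a toy NumPy version of the idea, not the scikit-learn implementation; `fit_lrd` is a hypothetical helper and the training set is made up):

```python
import numpy as np

def fit_lrd(X, k=2):
    """Local reachability density of each training sample.

    Computing this once at fit time lets later scoring calls reuse the
    cached values instead of recomputing them on every call.
    """
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)           # exclude self-neighbors
    dist_k = np.sort(D, axis=1)[:, :k]    # distances to the k nearest
    idx = np.argsort(D, axis=1)[:, :k]    # indices of the k nearest
    k_dist = dist_k[:, -1]                # k-distance of each sample
    # Reachability distance of sample i to neighbor j:
    # max(k-distance of j, actual distance d(i, j)).
    reach = np.maximum(k_dist[idx], dist_k)
    return 1.0 / np.mean(reach, axis=1)

X_train = np.array([[0.0], [1.0], [2.0], [10.0]])
lrd = fit_lrd(X_train)  # would be stored once, e.g. as a fitted attribute
```

The isolated point gets a much lower density than the clustered ones, and nothing about the result depends on the query set, which is why storing it at fit time is safe.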
sklearn/neighbors/lof.py
Parameters
----------
distances_X : array, shape (n_query, self.n_neighbors)
    Distances to the neighbors (in the training samples self._fit_X) of
I would put backticks around `_fit_X` to be safe ;)
Do you mean replacing self._fit_X by `self._fit_X` or ``self._fit_X``, or just by `_fit_X`? I don't understand the purpose...
I meant putting backticks around `self._fit_X`: a) for nicer highlighting, b) I'm not sure sphinx will render the current version correctly because of the underscore. But I might be paranoid.
sklearn/neighbors/tests/test_lof.py
score = clf.fit(X).outlier_factor_
assert_array_equal(clf._fit_X, X)

# Assert scores are good:
Assert the smallest outlier score is greater than the largest inlier score.
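The suggested assertion could look like this (the scores below are hypothetical; in the real test they would come from the fitted estimator):

```python
import numpy as np

# Hypothetical outlier scores where the last two samples are the
# planted outliers (higher score = more abnormal in this sketch).
scores = np.array([1.0, 1.1, 0.95, 1.05, 3.2, 4.7])
n_outliers = 2
inlier_scores = scores[:-n_outliers]
outlier_scores = scores[-n_outliers:]

# Every outlier's score must exceed every inlier's score.
assert outlier_scores.min() > inlier_scores.max()
```

This checks the full ranking at the boundary rather than just the mean scores, so a single misranked sample makes the test fail.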
sklearn/neighbors/tests/test_lof.py
clf = neighbors.LocalOutlierFactor().fit(X_train)

# predict scores (the lower, the more normal)
y_pred = - clf.decision_function(X_test)
I would find it more natural to give the outliers the negative label. If you want to leave it like this, remove the space after -.
I agree, but this is to be consistent with OneClassSVM, EllipticEnvelope and IsolationForest.
sklearn/neighbors/lof.py
distance between them. This works for Scipy's metrics, but is less
efficient than passing the metric name as a string.

Distance matrices are not supported.
I don't understand this comment.
sklearn/neighbors/tests/test_lof.py
clf = neighbors.LocalOutlierFactor().fit(X_train)

# predict scores (the lower, the more normal)
y_pred = -clf.decision_function(X_test)
I meant changing y_test to be [0] * 20 + [-1] * 20 and then removing the -.
    return self

def _predict(self, X=None):
I would really like to be consistent. I don't think there's a good argument to have one but not the other. Not sure if the example is a strong enough point to make them both public.
# fit the data and tag outliers
clf.fit(X)
scores_pred = clf.decision_function(X)
if clf_name == "Local Outlier Factor":
Wait, I don't understand this. Please elaborate.
sklearn/neighbors/lof.py
Attributes
----------
outlier_factor_ : numpy array, shape (n_samples,)
    The LOF of X. The lower, the more normal.
I don't know which comment of yours refers to which comment of mine.
For the first comment: Yes, I'd either do negative_outlier_factor or inlier_score or something generic?
For the second comment: The explanation of outlier_factor_ as an attribute says "The LOF of X"; what is X? It's the training set this LocalOutlierFactor estimator was trained on, right?
Thanks :)
Hurray 🍻 Thanks @ngoix !!
Yay 🍻 !!
Thanks @ngoix !!
Merged #5279.
Woot!!
Thanks to everybody involved.
Congrats!
* LOF algorithm: add tests and example; fix DeprecationWarning by reshape(1,-1) on one-sample data; LOF with inheritance; lof and lof2 return same score; fix bugs; optimize and cosmit; rm lof2; rm MixinLOF + fit_predict; fix travis; optimize pairwise_distance like in KNeighborsMixin.kneighbors; add comparison example + doc; LOF -> LocalOutlierFactor; change LOF API: fit(X).predict() and fit(X).decision_function() do prediction on X without considering samples as their own neighbors (i.e. without considering X as a new dataset, as fit(X).predict(X) does); rm fit_predict() method; add a contamination parameter so that predict returns a binary value like other anomaly detection algos; doc + debug example correction; pass on doc + examples; pep8 + fix warnings; first attempt at fixing API issues; minor changes taking into account tguillemot's advice; remove pairwise_distance calculation as too heavy in memory; add benchmarks; deal with duplicates; fix deprecation warnings
* factorize the two for loops
* take into account @albertthomas88 review and cosmit
* fix doc
* alex review + rebase
* make predict private, add outlier_factor_ attribute and update tests
* make fit_predict take y argument
* fix benchmarks file
* update examples
* make decision_function public (rm X=None default)
* fix travis
* take into account tguillemot review + remove useless k_distance function
* fix broken links to :meth:`kneighbors`
* cosmit
* whatsnew
* amueller review + remove _local_outlier_factor method
* add n_neighbors_ attribute: the effective number of neighbors used
* make decision_function private and add negative_outlier_factor_ attribute
Local Outlier Factor implementation.
Motivated by previous discussions
http://sourceforge.net/p/scikit-learn/mailman/message/32485020
and
#4163
benchmarks of LOF on:
https://github.com/ngoix/scikit-learn/blob/AD_benchmarks/benchmarks/bench_lof_vs_iforest.py