Anomaly detection
DESIGNING MACHINE LEARNING WORKFLOWS IN PYTHON
Dr. Chris Anagnostopoulos
Honorary Associate Professor
Anomalies and outliers
[Figure panels: Supervised, Unsupervised]
Anomalies and outliers
One of the two classes is very rare
Extreme case of dataset shift
Examples:
cybersecurity
fraud detection
anti-money laundering
fault detection
Unsupervised workflows
Careful use of a handful of labels:
too few for training without overfitting
just enough for model selection
drop unbiased estimate of accuracy
How to fit an algorithm without labels?
How to estimate its performance?
Outlier: a datapoint that lies outside the range of the majority of the data
Local outlier: a datapoint that lies in an isolated region without other data
Local outlier factor (LoF)
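from sklearn.metrics import confusion_matrix
from sklearn.neighbors import LocalOutlierFactor as lof
clf = lof()
# fit_predict returns 1 for inliers and -1 for outliers
y_pred = clf.fit_predict(X)
y_pred[:4]
array([ 1, 1, 1, -1])
# Negative LOF score of each training point (lower means more anomalous)
clf.negative_outlier_factor_[:4]
array([-0.99, -1.02, -1.08, -0.97])
# Predictions are passed first here, so rows are predicted labels (-1, 1)
confusion_matrix(y_pred, ground_truth)
array([[  5,  16],
       [  0, 184]])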
Local outlier factor (LoF)
clf = lof(contamination=0.02)
y_pred = clf.fit_predict(X)
confusion_matrix(y_pred, ground_truth)
array([[  5,   0],
       [  0, 200]])
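A minimal, self-contained sketch of the effect of contamination. The synthetic data (a Gaussian cluster of 200 points plus 5 far-away points) and the random seed are illustrative assumptions, not the dataset behind the slides:

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)

# Assumed toy data: 200 inliers in one cluster, 5 outliers placed far away
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = rng.uniform(low=8.0, high=12.0, size=(5, 2))
X = np.vstack([inliers, outliers])
ground_truth = np.r_[np.ones(200, dtype=int), -np.ones(5, dtype=int)]

# contamination is the assumed fraction of outliers in X
clf = LocalOutlierFactor(contamination=5 / 205)
y_pred = clf.fit_predict(X)  # 1 = inlier, -1 = outlier

# Same argument order as the slides: rows are predictions, columns are truth
cm = confusion_matrix(y_pred, ground_truth, labels=[-1, 1])
print(cm)
```

Setting contamination close to the true anomaly rate is what removes the false alarms; in practice that rate is usually unknown and has to be estimated or tuned.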
Who needs labels anyway!
Novelty detection
One-class classification
Training data without anomalies:
Future / test data with anomalies:
Novelty LoF
Workaround:
preds = lof().fit_predict(
    np.concatenate([X_train, X_test]))
preds = preds[X_train.shape[0]:]
Novelty LoF:
clf = lof(novelty=True)
clf.fit(X_train)
y_pred = clf.predict(X_test)
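A small runnable sketch of the novelty variant, assuming made-up training data and two hand-picked test points (one typical, one far from the training cloud):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(1)

# Assumed toy data: the training set contains no anomalies
X_train = rng.normal(size=(200, 2))
# One point near the centre of the training data, one far away from it
X_test = np.array([[0.1, -0.2], [9.0, 9.0]])

clf = LocalOutlierFactor(novelty=True)
clf.fit(X_train)              # learn what "normal" looks like
y_pred = clf.predict(X_test)  # 1 = looks like training data, -1 = novelty
print(y_pred)
```

With novelty=True the model is fitted once and can then score unseen data, which the workaround cannot do without refitting on the concatenated arrays.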
One-class Support Vector Machine
from sklearn.svm import OneClassSVM
clf = OneClassSVM()
clf.fit(X_train)
y_pred = clf.predict(X_test)
y_pred[:4]
array([ 1, 1, 1, -1])
One-class Support Vector Machine
clf = OneClassSVM()
clf.fit(X_train)
y_scores = clf.score_samples(X_test)
# Flag the lowest-scoring 10% of test points as anomalies
threshold = np.quantile(y_scores, 0.1)
y_pred = y_scores <= threshold
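The quantile threshold can be tried end to end with a sketch like the following, assuming synthetic Gaussian data in place of the course dataset:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(2)

# Assumed toy data standing in for X_train and X_test
X_train = rng.normal(size=(300, 2))
X_test = rng.normal(size=(100, 2))

clf = OneClassSVM().fit(X_train)
y_scores = clf.score_samples(X_test)    # higher score = more normal

# Choose the cut-off so that the lowest-scoring 10% are flagged
threshold = np.quantile(y_scores, 0.1)
is_anomaly = y_scores <= threshold
print(is_anomaly.sum(), "of", len(X_test), "test points flagged")
```

Thresholding the scores directly decouples the alert rate from the detector's built-in cut-off, so the same fitted model can serve different alert budgets.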
Isolation Forests
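from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
# Isolation forest: higher score_samples values mean more normal
clf = IsolationForest()
clf.fit(X_train)
y_scores = clf.score_samples(X_test)
# LoF in novelty mode exposes the same scoring interface
clf = LocalOutlierFactor(novelty=True)
clf.fit(X_train)
y_scores = clf.score_samples(X_test)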
clf_lof = LocalOutlierFactor(novelty=True).fit(X_train)
clf_isf = IsolationForest().fit(X_train)
clf_svm = OneClassSVM().fit(X_train)
roc_auc_score(y_test, clf_lof.score_samples(X_test))
0.9897
roc_auc_score(y_test, clf_isf.score_samples(X_test))
0.9692
roc_auc_score(y_test, clf_svm.score_samples(X_test))
0.9948
clf_lof = LocalOutlierFactor(novelty=True).fit(X_train)
clf_isf = IsolationForest().fit(X_train)
clf_svm = OneClassSVM().fit(X_train)
accuracy_score(y_test, clf_lof.predict(X_test))
0.9318
accuracy_score(y_test, clf_isf.predict(X_test))
0.9545
accuracy_score(y_test, clf_svm.predict(X_test))
0.5
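One plausible reading of the gap between the AUC and accuracy numbers is that AUC only uses the ranking of the scores, while accuracy depends on each detector's built-in threshold. The sketch below illustrates this under assumed synthetic data, with labels encoded as 1 (normal) and -1 (anomaly) to match predict():

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(3)

# Assumed toy data: clean training set, test set with 10 obvious anomalies
X_train = rng.normal(size=(300, 2))
X_test = np.vstack([rng.normal(size=(90, 2)),
                    rng.uniform(6, 8, size=(10, 2))])
y_test = np.r_[np.ones(90, dtype=int), -np.ones(10, dtype=int)]

clf = OneClassSVM().fit(X_train)  # default nu=0.5

# Threshold-free: how well do the scores rank normal points above anomalies?
auc = roc_auc_score(y_test, clf.score_samples(X_test))
# Threshold-dependent: uses the detector's own cut-off
acc = accuracy_score(y_test, clf.predict(X_test))
print(round(auc, 3), round(acc, 3))
```

With the default nu, the one-class SVM rejects a large share of normal points, so accuracy can be poor even when the ranking is nearly perfect.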
What's new?
Distance-based learning
Distance and similarity
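# Recent scikit-learn releases expose DistanceMetric from sklearn.metrics (formerly sklearn.neighbors)
from sklearn.metrics import DistanceMetric as dm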
dist = dm.get_metric('euclidean')
X = [[0,1], [2,3], [0,6]]
dist.pairwise(X)
array([[0. , 2.82842712, 5. ],
[2.82842712, 0. , 3.60555128],
[5. , 3.60555128, 0. ]])
X = np.array(X)  # np.matrix is discouraged; a plain array gives the same result here
np.sqrt(np.sum(np.square(X[0,:] - X[1,:])))
2.82842712
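As a cross-check, the full Euclidean distance matrix can be reproduced with NumPy broadcasting alone. This is a sketch for illustration, not how scikit-learn computes it internally:

```python
import numpy as np

X = np.array([[0, 1], [2, 3], [0, 6]], dtype=float)

# diff[i, j, :] holds the coordinate-wise difference between rows i and j
diff = X[:, None, :] - X[None, :, :]

# Euclidean distance: square root of the summed squared differences
D = np.sqrt((diff ** 2).sum(axis=-1))
print(np.round(D, 8))
```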
Non-Euclidean Local Outlier Factor
clf = LocalOutlierFactor(
novelty=True, metric='chebyshev')
clf.fit(X_train)
y_pred = clf.predict(X_test)
dist = dm.get_metric('chebyshev')
X = [[0,1], [2,3], [0,6]]
dist.pairwise(X)
array([[0., 2., 5.],
[2., 0., 3.],
[5., 3., 0.]])
Are all metrics similar?
Hamming distance matrix:
dist = dm.get_metric('hamming')
X = [[0,1], [2,3], [0,6]]
dist.pairwise(X)
array([[0. , 1. , 0.5],
[1. , 0. , 1. ],
[0.5, 1. , 0. ]])
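The two non-Euclidean metrics used so far have simple definitions that can be written out directly. A minimal NumPy sketch: Chebyshev is the largest coordinate-wise gap, and Hamming is the fraction of coordinates that differ.

```python
import numpy as np

X = np.array([[0, 1], [2, 3], [0, 6]])

# diff[i, j, :] holds the coordinate-wise difference between rows i and j
diff = X[:, None, :] - X[None, :, :]

chebyshev = np.abs(diff).max(axis=-1)  # largest absolute difference
hamming = (diff != 0).mean(axis=-1)    # share of coordinates that differ

print(chebyshev)
print(hamming)
```

Hamming ignores how far apart two values are and only asks whether they match, which is why it suits categorical features.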
Are all metrics similar?
from scipy.spatial.distance import pdist
X = [[0,1], [2,3], [0,6]]
pdist(X, 'cityblock')
array([4., 5., 5.])
from scipy.spatial.distance import squareform
squareform(pdist(X, 'cityblock'))
array([[0., 4., 5.],
       [4., 0., 5.],
       [5., 5., 0.]])
A real-world example
The Hepatitis dataset:
   Class   AGE  SEX  STEROID  ...
0    2.0  40.0  0.0      0.0  ...
1    2.0  30.0  0.0      0.0  ...
2    1.0  47.0  0.0      1.0  ...
1 https://archive.ics.uci.edu/ml/datasets/Hepatitis
A real-world example
Euclidean distance:
squareform(pdist(X_hep, 'euclidean'))
[[  0.  127.   64.1]
 [127.    0.  128.2]
 [ 64.1 128.2   0. ]]
1 nearest to 3: wrong class
Hamming distance:
squareform(pdist(X_hep, 'hamming'))
[[0.  0.5 0.7]
 [0.5 0.  0.6]
 [0.7 0.6 0. ]]
1 nearest to 2: right class
A bigger toolbox
Unstructured data
Structured versus unstructured
   Class   AGE  SEX  STEROID  ...
0    2.0  50.0  2.0      1.0  ...
1    2.0  40.0  1.0      1.0  ...
...
label sequence
0 VIRUS AVTVVPDPTCCGTLSFKVPKDAKKGKHLGTFDIRQAIMDYGGLHSQ...
1 IMMUNE SYSTEM QVQLQQPGAELVKPGASVKLSCKASGYTFTSYWMHWVKQRPGRGLE...
2 IMMUNE SYSTEM QAVVTQESALTTSPGETVTLTCRSSTGAVTTSNYANWVQEKPDHLF...
3 VIRUS MSQVTEQSVRFQTALASIKLIQASAVLDLTEDDFDFLTSNKVWIAT...
...
Can we build a detector that flags viruses as anomalous in this data?
import stringdist
stringdist.levenshtein('abc', 'acc')
stringdist.levenshtein('acc', 'cce')
label sequence
169 IMMUNE SYSTEM ILSALVGIV
170 IMMUNE SYSTEM ILSALVGIL
stringdist.levenshtein('ILSALVGIV', 'ILSALVGIL')
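To make the Levenshtein (edit) distance concrete, here is a pure-Python sketch of the standard dynamic-programming recurrence. It is an illustrative stand-in, not the stringdist implementation:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b."""
    # prev[j] = distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


print(levenshtein('abc', 'acc'))              # 1: substitute b -> c
print(levenshtein('acc', 'cce'))              # 2: drop a, append e
print(levenshtein('ILSALVGIV', 'ILSALVGIL'))  # 1: last residue differs
```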
Some debugging
# This won't work
pdist(proteins['sequence'].iloc[:3], metric=stringdist.levenshtein)
Traceback (most recent call last):
ValueError: A 2-dimensional array must be passed.
Some debugging
sequences = np.array(proteins['sequence'].iloc[:3]).reshape(-1,1)
# This won't work for a different reason
pdist(sequences, metric=stringdist.levenshtein)
Traceback (most recent call last):
TypeError: argument 1 must be str, not numpy.ndarray
Some debugging
# This one works!!
def my_levenshtein(x, y):
    # pdist passes each row as a length-1 array, so unwrap the string first
    return stringdist.levenshtein(x[0], y[0])
pdist(sequences, metric=my_levenshtein)
array([136., 2., 136.])
Protein outliers with precomputed matrices
# This takes 2 minutes for about 1000 examples
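# squareform expands pdist's condensed vector into the square matrix that metric='precomputed' expects
M = squareform(pdist(sequences, my_levenshtein))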
LoF detector with a precomputed distance matrix:
# This takes 3 seconds
detector = lof(metric='precomputed', contamination=0.1)
preds = detector.fit_predict(M)
roc_auc_score(proteins['label'] == 'VIRUS', preds == -1)
0.64
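The precomputed-matrix workflow can be tried on a toy scale with the sketch below. The sequences are assumptions for illustration: eight hypothetical single-substitution variants of one short peptide plus a truncated prefix of one of the longer sequences shown earlier. A compact pure-Python edit distance stands in for stringdist.levenshtein.

```python
from itertools import combinations

import numpy as np
from scipy.spatial.distance import squareform
from sklearn.neighbors import LocalOutlierFactor


def levenshtein(a, b):
    # Compact edit distance (stand-in for stringdist.levenshtein)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


# Hypothetical toy data: similar short peptides, then one unrelated sequence
sequences = ['ILSALVGIV', 'ILSALVGIL', 'ILSALVGAV', 'ILSALVGIA',
             'ILSAIVGIV', 'ILSALVGLV', 'ALSALVGIV', 'ILTALVGIV',
             'MSQVTEQSVRFQTALASIK']

# Condensed distances in the same pair order that pdist would produce
condensed = np.array([levenshtein(a, b)
                      for a, b in combinations(sequences, 2)], dtype=float)
M = squareform(condensed)  # square, symmetric, zero diagonal

detector = LocalOutlierFactor(n_neighbors=3, metric='precomputed',
                              contamination=0.1)
preds = detector.fit_predict(M)
print(preds)
```

Computing the matrix once and reusing it is what keeps the detector fast: the expensive string comparisons are paid a single time, and different detector settings can then be tried cheaply.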
Pick your distance
Concluding remarks
Refresher of supervised learning pipelines:
feature engineering
model fitting
model selection
Risks of overfitting
Data fusion
Noisy labels and heuristics
Loss functions
costs of false positives vs costs of false negatives
Concluding remarks
Unsupervised learning:
anomaly detection
novelty detection
distance metrics
unstructured data
Real-world use cases:
cybersecurity
healthcare
retail banking
Congratulations!