Designing Machine Learning Workflows in Python Chapter4

The document discusses anomaly detection techniques for both structured and unstructured data. It covers common supervised and unsupervised anomaly detection algorithms like Local Outlier Factor (LOF) and One-Class SVM. It also discusses how different distance metrics like Euclidean, Hamming, and Levenshtein distance can impact anomaly detection performance on structured data. Finally, it provides an example of how string distance metrics could be used for anomaly detection on unstructured sequence data.

Anomaly detection

DESIGNING MACHINE LEARNING WORKFLOWS IN PYTHON

Dr. Chris Anagnostopoulos

Honorary Associate Professor

Anomalies and outliers
[Figure: two panels, "Supervised" and "Unsupervised"]

Anomalies and outliers
One of the two classes is very rare

Extreme case of dataset shift

Examples:
cybersecurity

fraud detection

anti-money laundering

fault detection


Unsupervised workflows
Careful use of a handful of labels:

too few for training without overfitting

just enough for model selection

drop unbiased estimate of accuracy

How to fit an algorithm without labels?

How to estimate its performance?

Outlier: a datapoint that lies outside the range of the majority of the data

Local outlier: a datapoint that lies in an isolated region without other data


Local outlier factor (LoF)
from sklearn.neighbors import LocalOutlierFactor as lof
clf = lof()
y_pred = clf.fit_predict(X)
y_pred[:4]

array([ 1, 1, 1, -1])

clf.negative_outlier_factor_[:4]

array([-0.99, -1.02, -1.08, -0.97])

from sklearn.metrics import confusion_matrix
# rows: predicted label (-1, 1); columns: ground truth (-1, 1)
confusion_matrix(y_pred, ground_truth)

array([[  5,  16],
       [  0, 184]])
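The LoF slide relies on an `X` and `ground_truth` that are not shown. The following is a minimal sketch on synthetic data (the data, the random seed and the cluster positions are assumptions for illustration, not the course dataset) showing the same calls end to end.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): 200 inliers around the origin
# and 5 points planted far away from them.
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
planted = rng.uniform(low=8.0, high=10.0, size=(5, 2))
X = np.vstack([inliers, planted])
ground_truth = np.r_[np.ones(200, dtype=int), -np.ones(5, dtype=int)]

clf = LocalOutlierFactor()             # unsupervised: no labels are passed
y_pred = clf.fit_predict(X)            # +1 = inlier, -1 = outlier
scores = clf.negative_outlier_factor_  # more negative = more anomalous

# Here rows are ground truth and columns are predictions (sklearn's usual order).
print(confusion_matrix(ground_truth, y_pred, labels=[-1, 1]))
```

Scores close to -1 indicate a point whose local density matches that of its neighbours; the planted points receive much more negative scores.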

Local outlier factor (LoF)
clf = lof(contamination=0.02)
y_pred = clf.fit_predict(X)

confusion_matrix(
y_pred, ground_truth)

array([[ 5, 0],
[ 0, 200]])
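The `contamination` argument tells the detector what share of the data to flag. A small sketch on synthetic data (the data and the three contamination values are assumptions for illustration) makes the relationship visible.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 2))   # synthetic data, assumed for illustration

flagged = {}
for contamination in (0.01, 0.05, 0.10):
    clf = LocalOutlierFactor(contamination=contamination)
    y_pred = clf.fit_predict(X)
    # Count how many points were labelled as outliers (-1).
    flagged[contamination] = int((y_pred == -1).sum())

print(flagged)
```

The number of flagged points tracks `contamination * len(X)`, so this parameter encodes a prior belief about how rare anomalies are rather than something learned from the data.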


Who needs labels anyway!

Novelty detection

One-class classification
[Figure: two panels, "Training data without anomalies" and "Future / test data with anomalies"]

Novelty LoF
Workaround:

import numpy as np
preds = lof().fit_predict(
    np.concatenate([X_train, X_test]))
preds = preds[X_train.shape[0]:]

Novelty LoF:

clf = lof(novelty=True)
clf.fit(X_train)
y_pred = clf.predict(X_test)
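A runnable sketch of the novelty setting, assuming synthetic training data without anomalies and a test set with three planted novelties (all data here is made up for illustration). With `novelty=True` the estimator is fitted on the training data only and then scores unseen points with `predict`; `fit_predict` is not available in this mode.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)

# Clean training data, then test data with three planted novelties at the end.
X_train = rng.normal(size=(300, 2))
X_normal = rng.normal(size=(20, 2))
X_novel = 6.0 + rng.normal(scale=0.1, size=(3, 2))
X_test = np.vstack([X_normal, X_novel])

clf = LocalOutlierFactor(novelty=True)
clf.fit(X_train)                 # learns the density of the normal class only
y_pred = clf.predict(X_test)     # +1 = looks like training data, -1 = novelty

print(y_pred)
```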

One-class Support Vector Machine
from sklearn.svm import OneClassSVM
clf = OneClassSVM()

clf.fit(X_train)
y_pred = clf.predict(X_test)

y_pred[:4]

array([ 1, 1, 1, -1])

One-class Support Vector Machine
clf = OneClassSVM()
clf.fit(X_train)
y_scores = clf.score_samples(X_test)

threshold = np.quantile(y_scores, 0.1)

y_pred = y_scores <= threshold
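The thresholding step above can be tried on synthetic data. This is a sketch under assumed data (45 normal test points and 5 planted anomalies), not the course dataset; it shows that cutting the scores at their 10% quantile flags the lowest-scoring tenth of the test set.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(2)
X_train = rng.normal(size=(300, 2))
X_test = np.vstack([
    rng.normal(size=(45, 2)),                      # normal test points
    rng.normal(loc=8.0, scale=0.5, size=(5, 2)),   # planted anomalies
])

clf = OneClassSVM().fit(X_train)
y_scores = clf.score_samples(X_test)       # lower score = more anomalous

threshold = np.quantile(y_scores, 0.1)     # cut-off at the lowest 10% of scores
y_pred = y_scores <= threshold             # True = flagged as anomalous

print(int(y_pred.sum()), "of", len(y_pred), "test points flagged")
```

Choosing the quantile plays the same role as `contamination` in LoF: it fixes the alert rate instead of relying on the estimator's default decision boundary.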


Isolation Forests
from sklearn.ensemble import IsolationForest
clf = IsolationForest()
clf.fit(X_train)
y_scores = clf.score_samples(X_test)

from sklearn.neighbors import LocalOutlierFactor
clf = LocalOutlierFactor(novelty=True)
clf.fit(X_train)
y_scores = clf.score_samples(X_test)

from sklearn.metrics import roc_auc_score
clf_lof = LocalOutlierFactor(novelty=True).fit(X_train)
clf_isf = IsolationForest().fit(X_train)
clf_svm = OneClassSVM().fit(X_train)

roc_auc_score(y_test, clf_lof.score_samples(X_test))

0.9897

roc_auc_score(y_test, clf_isf.score_samples(X_test))

0.9692

roc_auc_score(y_test, clf_svm.score_samples(X_test))

0.9948
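The comparison above can be reproduced in spirit on synthetic data. The sketch below assumes made-up data and a label encoding of +1 for normal and -1 for anomalous test points, so that higher `score_samples` values should go with the +1 class; the AUC values it prints are not the ones from the slides.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(3)
X_train = rng.normal(size=(300, 2))
X_test = np.vstack([rng.normal(size=(90, 2)),
                    rng.normal(loc=5.0, size=(10, 2))])
y_test = np.r_[np.ones(90), -np.ones(10)]   # +1 normal, -1 anomaly (assumed encoding)

detectors = {
    "lof": LocalOutlierFactor(novelty=True),
    "isf": IsolationForest(random_state=0),
    "svm": OneClassSVM(),
}

# All three estimators expose score_samples, where higher means "more normal".
aucs = {name: roc_auc_score(y_test, det.fit(X_train).score_samples(X_test))
        for name, det in detectors.items()}
print(aucs)
```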

from sklearn.metrics import accuracy_score
clf_lof = LocalOutlierFactor(novelty=True).fit(X_train)
clf_isf = IsolationForest().fit(X_train)
clf_svm = OneClassSVM().fit(X_train)

accuracy_score(y_test, clf_lof.predict(X_test))

0.9318

accuracy_score(y_test, clf_isf.predict(X_test))

0.9545

accuracy_score(y_test, clf_svm.predict(X_test))

0.5
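A detector can rank points almost perfectly (high AUC) and still have poor accuracy, because accuracy depends on where the decision threshold sits. The sketch below uses hypothetical labels and scores (not taken from the slides) to illustrate one plausible reading of that gap.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical labels and scores: +1 = normal, -1 = anomaly, higher score = more normal.
y_test = np.array([1, 1, 1, 1, -1, -1, -1, -1])
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])

auc = roc_auc_score(y_test, scores)          # depends only on the ranking

# A badly placed threshold labels every point as an anomaly...
bad_pred = np.where(scores > 0.95, 1, -1)
# ...while a threshold between the two groups separates them.
good_pred = np.where(scores > 0.5, 1, -1)

acc_bad = accuracy_score(y_test, bad_pred)
acc_good = accuracy_score(y_test, good_pred)
print(auc, acc_bad, acc_good)   # 1.0 0.5 1.0
```

This is why the earlier slides threshold `score_samples` by a chosen quantile instead of trusting the default output of `predict`.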

What's new?

Distance-based learning

Distance and similarity
# Note: recent scikit-learn releases expose DistanceMetric from sklearn.metrics instead
from sklearn.neighbors import DistanceMetric as dm
dist = dm.get_metric('euclidean')

X = [[0,1], [2,3], [0,6]]

dist.pairwise(X)

array([[0.        , 2.82842712, 5.        ],
       [2.82842712, 0.        , 3.60555128],
       [5.        , 3.60555128, 0.        ]])

import numpy as np
X = np.array(X)
np.sqrt(np.sum(np.square(X[0,:] - X[1,:])))

2.82842712

Non-Euclidean Local Outlier Factor
clf = LocalOutlierFactor(
    novelty=True, metric='chebyshev')
clf.fit(X_train)
y_pred = clf.predict(X_test)

dist = dm.get_metric('chebyshev')
X = [[0,1], [2,3], [0,6]]
dist.pairwise(X)

array([[0., 2., 5.],
       [2., 0., 3.],
       [5., 3., 0.]])

Are all metrics similar?
Hamming distance matrix:

dist = dm.get_metric('hamming')
X = [[0,1], [2,3], [0,6]]
dist.pairwise(X)

array([[0. , 1. , 0.5],
[1. , 0. , 1. ],
[0.5, 1. , 0. ]])


Are all metrics similar?
from scipy.spatial.distance import pdist
X = [[0,1], [2,3], [0,6]]
pdist(X, 'cityblock')

array([4., 5., 5.])

from scipy.spatial.distance import squareform
squareform(pdist(X, 'cityblock'))

array([[0., 4., 5.],
       [4., 0., 5.],
       [5., 5., 0.]])
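To see why the four metrics disagree, it helps to compute them by hand for the first two rows of the example matrix. This is a plain NumPy sketch; the variable names are illustrative.

```python
import numpy as np

X = np.array([[0, 1], [2, 3], [0, 6]])
a, b = X[0], X[1]
diff = np.abs(a - b)                      # coordinate-wise gaps: [2, 2]

euclidean = np.sqrt(np.sum(diff ** 2))    # root of summed squares
chebyshev = np.max(diff)                  # largest single-coordinate gap
cityblock = np.sum(diff)                  # sum of coordinate gaps
hamming = np.mean(a != b)                 # share of coordinates that differ

print(euclidean, chebyshev, cityblock, hamming)
```

The results match the first off-diagonal entries of the matrices on the slides: about 2.83, 2, 4 and 1.0 respectively.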

A real-world example
The Hepatitis dataset:

   Class   AGE  SEX  STEROID  ...
0    2.0  40.0  0.0      0.0  ...
1    2.0  30.0  0.0      0.0  ...
2    1.0  47.0  0.0      1.0  ...

1 https://archive.ics.uci.edu/ml/datasets/Hepatitis

A real-world example
Euclidean distance:

squareform(pdist(X_hep, 'euclidean'))

[[  0.  127.   64.1]
 [127.    0.  128.2]
 [ 64.1 128.2   0. ]]

1 nearest to 3: wrong class

Hamming distance:

squareform(pdist(X_hep, 'hamming'))

[[0.  0.5 0.7]
 [0.5 0.  0.6]
 [0.7 0.6 0. ]]

1 nearest to 2: right class
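The nearest neighbour can change with the metric because Euclidean distance is dominated by wide-range columns, while Hamming distance only counts how many columns differ. The sketch below uses three made-up rows (not the Hepatitis data) with one wide-range column and three binary flags.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Made-up rows for illustration: [wide-range value, flag, flag, flag]
X_toy = np.array([
    [40.0, 0.0, 0.0, 1.0],
    [41.0, 1.0, 1.0, 0.0],   # close in the first column, all flags differ
    [70.0, 0.0, 0.0, 1.0],   # far in the first column, identical flags
])

D_euc = squareform(pdist(X_toy, "euclidean"))
D_ham = squareform(pdist(X_toy, "hamming"))

# Index of the nearest other row to row 0 under each metric
# (position 0 of argsort is row 0 itself, at distance zero).
nearest_euc = int(np.argsort(D_euc[0])[1])
nearest_ham = int(np.argsort(D_ham[0])[1])
print(nearest_euc, nearest_ham)   # 1 2
```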

A bigger toolbox

Unstructured data

Structured versus unstructured
   Class   AGE  SEX  STEROID  ...
0    2.0  50.0  2.0      1.0  ...
1    2.0  40.0  1.0      1.0  ...
...

   label          sequence
0  VIRUS          AVTVVPDPTCCGTLSFKVPKDAKKGKHLGTFDIRQAIMDYGGLHSQ...
1  IMMUNE SYSTEM  QVQLQQPGAELVKPGASVKLSCKASGYTFTSYWMHWVKQRPGRGLE...
2  IMMUNE SYSTEM  QAVVTQESALTTSPGETVTLTCRSSTGAVTTSNYANWVQEKPDHLF...
3  VIRUS          MSQVTEQSVRFQTALASIKLIQASAVLDLTEDDFDFLTSNKVWIAT...
...

Can we build a detector that flags viruses as anomalous in this data?

import stringdist
stringdist.levenshtein('abc', 'acc')

1

stringdist.levenshtein('acc', 'cce')

2

     label          sequence
169  IMMUNE SYSTEM  ILSALVGIV
170  IMMUNE SYSTEM  ILSALVGIL

stringdist.levenshtein('ILSALVGIV', 'ILSALVGIL')

1
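For readers who want to see what the Levenshtein distance computes, here is a minimal pure-Python sketch of the standard dynamic-programming recurrence. It is an illustrative stand-in, not the `stringdist` implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes and substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if characters match
            ))
        prev = curr
    return prev[-1]


print(levenshtein("abc", "acc"))              # 1
print(levenshtein("acc", "cce"))              # 2
print(levenshtein("ILSALVGIV", "ILSALVGIL"))  # 1
```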

Some debugging
# This won't work
pdist(proteins['sequence'].iloc[:3], metric=stringdist.levenshtein)

Traceback (most recent call last):

ValueError: A 2-dimensional array must be passed.


Some debugging
sequences = np.array(proteins['sequence'].iloc[:3]).reshape(-1,1)

# This won't work for a different reason
pdist(sequences, metric=stringdist.levenshtein)

Traceback (most recent call last):

TypeError: argument 1 must be str, not numpy.ndarray

Some debugging
# This one works!!
def my_levenshtein(x, y):
    return stringdist.levenshtein(x[0], y[0])

pdist(sequences, metric=my_levenshtein)

array([136., 2., 136.])

Protein outliers with precomputed matrices
# This takes 2 minutes for about 1000 examples
# squareform turns pdist's condensed vector into the square matrix that metric='precomputed' expects
M = squareform(pdist(sequences, my_levenshtein))

LoF detector with a precomputed distance matrix:

# This takes 3 seconds
detector = lof(metric='precomputed', contamination=0.1)
preds = detector.fit_predict(M)

roc_auc_score(proteins['label'] == 'VIRUS', preds == -1)

0.64
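The precomputed-matrix workflow can be sketched without the protein dataset. The example below assumes a handful of made-up strings and a pure-Python edit distance in place of `stringdist`; `n_neighbors=3` and `contamination=0.15` are choices made for this tiny sample, not values from the slides.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor


def levenshtein(a: str, b: str) -> int:
    """Pure-Python edit distance, standing in for stringdist.levenshtein."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


# Made-up sequences: six near-identical short strings and one very different one.
seqs = ["ILSALVGIV", "ILSALVGIL", "ILSALVGLV", "ILTALVGIV",
        "ILSALVGIA", "ALSALVGIV", "MSQVTEQSVRFQTALASIKLIQ"]

# Build the full square, symmetric distance matrix explicitly.
n = len(seqs)
M = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        M[i, j] = M[j, i] = levenshtein(seqs[i], seqs[j])

detector = LocalOutlierFactor(n_neighbors=3, metric="precomputed", contamination=0.15)
preds = detector.fit_predict(M)   # expects an (n_samples, n_samples) matrix
print(preds)
```

Computing the matrix once and reusing it is what makes the detector itself fast: the expensive string comparisons happen outside the estimator.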

Pick your distance

Concluding remarks
Refresher of supervised learning pipelines:
feature engineering

model fitting

model selection

Risks of overfitting

Data fusion

Noisy labels and heuristics

Loss functions
costs of false positives vs costs of false negatives

Concluding remarks
Unsupervised learning:
anomaly detection

novelty detection

distance metrics

unstructured data

Real-world use cases:

cybersecurity

healthcare

retail banking


Congratulations!