MACHINE LEARNING
Presenter: Dr. Amit Kumar Das
Professor,
Dept. of Computer Science and Engg.,
Institute of Engineering & Management.
WHAT IS LEARNING?
TYPES OF HUMAN LEARNING
Learning through direct guidance from an
expert – is just one form …
Learning through indirect guidance
Learning by self
WHAT IS MACHINE LEARNING?
TYPES OF MACHINE LEARNING
Supervised learning – also called
predictive learning
Unsupervised learning – also called
descriptive learning
Reinforcement learning
MACHINE LEARNING PROCESS
What was the most difficult subject in the last
semester?
What if, you had a list of all possible questions
with answers, and a photographic memory?
MACHINE LEARNING PROCESS
Data Input – Past data or information is
utilized as a basis for future decision-making
Abstraction – The input data is represented
in a broader way through the underlying
algorithm
Generalization – The abstracted
representation is generalized to form a
framework for making decisions
TYPICAL ML PROBLEMS
Prediction of results of a game
Predicting whether a tumor is malignant or
benign
Price prediction in domains like real estate,
stocks, etc.
Demand forecasting in retail
Customer segmentation
Self-driving cars
PROBLEMS NOT TO BE CONSIDERED FOR ML
Bank interest calculation
Inventory management (except the demand
forecast module)
Customer on-boarding (except risk prediction
module)
Tasks in which humans are very effective or
frequent human intervention is needed. For
example, air traffic control
TYPES OF DATA
Qualitative data (Categorical)
Student Name, Blood group, Grade, etc.
Quantitative data (Numerical)
Temperature, Age, Weight, etc.
DATA EXPLORATION
Understand the central tendency –
Mean
Median
Mode
Understand data spread
Standard Deviation
Understand data value position
DATA EXPLORATION – CENTRAL TENDENCY
Mean vs. Median for Auto MPG
DATA EXPLORATION – DATA SPREAD
Consider the data values of two attributes
Attribute 1 values – 44, 46, 48, 45 and 47
Attribute 2 values – 34, 46, 59, 39 and 52
Both sets of values have a mean and
median of 46.
The first set of values is more concentrated, or
clustered, around the mean / median value
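The comparison above can be checked with a short sketch using Python's standard `statistics` module. The attribute values are copied from the slide; treating them as a full population (so `pstdev` rather than `stdev`) is an assumption made here for illustration.

```python
import statistics

# Attribute values copied from the slide.
attribute_1 = [44, 46, 48, 45, 47]
attribute_2 = [34, 46, 59, 39, 52]

for name, values in (("Attribute 1", attribute_1), ("Attribute 2", attribute_2)):
    mean = statistics.mean(values)
    median = statistics.median(values)
    # pstdev treats the five values as the whole population (an assumption);
    # statistics.stdev would give the sample standard deviation instead.
    spread = statistics.pstdev(values)
    print(f"{name}: mean={mean}, median={median}, std dev={spread:.2f}")
```

Both attributes report the same mean and median, while the standard deviation of Attribute 2 is several times larger, which is the spread the slide describes.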
DATA EXPLORATION – DATA VALUE POSITION
Any data set attribute can be summarized by five values
Minimum
First quartile (Q1)
Median (Q2)
Third quartile (Q3), and
Maximum
[Figure: number line marking Minimum, Q1, Median (Q2), Q3 and Maximum]
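As a rough sketch, the five values can be computed with the standard library. The function name `five_number_summary` and the sample values are hypothetical, and the "inclusive" quartile method is only one of several conventions, so other tools may report slightly different Q1 and Q3.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    # method="inclusive" treats the data as the full population; this is an
    # assumption, and other quartile conventions give slightly different cuts.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

# Hypothetical attribute values used only for illustration.
values = [34, 46, 59, 39, 52]
summary = five_number_summary(values)
print(summary)
```

These five values are exactly what a box plot draws: the box spans Q1 to Q3, with a line at the median.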
DATA EXPLORATION – BOX PLOT
DATA QUALITY
The most common data quality issues are:
Missing values
Outliers
Missing values of attribute “horsepower” in Auto MPG
REMEDIATE DATA ISSUES
Remove missing values / outliers – If the
number of affected records is small, remove them.
Imputation – Impute the missing value with the mean,
median or mode
Capping – For values that lie outside the
1.5 × IQR limits, cap them by replacing
observations below the lower limit with the value of the
5th percentile and those above the upper
limit with the value of the 95th percentile
Estimate missing values – Assign attribute
values of similar data points in place of the
missing value
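The imputation and capping remedies above might look like the following minimal sketch. The function names and the horsepower-like readings are hypothetical, `None` is assumed to mark a missing value, and the percentile method is one convention among several.

```python
import statistics

def impute_with_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]

def cap_outliers(values):
    """Cap values outside the 1.5 x IQR limits at the 5th / 95th percentile."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    # 99 cut points: index 4 is the 5th percentile, index 94 the 95th.
    percentiles = statistics.quantiles(values, n=100, method="inclusive")
    p5, p95 = percentiles[4], percentiles[94]
    capped = []
    for v in values:
        if v < lower_limit:
            capped.append(p5)
        elif v > upper_limit:
            capped.append(p95)
        else:
            capped.append(v)
    return capped

# Hypothetical readings; None marks a missing value, 400 is an outlier.
raw = [95, 100, None, 105, 110, 98, 102, 400]
filled = impute_with_median(raw)
cleaned = cap_outliers(filled)
print(filled)
print(cleaned)
```

With a sample this small the 95th percentile is itself pulled upward by the outlier, so capping is more effective on larger data sets.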
ISSUES IN MACHINE LEARNING
Relatively new and evolving technology
In different countries, rules and regulations,
cultural background and emotional maturity of
people are drastically different
Biggest fears – potential breach of privacy,
discriminatory behaviour and resulting
discontent
WHAT IS MODELLING IN CONTEXT OF
MACHINE LEARNING?
WHAT ARE THE DIFFERENT ML
ALGORITHMS?
Supervised
Classification – KNN, Naive Bayes, Decision Tree, etc.
Regression – Simple Linear Regression, Logistic
Regression (despite its name, usually applied to classification)
Unsupervised
Clustering – K-Means
Market Basket Analysis
SUPERVISED LEARNING - CLASSIFICATION
[Figure: Labelled Training Data, Classifier, Classification Model, Test Data]
SUPERVISED LEARNING - REGRESSION
y = α + βx
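One common way to estimate α and β is ordinary least squares. The sketch below assumes that method (the slide does not name one), and the function name and data points are hypothetical.

```python
def fit_simple_linear_regression(xs, ys):
    """Estimate alpha (intercept) and beta (slope) of y = alpha + beta * x
    by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    alpha = mean_y - beta * mean_x
    return alpha, beta

# Hypothetical points that lie exactly on y = 1 + 2x.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
alpha, beta = fit_simple_linear_regression(xs, ys)
print(alpha, beta)
```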
UNSUPERVISED LEARNING
Unlabelled Data
Unsupervised Learning Model
Grouped data / Clusters
UNSUPERVISED LEARNING - CLUSTERING
[Figure: data points grouped into four clusters, Cluster 1 to Cluster 4]
UNSUPERVISED LEARNING – MARKET BASKET
ANALYSIS
SELECTING A MODEL
Predictive models (supervised)
Predict the value of a category or class
Problems that can be solved: Prediction of win/loss,
fraudulent transactions, etc.
Examples : k-Nearest Neighbor (kNN), Naïve Bayes,
Decision Tree, etc.
Predict numerical values of the target
Problems that can be solved: Prediction of revenue
growth, rainfall amount, etc.
Examples: Linear Regression, etc. (Logistic Regression, despite its name, is normally used to predict a class)
SELECTING A MODEL
Descriptive models (unsupervised)
Group together similar data instances
Problems that can be solved: Customer grouping or
segmentation based on social, demographic, ethnic,
etc. factors
Most popular model for clustering is k-Means
TRAIN A MODEL – HOLDOUT METHOD
Input Data
Training Data (70% - 80%) → Trained Model
Test Data (20% - 30%) → Model Performance
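The holdout split can be sketched in a few lines. The function name `holdout_split`, the 75% training fraction (chosen from within the 70% - 80% range above) and the fixed seed are all assumptions for illustration.

```python
import random

def holdout_split(records, train_fraction=0.75, seed=42):
    """Shuffle the records and split them into training and test partitions."""
    shuffled = list(records)
    # A seeded generator makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical input data: 100 record identifiers.
records = list(range(100))
train, test = holdout_split(records)
print(len(train), len(test))
```

The model is fitted on `train` only; `test` is held back so that the measured performance reflects data the model has not seen.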
K-FOLD CROSS-VALIDATION– OVERALL APPROACH
K-FOLD CROSS-VALIDATION– DETAILED APPROACH
K-FOLD CROSS-VALIDATION (CONTD.)
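The slide figures are not reproduced here, so the following is a sketch of the usual formulation of k-fold cross-validation rather than the exact procedure shown: the data is divided into k folds, and each fold serves once as the test set while the remaining folds form the training set. The function name `k_fold_indices` is hypothetical, and shuffling before splitting is omitted for brevity.

```python
def k_fold_indices(n_records, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_records))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_records // k + (1 if i < n_records % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, 5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Model performance is then typically averaged over the k test folds.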
BOOTSTRAP SAMPLING / BOOTSTRAPPING
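Bootstrap sampling draws records at random with replacement, so the same record can appear more than once in a sample. The sketch below assumes the common convention of drawing as many items as the original data set contains and using the never-drawn ("out-of-bag") records for testing; the function name is hypothetical and records are assumed to be hashable.

```python
import random

def bootstrap_sample(records, seed=0):
    """Draw len(records) items with replacement; also return the
    out-of-bag records that were never drawn."""
    rng = random.Random(seed)
    sample = rng.choices(records, k=len(records))
    chosen = set(sample)
    out_of_bag = [r for r in records if r not in chosen]
    return sample, out_of_bag

# Hypothetical input data: 20 record identifiers.
records = list(range(20))
sample, out_of_bag = bootstrap_sample(records)
print(sample)
print(out_of_bag)
```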
TRAIN A MODEL – UNDER VS. OVER FIT
Under fit Balanced fit Over fit
TRAIN A MODEL – BIAS VS. VARIANCE
EVALUATING A MODEL - CLASSIFICATION
                 Actual Win             Actual Loss
Predicted Win    True Positive (TP)     False Positive (FP)
Predicted Loss   False Negative (FN)    True Negative (TN)
True Positive (TP) – Predicted win, Actual win
True Negative (TN) – Predicted loss, Actual loss
False Positive (FP) – Predicted win, Actual loss
False Negative (FN) – Predicted loss, Actual win
For both TP and TN, predicted outcome matches actual
outcome. Hence, they are correct classifications.
EVALUATING A MODEL – CLASSIFICATION (CONTD.)
                 Actual Win   Actual Loss
Predicted Win    85 (TP)      4 (FP)
Predicted Loss   2 (FN)       9 (TN)
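The percentage of misclassifications is indicated by the error rate, which is
measured as:
Error rate = (FP + FN) / (TP + FP + FN + TN)
In context of the above confusion matrix,
Error rate = (4 + 2) / (85 + 4 + 2 + 9) = 6 / 100 = 6%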
EVALUATING A MODEL – CLASSIFICATION (CONTD.)
Kappa = (P(a) - P(pr)) / (1 - P(pr))
where P(a) = proportion of observed agreement between actual
and predicted in overall data set = (TP + TN) / (TP + FP + FN + TN)
P(pr) = proportion of expected agreement between actual and predicted data both in case
of class of interest as well as the other classes =
((TP + FP) / N) × ((TP + FN) / N) + ((FN + TN) / N) × ((FP + TN) / N), with N = TP + FP + FN + TN
Note: Kappa value can be 1 at the maximum, which represents perfect agreement between model’s prediction and actual values.
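As a worked sketch, the Kappa value can be computed from the four cells of a two-class confusion matrix. The function below uses the standard definition of Cohen's Kappa, and the counts are those of the win/loss example in this deck (TP = 85, FP = 4, FN = 2, TN = 9); the function name is hypothetical.

```python
def cohen_kappa(tp, fp, fn, tn):
    """Cohen's Kappa for a two-class confusion matrix."""
    total = tp + fp + fn + tn
    # Observed agreement: predictions that match the actual outcome.
    p_a = (tp + tn) / total
    # Expected agreement by chance, summed over both classes:
    # (share predicted as the class) * (share actually in the class).
    p_pr = ((tp + fp) / total) * ((tp + fn) / total) \
         + ((fn + tn) / total) * ((fp + tn) / total)
    return (p_a - p_pr) / (1 - p_pr)

kappa = cohen_kappa(tp=85, fp=4, fn=2, tn=9)
print(round(kappa, 4))
```

A model whose predictions always match the actual outcome has P(a) = 1 and therefore a Kappa of 1, consistent with the note above.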
EVALUATING A MODEL (ROC CURVE)
TPR (True Positive Rate) = TP / (TP + FN)
FPR (False Positive Rate) = FP / (FP + TN)
Receiver Operating Characteristic curve
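A ROC curve plots TPR against FPR as the classification threshold is varied. The sketch below is a minimal illustration under stated assumptions: labels are 1 for the class of interest and 0 otherwise, higher scores mean "more likely positive", both classes are present, and the function names, labels and scores are all hypothetical.

```python
def tpr_fpr(tp, fp, fn, tn):
    """Return (true positive rate, false positive rate)."""
    return tp / (tp + fn), fp / (fp + tn)

def roc_points(labels, scores):
    """Return (FPR, TPR) points, one per distinct score threshold."""
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 0)
        fn = sum(1 for y, s in zip(labels, scores) if s < t and y == 1)
        tn = sum(1 for y, s in zip(labels, scores) if s < t and y == 0)
        tpr, fpr = tpr_fpr(tp, fp, fn, tn)
        points.append((fpr, tpr))
    return points

# Hypothetical actual labels and model scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(labels, scores)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

The curve always runs from (0, 0) to (1, 1); the closer it bends toward the top-left corner, the better the classifier separates the two classes.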
EVALUATING A MODEL (REGRESSION)
[Figure: value of the apartment unit plotted against area (in square feet); the error is the gap between the actual value and the predicted value]
EVALUATING A MODEL (CLUSTERING)
“Clustering is in the eye of the beholder”
Internal evaluation
Silhouette width
External evaluation
Purity
EVALUATING A MODEL (CLUSTERING)
a(i) – Average distance between the i-th data instance and all other
data instances belonging to the same cluster
b(i) – Lowest average distance between the i-th data instance and
data instances of all other clusters
[Figure: Silhouette width calculation across Cluster 1 to Cluster 4]
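Given a(i) and b(i) as defined above, the silhouette width of the i-th data instance is commonly computed as (b(i) - a(i)) / max(a(i), b(i)). The sketch below assumes Euclidean distance, at least two clusters, and a width of 0 for an instance that is alone in its cluster; the function name and the points are hypothetical.

```python
import math

def silhouette_width(i, points, labels):
    """Silhouette width of the i-th point, in the range [-1, 1]."""
    own = labels[i]
    same = [math.dist(points[i], p)
            for j, p in enumerate(points) if labels[j] == own and j != i]
    if not same:
        return 0.0  # assumed convention for a single-member cluster
    a = sum(same) / len(same)
    # b: lowest average distance to the members of any other cluster.
    b = min(
        sum(math.dist(points[i], p)
            for j, p in enumerate(points) if labels[j] == other)
        / labels.count(other)
        for other in set(labels) if other != own
    )
    return (b - a) / max(a, b)

# Hypothetical 2-D points in two well-separated clusters.
points = [(0, 0), (0, 1), (5, 0), (5, 1)]
labels = [0, 0, 1, 1]
widths = [silhouette_width(i, points, labels) for i in range(len(points))]
print([round(w, 3) for w in widths])
```

Values close to 1 indicate that an instance sits well inside its own cluster; averaging the widths over all instances gives a single score for the clustering.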
ENSEMBLE
THANK YOU &
STAY TUNED!