UNIT-III
Mapping problems to machine learning tasks
•As a data scientist, your task is to map a business problem to a
good machine learning method.
•Let’s look at a real-world situation. Suppose that you’re a data
scientist at an online retail company.
•There are many business problems that you may need to address.
Predicting customers
Identifying fraudulent transactions
Determining price
Grouping customers with similar purchasing behaviour
Marketing campaigns
Seminar on
•Supervised Learning
•Unsupervised Learning
•Reinforcement Learning
For this purpose, we will group the different kinds of problems that a data scientist typically solves into these categories:
Classification—Assigning labels to data
Scoring—Assigning numerical values to data
Grouping—Discovering commonalities in data
Classification problems
•Product categorization based on product attributes and/or text descriptions of the product is an
example of classification.
•Suppose your task is to automate the assignment of new products to your company’s product
categories.
Scoring problems
Example 1:
Predicting the increase in sales from a particular marketing campaign, based on factors such as
the communication channel (ads on websites, YouTube videos, print media, email, and so on)
and the traffic source (Facebook, Google, radio stations, and so on), is an example of scoring.
Example 2:
Supervised methods work with known targets; unsupervised methods (grouping) work without known targets.
EX: discovering that a mobile phone, a cover, and tempered glass are frequently purchased together.
Problem-to-method mapping
•You can’t simply fit a model on a set of
training data and declare, ‘Yes, this
will work.’
•To ensure that the model is correctly
trained on the data provided, without
fitting too much noise, you need to use
cross-validation techniques.
Evaluating Models
•When building a model, you must be able to estimate model quality in order
to ensure that your model will perform well in the real world.
•One of the things that helps you identify whether your model is
performing well or not is checking for overfitting.
• An undesirable behaviour that occurs when the model gives accurate
predictions for training data but not for new data is overfitting.
•A model’s prediction error on the data that it trained from is called the training
error, i.e., the average loss that occurred during the training process.
• A model’s prediction error on new data is called the generalization error, i.e.,
how accurately your algorithm will predict values it has not seen before.
•In order to evaluate a model’s performance and to detect overfitting, we
have two categories:
Hold-out method
K-fold cross-validation
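Since k-fold cross-validation is only listed here and not expanded later, here is a minimal sketch in base R, reusing the mtcars data and the same glm formula that appears later in these notes (the choice of 5 folds and the 0.5 cut-off are illustrative assumptions):

# Minimal 5-fold cross-validation sketch on mtcars
set.seed(42)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))  # assign each row to a fold
acc <- numeric(k)
for (i in 1:k) {
  train <- mtcars[folds != i, ]                      # train on k-1 folds
  test  <- mtcars[folds == i, ]                      # validate on the held-out fold
  m <- glm(am ~ mpg + drat + gear, data = train, family = "binomial")
  p <- predict(m, newdata = test, type = "response")
  acc[i] <- mean((p > 0.5) == (test$am == 1))        # accuracy on this fold
}
mean(acc)  # cross-validated accuracy estimate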
Hold Out method
The hold-out method for training machine learning models is a technique that involves splitting the data into different sets: one set for training, and other sets for validation and testing.
The hold-out method is used to check how well a machine learning model will perform on new data. Instead of using the entire dataset for training, separate sets called the validation set and the test set are set aside (hence the name "hold-out") from the dataset, and the model is trained only on what is termed the training dataset.
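A minimal sketch of such a three-way hold-out split in base R (the 60/20/20 proportions and the data frame name df are illustrative assumptions, not from the slides):

# Three-way hold-out split: ~60% training, ~20% validation, ~20% test
# 'df' is a placeholder for your own data frame
set.seed(42)
idx <- sample(c("train", "valid", "test"), size = nrow(df),
              replace = TRUE, prob = c(0.6, 0.2, 0.2))
train_set <- df[idx == "train", ]  # used to fit the model
valid_set <- df[idx == "valid", ]  # used to tune hyperparameters
test_set  <- df[idx == "test", ]   # used only for the final evaluation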
Validation means tuning of hyperparameters.
In machine learning, a hyperparameter is a
parameter whose value is used to control the
learning process.
Hyperparameters for a decision tree:
max depth
max leaf nodes
For a random forest:
min samples split
max features
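A minimal sketch of setting such hyperparameters in R, assuming the rpart and randomForest packages; note that in randomForest the closest equivalents are mtry (number of features tried at each split) and nodesize (minimum node size), so the names below follow those packages rather than the slide wording:

library(rpart)          # decision trees
library(randomForest)   # random forests

# Decision tree on mtcars: control depth and leaf growth via rpart.control()
tree_model <- rpart(am ~ mpg + drat + gear, data = mtcars, method = "class",
                    control = rpart.control(maxdepth = 3, minsplit = 5))

# Random forest: mtry plays the role of "max features", nodesize limits splitting
rf_model <- randomForest(factor(am) ~ mpg + drat + gear, data = mtcars,
                         ntree = 100, mtry = 2, nodesize = 3)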
Evaluating models
• To decide if a given score is high or low, we generally
compare our model’s performance to a few baseline models.
• THE NULL MODEL
• SINGLE-VARIABLE MODELS
• GENERAL MODEL /MULTIPLE VARIABLE MODEL
The null Model
• The most typical null model is a model that returns the same
answer for all situations .
• We use null models as a lower bound on desired performance.
• For example, in a categorical problem, the null model would
always return the most popular category, as this is the easy
guess that is least often wrong.
• For a score model, the null model is often the average of all
the outcomes, as this has the least square deviation from all
the outcomes.
• The idea is that if you’re not outperforming the null model,
you’re not delivering value.
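A minimal sketch of both kinds of null model in R, using mtcars purely for illustration (am as the categorical outcome, mpg as the score outcome):

# Null model for a categorical outcome: always predict the most common class
most_common <- names(which.max(table(mtcars$am)))
mean(most_common == mtcars$am)                 # accuracy of the null model

# Null model for a score outcome: always predict the overall mean
null_pred <- mean(mtcars$mpg)
sqrt(mean((null_pred - mtcars$mpg)^2))         # RMSE of the null model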
SINGLE-VARIABLE MODELS
• Single-variable models are simply models built using only one
variable at a time.
• Single variables can be categorical or numerical.
• A single-variable model based on categorical features is easiest
to describe as a table. Business analysts use a pivot table (which
promotes values or levels of a feature to be families of new
columns) and statisticians use what’s called a contingency
table (where each possibility is given a column name).
• There are a number of ways to use a numeric feature to
make predictions. A common method is to bin the numeric
feature into a number of ranges and then use the range
labels as a new categorical variable.
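A minimal sketch in R of the two approaches above, again using mtcars for illustration (gear as the categorical feature, mpg as the numeric feature to be binned; the bin boundaries are arbitrary):

# Categorical feature: a contingency table of the outcome (am) by the feature (gear);
# the single-variable model predicts, per gear level, the most common am value
table(gear = mtcars$gear, am = mtcars$am)

# Numeric feature: bin mpg into ranges, then treat the bin label as categorical
mtcars$mpg_bin <- cut(mtcars$mpg, breaks = c(0, 15, 20, 25, 35))
table(mpg_bin = mtcars$mpg_bin, am = mtcars$am)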
MULTIPLE VARIABLE MODELS
• Models that combine the effects of many
variables tend to be much more powerful than
models that use only a single variable.
• Variable selection: a key part of building multiple-variable models is
selecting which variables to use and how the variables are to be
transformed or treated.
For example:
Positive linear relationship: in general, the
income of a person increases as his/her age increases.
Negative linear relationship: If the vehicle increases its
speed, the time taken to travel decreases, and vice versa.
(Worked example: for N = 7 paired observations, substituting the values into the
correlation formula gives a high positive correlation.)
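For reference, the Pearson correlation formula that the worked example substitutes the values into is:

r = \frac{N\sum xy - \sum x \sum y}{\sqrt{\left[N\sum x^{2} - (\sum x)^{2}\right]\left[N\sum y^{2} - (\sum y)^{2}\right]}}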
cor(cars,method = "pearson")
cor(cars,method ="spearman")
P-VALUE
The p-value is a measure of the evidence against a null hypothesis.
---The Pr(>|z|) column represents the p-value associated with the value in the z
value column.
---If the p-value is less than a certain significance level (e.g. α = .05) then this
indicates that the predictor variable has a statistically significant relationship with
the response variable in the model.
---In simple terms: it tells you how effective and how helpful the selected
attributes are in fitting your model.
• The p-value (also called the significance) is one of
the most important diagnostic columns in the
coefficient summary.
• The p-value estimates the probability of seeing a
coefficient with a magnitude as large as you
observed if the true coefficient is really zero (if the
variable has no effect on the outcome).
• So don’t trust the estimate of any coefficient with a
large p-value.
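A minimal sketch of where these p-values live in R, assuming a fitted logistic regression such as the logistic_model built later in these notes:

# The coefficient table of a glm summary contains the Pr(>|z|) column
summary(logistic_model)$coefficients
# Extract only the p-values for the intercept and each predictor
summary(logistic_model)$coefficients[, "Pr(>|z|)"]
# Flag the coefficients that are significant at the 0.05 level
summary(logistic_model)$coefficients[, "Pr(>|z|)"] < 0.05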
PERFORMANCE METRICS
• Confusion Matrix
• Accuracy
• Precision
• Recall
• F1 Score
Confusion Matrix
The confusion matrix is a table counting how often each combination of
known outcomes (the truth) occurred in combination with each prediction
type.
Accuracy
• For a classifier, accuracy is defined as the number of items
categorized correctly divided by the total number of items.
• Out of 100 predictions, our model predicted the correct element 73 times.
Therefore, the accuracy of our model is 73%; 73 correct predictions out of 100
predictions made.
• Out of all the predictions made, how many are correct.
• It tells how often model predictions match the actual labels
of the data
• For example, let’s say we have a machine that classifies if a
fruit is an apple or not. In a sample of hundreds of apples and
oranges, the accuracy of the machine will be how many apples
it classified correctly as apples and how many oranges it
classified as not apples divided by the total number of apples
and oranges.
Accuracy: (TP + TN) / (TP + FP + TN + FN)
Precision
• Precision is defined as the ratio of true positives to predicted
positives.
• Ex: Measure of patients that actually have heart disease out of all the
patients that we identify (predict) as having it.
• Out of all the patients predicted to be heart patients, how many truly are.
• Out of all positive predictions, how many are correctly
predicted.
• precision: TP/(TP + FP)
OR
• TP/predicted positives
Recall
• Recall is the ratio of true positives over all
actual positives,
• Recall=TP/(TP + FN)
or
• TP/all positives
F1 Score
• F1 score is a machine learning evaluation metric that measures
a model’s accuracy. It combines the precision and recall scores
of a model.
• Precision measures how many of the “positive” predictions
made by the model were correct.
• Recall measures how many of the positive class samples
present in the dataset were correctly identified by the model.
• The F1 score combines precision and recall using their
harmonic mean, and maximizing the F1 score implies
simultaneously maximizing both precision and recall.
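A small numeric illustration of these metrics in R; the counts TP = 70, FP = 10, FN = 20, TN = 100 are made up purely for illustration:

TP <- 70; FP <- 10; FN <- 20; TN <- 100       # hypothetical confusion-matrix counts
accuracy  <- (TP + TN) / (TP + FP + TN + FN)  # 170/200 = 0.85
precision <- TP / (TP + FP)                   # 70/80  = 0.875
recall    <- TP / (TP + FN)                   # 70/90  = 0.778 (approx.)
F1        <- 2 * precision * recall / (precision + recall)  # approx. 0.824
c(accuracy = accuracy, precision = precision, recall = recall, F1 = F1)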
Evaluating Classification Model
library('caTools') #for splitting (sample.split)
cars<-mtcars
summary(mtcars)
#Min – Minimum value in the given data
#1st Quartile – first quartile in the data
#Median – Median of the data
#Mean – Mean of the data
#3rd Quartile – third quartile in the data
#Max – Maximum value in the given data
cor(cars,method = "pearson")
#split the dataset 50/50 (sample.split expects the outcome vector, e.g. mtcars$am)
split <- sample.split(mtcars$am, SplitRatio = 0.5)
#take the rows where 'split' is TRUE as the training set (about 50% of the data),
#using the subset() function
train_reg <- subset(mtcars, split == TRUE)
dim(train_reg)
#then take the remaining rows (split is FALSE) as the test set
test_reg <- subset(mtcars, split == FALSE)
dim(test_reg)
# Training model--build a model to predict "am" from mpg, drat and gear using the training set
#glm--Generalized Linear Model--the outcome is 0/1, so family = binomial
logistic_model <- glm(am ~ mpg + drat + gear, data = train_reg,
                      family = "binomial")
logistic_model
#view model summary
summary(logistic_model)
#Predicting the probability of training and testing
#type= response gives the predicted probability
train_reg$pred <-predict(logistic_model,newdata=train_reg,
type = 'response')
train_reg$pred
test_reg$pred <- predict(logistic_model,newdata=test_reg,
type = "response")
test_reg$pred
confmat <- table(truth = test_reg$am ,
prediction = ifelse(test_reg$pred > 0.5,"manual", "auto"))
print(confmat)
#accuracy
(confmat[1,1] + confmat [2,2]) / sum(confmat)
precision <- confmat [1,1] / (confmat [1,1]+ confmat [2,1])
print(precision)
recall <- confmat [1,1]/(confmat [1,1]+ confmat [1,2])
print(recall)
F1 <- 2 * precision * recall / (precision + recall)
print(F1)
Evaluating Scoring Model
fit a model that predicts temperature (in Fahrenheit) from the chirp rate (chirps/sec)
crickets <- read.csv("CricketChirps1.csv")
#The lm() function creates a linear regression model in R.
#lm( fitting_formula, dataframe )
#dataframe: the name of the data frame that contains the data.
#This function takes an R formula Y ~ X, where Y is the outcome
#variable and X is the predictor variable.
cricket_model <- lm(temperatureF ~ chirp_rate, data=crickets)
crickets$temp_pred <- predict(cricket_model, newdata=crickets)
crickets$temp_pred
#RMSE is the square root of the mean of the squared errors.
error_sq <- (crickets$temp_pred - crickets$temperatureF)^2
RMSE <- sqrt(mean(error_sq))
#Formula for R-SQUARED
#R^2 = 1 - RSS/TSS
#R^2 = coefficient of determination
#RSS = sum of squares of residuals
#TSS = total sum of squares
error_sq <- (crickets$temp_pred - crickets$temperatureF)^2
numerator <- sum(error_sq)
delta_sq <- (crickets$temperatureF - mean(crickets$temperatureF))^2
denominator = sum(delta_sq)
(R2 <- 1 - numerator/denominator)
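As a sanity check (assuming the cricket_model fitted above), the manually computed value should agree with the R-squared reported by the linear model's summary:

summary(cricket_model)$r.squared   # should match the R2 computed above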
#plot(cricket_model)
library('lattice')
xyplot(temperatureF ~ chirp_rate, data=crickets)
#If you want to draw a regression line along with your scatterplot,
#use the argument type with points ("p") and a regression line ("r").
xyplot(temperatureF ~ chirp_rate, data=crickets, type=c("p","r"))
The differences between the actual temperatureF values and the predicted temp_pred values are called
the residuals or the error of the model on the data. We will use the residuals to calculate
some common performance metrics for scoring models.
Evaluating Probability Model
• Probability models are useful for both
classification and scoring tasks.
•Probability models are models that both decide if
an item is in a given class and return an estimated
probability (or confidence) of the item being in
the class.
ROC (Receiver Operating Characteristic) Curve
&
AUC (Area Under the Curve)
Terms used in the ROC curve and AUC:
Sensitivity / True Positive Rate
Specificity / True Negative Rate
False Positive Rate
The higher the AUC, the better the model's performance.
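A minimal sketch of these terms in R, reusing the confmat table built in the classification example earlier and treating "manual" (am = 1) as the positive class; this class choice is an assumption made for illustration:

# Rows of confmat are the truth (0/1), columns the prediction ("auto"/"manual")
TP <- confmat[2, 2]; FN <- confmat[2, 1]      # actual manual cars
TN <- confmat[1, 1]; FP <- confmat[1, 2]      # actual automatic cars
sensitivity <- TP / (TP + FN)                 # true positive rate (recall)
specificity <- TN / (TN + FP)                 # true negative rate
FPR <- 1 - specificity                        # false positive rate
c(sensitivity = sensitivity, specificity = specificity, FPR = FPR)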
library('WVPlots')
ROCPlot(test_reg,xvar = 'pred',truthVar = 'am',truthTarget = TRUE ,title =
'mtcars')
LOG LIKELIHOOD
An important evaluation of an estimated probability is
the log likelihood.
The log likelihood is the logarithm of the product of the probabilities the model
assigned to the observed outcomes (equivalently, the sum of their logarithms).
For a spam email with an estimated likelihood of 0.9 of
being spam, the log likelihood is log(0.9)
For a non-spam email, the same score of 0.9 is a log
likelihood of log(1-0.9) (or just the log of 0.1, which was
the estimated probability of not being spam).
The closer to 0 the log likelihood is, the better the prediction.
For a spam email, the probability assigned to the observed outcome is the predicted probability of spam, e.g. 0.98.
For a non-spam email with the same score, it is the probability of not being spam, i.e. (1 - 0.98).
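A minimal sketch of the calculation in R; the label vector y (1 = spam) and the predicted probabilities p are made-up values for illustration:

y <- c(1, 1, 0, 0, 1)                  # true labels: 1 = spam, 0 = not spam
p <- c(0.90, 0.98, 0.20, 0.60, 0.70)   # model's predicted probability of spam
log_lik <- sum(log(ifelse(y == 1, p, 1 - p)))  # log of the probability assigned to each outcome
log_lik                                # closer to 0 means better predictions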
Probability is used to calculate the chance of an event happening before
it occurs.
Likelihood is used to evaluate how well observed data fits a particular
hypothesis or explanation. It is about assessing the probability of an event
after it has already occurred, based on the evidence or data you have.
Example:
Imagine you are a farmer, and you want to determine the likelihood of
rain tomorrow to decide whether or not to water your crops.
Probability: if you found that it rained on 30 out of the last 100 days, then
the probability of rain tomorrow can be estimated as 30% (30 days with
rain out of 100 days total).
Likelihood: if you wake up in the morning and notice dark clouds in the
sky, a drop in temperature, and strong winds, you might say that the
likelihood of rain tomorrow is high.
Another Example for Probability and likelihood
Imagine you and your friend, let's call him John, are going on a picnic, and
you have a weather app that predicts the chance of rain.
Probability: Probability is about predicting the likelihood of an event
before it happens. For example, your weather app says, "There is a 30%
chance of rain today." This means that out of 100 days with similar
weather conditions as today, it's likely to rain on approximately 30 of those
days. Probability deals with predictions or forecasts of future events.
Likelihood: Likelihood, on the other hand, is about assessing the
probability of an event after it has already occurred, based on the evidence
or data you have. So, let's say you and John went on the picnic, and when
you came back home, you noticed that it did rain. Now, you might say, "It
looks like the weather app's prediction was correct; the likelihood of rain
was high today." The likelihood is a measure of how well the observed
data fits a particular probability prediction.
AIC
Akaike information criterion (AIC) is a way to measure the quality of a
statistical model in a simple and understandable manner. It helps us choose the
best model among several competing models.
When comparing models (on the same test set), you will generally prefer the
model with the smaller AIC. The AIC is useful for comparing models with
different measures of complexity and modeling variables with differing
numbers of levels.
We want models that fit the data well but aren't too complicated, as overly
complex models can lead to overfitting.
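For reference, the standard definition is AIC = 2k - 2 ln(L), where k is the number of fitted parameters and L is the maximized likelihood of the model; in R the built-in AIC() function reports this for fitted models (the two candidate models below are illustrative):

# Compare two candidate logistic models fitted to the same data; prefer the smaller AIC
model_a <- glm(am ~ mpg, data = mtcars, family = "binomial")
model_b <- glm(am ~ mpg + drat + gear, data = mtcars, family = "binomial")
AIC(model_a, model_b)   # returns the df and AIC of each model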
Example:
Buying a new smartphone for your grandpa. Imagine you're comparing three
different smartphones (models) based on their features and performance (fitting
the data). The goal is to find the best smartphone that suits your needs.
Simple Model (Model A): This phone has the basic
features you need, such as calling, texting, and a high-quality
camera, but it lacks some advanced features like internet
browsing or a large storage capacity.
Intermediate Model (Model B): This phone has more
features, including better internet browsing and a larger
storage capacity, but it's also more expensive than Model
A.
Complex Model (Model C): This phone is a high-end
model with all the latest features, including a top-notch
camera, large storage capacity, virtual reality support, and
more. However, it's the most expensive of the three.
If Model A fits your basic needs and has a lower AIC value,
it would be the preferred choice as it provides a good
balance of utility and affordability.
If Model B has slightly better features and performance, but
the AIC is not significantly lower than Model A, you might
consider whether the extra features are worth the increased
cost.
If Model C has the highest AIC, it may be too complex and
expensive for your needs, and you might decide it's not
worth the extra cost for features you won't use.