Unit – 4
(Learning Notes)
SYLLABUS:
Advanced Analytics and Statistical Modelling for Big Data:
Naive Bayesian Classifier,
K-means Clustering,
Linear and Logistic Regression
Naive Bayesian Classifier
Bayes' theorem: P(c|x) = P(x|c) * P(c) / P(x), where:
P(c|x) is the posterior probability of class c (target) given predictor x (attributes).
P(c) is the prior probability of the class.
P(x|c) is the likelihood, i.e. the probability of the predictor given the class.
P(x) is the prior probability of the predictor.
Example: What is the probability of playing golf on a windy, sunny day with
high temperature and high humidity? That is, classify
X = (Outlook = Sunny, Temperature = High, Humidity = High, Wind = True).
Row No  Outlook   Temperature  Humidity  Wind   Play
1       Sunny     85           85        False  No
2       Sunny     80           90        True   No
3       Overcast  83           78        False  Yes
4       Rain      70           96        False  Yes
5       Rain      68           80        False  Yes
6       Rain      65           70        True   No
7       Overcast  64           65        True   Yes
8       Sunny     72           95        False  No
9       Sunny     69           70        False  Yes
10      Rain      75           80        False  Yes
Class priors:
P(Play = Yes) = 6/10 = 0.6
P(Play = No) = 4/10 = 0.4
Outlook (Played or Not Played):
          Yes  No  P(x|Yes)  P(x|No)
Sunny      1    3    1/6       3/4
Overcast   2    0    2/6       0/4
Rain       3    1    3/6       1/4
Temperature (Played or Not Played):
             Yes  No  P(x|Yes)  P(x|No)
Low (<70)     3    1    3/6       1/4
High (>=70)   3    3    3/6       3/4
Humidity (Played or Not Played):
             Yes  No  P(x|Yes)  P(x|No)
Low (<85)     5    1    5/6       1/4
High (>=85)   1    3    1/6       3/4
Wind (Played or Not Played):
        Yes  No  P(x|Yes)  P(x|No)
True     1    2    1/6       2/4
False    5    2    5/6       2/4
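These conditional tables can be cross-checked in R; here is a minimal sketch for the Outlook attribute (the two vectors are typed in from the ten rows above):
#Cross-check the Outlook table in R
play <- c("No","No","Yes","Yes","Yes","No","Yes","No","Yes","Yes")
outlook <- c("Sunny","Sunny","Overcast","Rain","Rain","Rain","Overcast","Sunny","Sunny","Rain")
counts <- table(outlook, play)
counts
#Each column sums to 1, giving P(Outlook | Play)
prop.table(counts, margin = 2)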
P(Wind = True | Yes) = 1/6      P(Wind = True | No) = 2/4
P(Outlook = Sunny | Yes) = 1/6  P(Outlook = Sunny | No) = 3/4
P(Temp = High | Yes) = 3/6      P(Temp = High | No) = 3/4
P(Humidity = High | Yes) = 1/6  P(Humidity = High | No) = 3/4
Given the conditions and Played (Yes):
P(X|Yes) * P(Yes)
= 1/6 * 1/6 * 3/6 * 1/6 * 6/10
= (0.167) * (0.167) * (0.5) * (0.167) * (0.6)
= 0.00139
Given the conditions and Not Played (No):
P(X|No) * P(No)
= 2/4 * 3/4 * 3/4 * 3/4 * 4/10
= (0.5) * (0.75) * (0.75) * (0.75) * (0.4)
= 0.08438
Normalization Factor
P(X) is the sum of the two unnormalized scores, which guarantees the
posteriors sum to 1:
P(X) = P(X|Yes) * P(Yes) + P(X|No) * P(No)
= 0.00139 + 0.08438
= 0.08576
Bayes Probability for Played
P(Yes|X) = P(X|Yes) * P(Yes) / P(X) = 0.00139 / 0.08576 = 0.0162
Bayes Probability for Not Played
P(No|X) = P(X|No) * P(No) / P(X) = 0.08438 / 0.08576 = 0.9838
The classifier therefore predicts No: golf is very unlikely to be played
under these conditions.
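The same arithmetic can be verified in R; this minimal sketch hard-codes the probabilities read from the tables above:
#Verify the hand computation in R
p_x_yes <- (1/6) * (1/6) * (3/6) * (1/6)   #Wind=True, Sunny, Temp High, Humidity High, given Yes
p_x_no  <- (2/4) * (3/4) * (3/4) * (3/4)   #the same attribute values, given No
p_yes <- 6/10
p_no  <- 4/10
evidence <- p_x_yes * p_yes + p_x_no * p_no   #normalization factor P(X)
p_x_yes * p_yes / evidence   #posterior for Yes, about 0.0162
p_x_no  * p_no  / evidence   #posterior for No, about 0.9838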
#Getting started with Naive Bayes
#Install the package
#install.packages("e1071")
#Loading the library
library(e1071)
?naiveBayes
#The documentation also contains an example implementation for the
#Titanic dataset
#Next load the Titanic dataset
data("Titanic")
str(Titanic)
#Save into a data frame and view it
Titanic_df=as.data.frame(Titanic)
head(Titanic_df,10)
#Creating data from table
repeating_sequence=rep(seq_len(nrow(Titanic_df)), Titanic_df$Freq)
#This repeats each combination as many times as its frequency
#Create the dataset using the repetition sequence
Titanic_dataset=Titanic_df[repeating_sequence,]
#We no longer need the frequency, drop the feature
Titanic_dataset$Freq=NULL
#Fitting the Naive Bayes model
Naive_Bayes_Model=naiveBayes(Survived ~., data=Titanic_dataset)
#What does the model say? Print the model summary
Naive_Bayes_Model
#Prediction on the dataset
NB_Predictions=predict(Naive_Bayes_Model,Titanic_dataset)
NB_Predictions
#Confusion matrix to check accuracy
table(NB_Predictions,Titanic_dataset$Survived)
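As a follow-up (not part of the e1071 example itself), the overall accuracy is the fraction of predictions on the diagonal of that confusion matrix:
#Overall accuracy from the confusion matrix
conf_mat <- table(NB_Predictions, Titanic_dataset$Survived)
sum(diag(conf_mat)) / sum(conf_mat)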
K-Means Clustering
Algorithm:
1. Accept the data set.
2. Choose k initial means (for k = 2, two random values or two points from
the data).
3. Assign each point to its nearest mean, recompute each mean from its
cluster, and repeat until the means stop changing, i.e. two successive
iterations produce the same clusters (see the manual sketch after the
kmeans() example below).
Sample – 1:
Dataset: {2, 3, 4, 10, 11, 12, 20, 25, 30}
Initial means: 3, 20
#Getting started with K-Mean
dataset <- c(2, 3, 4, 10, 11, 12, 20, 25, 30)
dataset
cluster <- kmeans(dataset, centers = c(3, 20), iter.max = 10)
cluster$cluster               #cluster assignment for each point
cluster$size                  #number of points in each cluster
dataset[cluster$cluster==1]   #members of cluster 1
dataset[cluster$cluster==2]   #members of cluster 2
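To make step 3 of the algorithm concrete, the loop below re-runs the same clustering by hand. This is a minimal sketch for one-dimensional data and two means; kmeans() itself is more general, so treat this as an illustration rather than its internal implementation:
#Manual k-means iteration for the same dataset and initial means
m <- c(3, 20)
repeat {
  #Step 3a: assign each point to the nearest mean
  assignment <- ifelse(abs(dataset - m[1]) <= abs(dataset - m[2]), 1, 2)
  #Step 3b: recompute each mean from its cluster
  new_m <- c(mean(dataset[assignment == 1]), mean(dataset[assignment == 2]))
  #Stop when two successive iterations give the same means
  if (all(new_m == m)) break
  m <- new_m
}
m            #converged means: 7 and 25
assignment   #clusters {2, 3, 4, 10, 11, 12} and {20, 25, 30}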
#Getting started with Regression Model
# The predictor vector.
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
# The response vector.
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
# Apply the lm() function.
relation <- lm(y~x)
# Find weight of a person with height 170.
a <- data.frame(x = 170)
result <- predict(relation,a)
print(result)
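To read off the fitted line itself (y = a + b*x), inspect the coefficients and the model summary:
# Inspect the intercept a and slope b of the fitted line.
print(coef(relation))
# Full summary: coefficients, residuals, and R-squared.
print(summary(relation))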