
Unit – 4

(Learning Notes)
SYLLABUS:
 Advanced Analytics and Statistical Modelling for Big Data:
 Naive Bayesian Classifier,
 K-means Clustering
 Linear and Logistic Regression

Naive Bayesian Classifier

The classifier is based on Bayes' theorem:

    P(c|x) = P(x|c) * P(c) / P(x)

where:
 P(c|x) is the posterior probability of class c (target) given predictor x (attributes).
 P(c) is the prior probability of the class.
 P(x|c) is the likelihood, i.e. the probability of the predictor given the class.
 P(x) is the prior probability of the predictor.

Example: What is the probability of playing golf on a windy, sunny day with hot weather and high humidity?

Row No  Outlook   Temperature  Humidity  Wind   Play
1       Sunny     85           85        False  No
2       Sunny     80           90        True   No
3       Overcast  83           78        False  Yes
4       Rain      70           96        False  Yes
5       Rain      68           80        False  Yes
6       Rain      65           70        True   No
7       Overcast  64           65        True   Yes
8       Sunny     72           95        False  No
9       Sunny     69           70        False  Yes
10      Rain      75           80        False  Yes

 P(C = Play) = 6/10 = 0.6


 P(C = No) = 4/10 = 0.4

Played or Not Played

Outlook    Yes  No  P(Yes)  P(No)
Sunny      1    3   1/6     3/4
Overcast   2    0   2/6     0/4
Rain       3    1   3/6     1/4

Played or Not Played

Temperature   Yes  No  P(Yes)  P(No)
Low (<70)     3    1   3/6     1/4
High (>=70)   3    3   3/6     3/4

Played or Not Played

Humidity      Yes  No  P(Yes)  P(No)
Low (<85)     5    1   5/6     1/4
High (>=85)   1    3   1/6     3/4

Played or Not Played

Wind    Yes  No  P(Yes)  P(No)
True    1    2   1/6     2/4
False   5    2   5/6     2/4


P(X = Wind[True] | Play) = 1/6    P(X = Wind[True] | No) = 2/4
P(X = Sunny | Play) = 1/6         P(X = Sunny | No) = 3/4
P(X = Hot | Play) = 3/6           P(X = Hot | No) = 3/4
P(X = High | Play) = 1/6          P(X = High | No) = 3/4

Given Condition & Played:

P(X|C) * P(C)
= 1/6 * 1/6 * 3/6 * 1/6 * 6/10
= (0.167) * (0.167) * (0.5) * (0.167) * (0.6)
≈ 0.0014

Given Condition & Not Played:

P(X|C) * P(C)
= 2/4 * 3/4 * 3/4 * 3/4 * 4/10
= (0.5) * (0.75) * (0.75) * (0.75) * (0.4)
≈ 0.0844

Normalization Factor (law of total probability, so the two posteriors sum to 1):
P(X) = P(X|Play)*P(Play) + P(X|No)*P(No)
     = 0.0014 + 0.0844
     = 0.0858

Bayes Probability for Played

P(X|C)*P(C) / P(X) = 0.0014 / 0.0858 ≈ 0.016

Bayes Probability for Not Played

P(X|C)*P(C) / P(X) = 0.0844 / 0.0858 ≈ 0.984

Since the posterior for "Not Played" is far larger, the classifier predicts that golf will not be played under these conditions.
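The hand calculation can be cross-checked with a short script that derives every count directly from the 10-row table (Python is used here purely as an illustrative calculator; the row encoding of Hot = temperature >= 70 and High = humidity >= 85 follows the tables above):

```python
# Each row: (outlook, hot, high_humidity, windy, play), taken from the 10-row table
rows = [
    ("Sunny",    True,  True,  False, "No"),
    ("Sunny",    True,  True,  True,  "No"),
    ("Overcast", True,  False, False, "Yes"),
    ("Rain",     True,  True,  False, "Yes"),
    ("Rain",     False, False, False, "Yes"),
    ("Rain",     False, False, True,  "No"),
    ("Overcast", False, False, True,  "Yes"),
    ("Sunny",    True,  True,  False, "No"),
    ("Sunny",    False, False, False, "Yes"),
    ("Rain",     True,  False, False, "Yes"),
]

def joint(play):
    """P(X|C) * P(C) for the query: Sunny, Hot, High humidity, Windy."""
    subset = [r for r in rows if r[4] == play]
    n = len(subset)
    p = n / len(rows)                                # prior P(C)
    p *= sum(r[0] == "Sunny" for r in subset) / n    # P(Sunny | C)
    p *= sum(r[1] for r in subset) / n               # P(Hot | C)
    p *= sum(r[2] for r in subset) / n               # P(High humidity | C)
    p *= sum(r[3] for r in subset) / n               # P(Windy | C)
    return p

yes = joint("Yes")
no = joint("No")
evidence = yes + no  # P(X) via the law of total probability
print(round(yes / evidence, 4), round(no / evidence, 4))
```

The two printed posteriors sum to 1, and "No" dominates, matching the hand prediction that golf is not played.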
#Getting started with Naive Bayes
#Install the package
#install.packages("e1071")
#Loading the library
library(e1071)
?naiveBayes
#The documentation also contains an example implementation on the Titanic dataset
#Next load the Titanic dataset
data("Titanic")
str(Titanic)

#Save into a data frame and view it
Titanic_df=as.data.frame(Titanic)
head(Titanic_df,10)
#Creating data from table
repeating_sequence=rep(seq_len(nrow(Titanic_df)), Titanic_df$Freq)
#This will repeat each combination equal to the frequency of each combination

#Create the dataset by row repetition created


Titanic_dataset=Titanic_df[repeating_sequence,]
#We no longer need the frequency, drop the feature
Titanic_dataset$Freq=NULL

#Fitting the Naive Bayes model


Naive_Bayes_Model=naiveBayes(Survived ~., data=Titanic_dataset)
#What does the model say? Print the model summary
Naive_Bayes_Model

#Prediction on the dataset


NB_Predictions=predict(Naive_Bayes_Model,Titanic_dataset)
NB_Predictions

#Confusion matrix to check accuracy


table(NB_Predictions,Titanic_dataset$Survived)
K-Means Clustering
Algorithm:
1. Accept the data set.
2. Choose the initial means (e.g. k random values, or k given starting points).
3. Repeat: assign every point to its nearest mean and recompute each mean, until
two successive clusterings (and hence the means) are identical.

Sample – 1:
Dataset: {2, 3, 4, 10, 11, 12, 20, 25, 30}
Means: 3, 12
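Sample – 1 can be traced with a minimal 1-D k-means sketch (written from scratch in Python for illustration; it follows the three steps above, starting from the given means 3 and 12):

```python
def kmeans_1d(data, means, max_iter=10):
    """Lloyd's algorithm on 1-D data with explicit starting means."""
    for _ in range(max_iter):
        # Step: assign each point to the nearest current mean
        clusters = [[] for _ in means]
        for x in data:
            nearest = min(range(len(means)), key=lambda i: abs(x - means[i]))
            clusters[nearest].append(x)
        # Step: recompute each mean; stop when two successive means agree
        new_means = [sum(c) / len(c) for c in clusters]
        if new_means == means:
            break
        means = new_means
    return clusters, means

clusters, means = kmeans_1d([2, 3, 4, 10, 11, 12, 20, 25, 30], [3.0, 12.0])
print(clusters, means)
```

Tracing it by hand, the first cluster grows one point per pass (10, then 11, then 12) until the partition stabilises at {2, 3, 4, 10, 11, 12} and {20, 25, 30} with means 7 and 25.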

#Getting started with K-Means
dataset <- c(2, 3, 4, 10, 11, 12, 20, 25, 30)
dataset

cluster <- kmeans(dataset, centers = c(3, 20), iter.max = 10)


cluster$cluster
cluster$size

dataset[cluster$cluster==1]
dataset[cluster$cluster==2]
#Getting started with Regression Model
# The predictor vector.
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)

# The response vector.


y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Apply the lm() function.


relation <- lm(y~x)

# Find weight of a person with height 170.
a <- data.frame(x = 170)
result <- predict(relation, a)
print(result)
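lm(y~x) fits ordinary least squares, so the result can be checked by hand with the closed-form estimates b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² and b0 = ȳ − b1·x̄ (Python sketch of that arithmetic; variable names are our own):

```python
x = [151, 174, 138, 186, 128, 136, 179, 163, 152, 131]  # heights (predictor)
y = [63, 81, 56, 91, 47, 57, 76, 72, 62, 48]            # weights (response)

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
# Ordinary least-squares estimates for y = b0 + b1 * x
b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) \
     / sum((xi - mean_x) ** 2 for xi in x)
b0 = mean_y - b1 * mean_x

pred = b0 + b1 * 170  # predicted weight at height 170
print(round(pred, 2))
```

This reproduces the prediction that R's predict() returns for the same data, roughly 76.2.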
