Tutorial 7
DSA1101
Introduction to Data Science
October 19, 2018
Exercise 1. The Naïve Bayes Classifier
This week, we will look at the CSV dataset “[Link]”, which provides
information on the fate of passengers on the fatal maiden voyage of the ocean
liner Titanic and includes the variables economic status (class), sex, age and
survival. We will train a naïve Bayes classifier on this dataset and use it to
predict survival.
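For reference, the rule that the exercise builds up step by step can be written as follows. This is the standard naïve Bayes form, which assumes the features are conditionally independent given the class; the indexing X1 = class, X2 = sex, X3 = age is chosen here for illustration.

```latex
P(Y = y \mid X_1 = x_1, X_2 = x_2, X_3 = x_3)
  \;\propto\; P(Y = y) \prod_{i=1}^{3} P(X_i = x_i \mid Y = y),
  \qquad y \in \{0, 1\}
```

The predicted class is the value of y for which the right-hand side is larger.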
(a) Load the dataset “[Link]” which has been posted under the folder
for Tutorial 7.
    Titanic_dataset <- read.csv("Titanic.csv")
    dim(Titanic_dataset)    # number of rows and columns
    head(Titanic_dataset)   # first few rows
(b) Compute the prior probabilities P(Y = 1) (survived) and P(Y = 0) (did not
survive).
    tprior <- table(Titanic_dataset$Survived)   # counts per survival outcome
    tprior
    tprior <- tprior / sum(tprior)              # convert counts to proportions
    tprior
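If the CSV file is not at hand, the prior step can be tried on R's built-in `Titanic` contingency table instead. The sketch below assumes that the course file carries the same columns and labels as the built-in table (this is an assumption, not something the tutorial states); the name `passengers` is introduced here for illustration.

```r
# Assumption: the built-in `Titanic` table (datasets package) stands in for
# the course file Titanic.csv.
counts <- as.data.frame(Titanic)   # columns: Class, Sex, Age, Survived, Freq

# Expand the frequency table to one row per passenger.
passengers <- counts[rep(seq_len(nrow(counts)), counts$Freq),
                     c("Class", "Sex", "Age", "Survived")]
rownames(passengers) <- NULL

# Prior probabilities of the two survival outcomes.
prior <- prop.table(table(passengers$Survived))
prior
```

`prop.table` divides the counts by their total, which is the same operation as `tprior / sum(tprior)`.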
(c) Compute the conditional probabilities P(Xi = xi | Y = 1) and P(Xi =
xi | Y = 0), where i = 1, 2, 3, for the feature variables X = {class, sex, age}.
    # Rows are survival outcomes, columns are feature levels.
    # Dividing by the row sums gives P(feature level | survival outcome).
    classCounts <- table(Titanic_dataset[, c("Survived", "Class")])
    classCounts <- classCounts / rowSums(classCounts)
    classCounts

    genderCounts <- table(Titanic_dataset[, c("Survived", "Sex")])
    genderCounts <- genderCounts / rowSums(genderCounts)
    genderCounts

    ageCounts <- table(Titanic_dataset[, c("Survived", "Age")])
    ageCounts <- ageCounts / rowSums(ageCounts)
    ageCounts
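A quick sanity check for part (c) is that each row of a conditional table sums to 1. The sketch below shows this with `prop.table(..., margin = 1)`, again using the built-in `Titanic` table as a stand-in for the course file (an assumption); the variable names are illustrative.

```r
# Assumption: the built-in `Titanic` table stands in for Titanic.csv.
# Its dimensions are Class (1), Sex (2), Age (3), Survived (4).
survived_by_class <- margin.table(Titanic, c(4, 1))   # rows: Survived, cols: Class

# margin = 1 normalises each row, giving P(Class = x | Survived = y).
class_given_survived <- prop.table(survived_by_class, margin = 1)
class_given_survived

# Each row is a probability distribution over the classes.
rowSums(class_given_survived)
```

Normalising over rows rather than columns matters here: column-wise normalisation would give P(Survived | Class), which is not the quantity the naïve Bayes product needs.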
(d) Predict survival for an adult female passenger in a 2nd class cabin.
    # Unnormalised score for "survived": likelihoods times prior
    prob_survived <-
      classCounts["Yes", "2nd"] *
      genderCounts["Yes", "Female"] *
      ageCounts["Yes", "Adult"] *
      tprior["Yes"]

    # Unnormalised score for "did not survive"
    prob_not_survived <-
      classCounts["No", "2nd"] *
      genderCounts["No", "Female"] *
      ageCounts["No", "Adult"] *
      tprior["No"]

    # The larger score gives the predicted class
    prob_survived
    prob_not_survived
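The two scores in part (d) do not sum to 1 because the common denominator P(X = x) has been dropped. Dividing each score by their total recovers posterior probabilities. The sketch below illustrates this on the built-in `Titanic` table, assumed here to match the course file; the helper `cond` is hypothetical and introduced only for this example.

```r
# Assumption: the built-in `Titanic` table stands in for Titanic.csv.
# Dimensions: Class (1), Sex (2), Age (3), Survived (4).
cond <- function(feature_dim) {
  # P(feature | Survived): rows are Survived, columns are feature levels.
  prop.table(margin.table(Titanic, c(4, feature_dim)), margin = 1)
}
prior <- prop.table(margin.table(Titanic, 4))

# Unnormalised naive Bayes scores for (2nd class, Female, Adult).
score <- sapply(c("No", "Yes"), function(y) {
  prior[[y]] * cond(1)[y, "2nd"] * cond(2)[y, "Female"] * cond(3)[y, "Adult"]
})

# Dividing by the total turns the scores into posterior probabilities.
posterior <- score / sum(score)
posterior
names(which.max(posterior))   # predicted class label
```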
(e) Compare your prediction in (d) with the one produced by the naiveBayes
function in the package ‘e1071’.
    library(e1071)

    # Fit naive Bayes with Survived as the response and all other columns as features
    model <- naiveBayes(Survived ~ ., Titanic_dataset)

    # New passenger: adult female in 2nd class
    test <- data.frame(Class = "2nd", Sex = "Female", Age = "Adult")

    results <- predict(model, test)                 # predicted class label
    results
    results <- predict(model, test, type = "raw")   # posterior probabilities
    results

    # ratio of probability scores
    prob_survived / prob_not_survived
    # ratio of actual probabilities
    results[1, "Yes"] / results[1, "No"]
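The two ratios computed at the end of part (e) should agree up to rounding, provided naiveBayes is run with its default settings (no Laplace smoothing). The reason is that the normalising constant P(x) is the same for both classes and cancels in the ratio, as the following short derivation shows, with x denoting the observed feature values:

```latex
\frac{P(Y = 1 \mid x)}{P(Y = 0 \mid x)}
  = \frac{P(Y = 1) \prod_{i=1}^{3} P(X_i = x_i \mid Y = 1) \,/\, P(x)}
         {P(Y = 0) \prod_{i=1}^{3} P(X_i = x_i \mid Y = 0) \,/\, P(x)}
  = \frac{P(Y = 1) \prod_{i=1}^{3} P(X_i = x_i \mid Y = 1)}
         {P(Y = 0) \prod_{i=1}^{3} P(X_i = x_i \mid Y = 0)}
```

The right-hand side is exactly the ratio of the unnormalised scores from part (d).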