Assignment #2
Introduction to classification (part 1)
Course Title: Data Mining
Instructor: Dr. Amor Messaoud
Questions
1. Why is naïve Bayesian classification called “naïve”? Briefly outline the major ideas of
naïve Bayesian classification.
Questions
Use the three-class confusion matrix below to answer questions 1 through 3.
1. What percent of the instances were correctly classified?
2. How many class 2 instances are in the dataset?
3. How many instances were incorrectly classified with class 3?
4. Sometimes a data set is partitioned such that a validation set is provided. What is the
purpose of the validation set?
5. If we build a classifier and evaluate it on both the training set and the test set:
a. Which data set would we expect to have the higher accuracy: the training set or the
test set?
b. Which data set provides the better accuracy estimate on new data: the training set or
the test set?
6. Consider the one-dimensional data shown in the following table. Classify the data
point x = 5.0 according to its 1-, 3-, and 5-nearest neighbors (using majority vote).
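For questions 1 through 3 above, the quantities can be read off a confusion matrix mechanically. The sketch below assumes that rows hold the actual classes and columns hold the predicted classes (a common convention, but check the one used in the course). The 3x3 matrix is made up for illustration and is not the matrix from this assignment; the function names are likewise hypothetical.

```python
# Hypothetical 3-class confusion matrix (NOT the one from the assignment).
# Assumed convention: rows = actual class, columns = predicted class.
matrix = [
    [10, 2, 1],
    [3, 8, 2],
    [0, 4, 12],
]

def accuracy(cm):
    """Fraction of instances on the diagonal (correctly classified)."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

def actual_count(cm, k):
    """Number of instances whose actual class is k (0-based row sum)."""
    return sum(cm[k])

def wrongly_predicted_as(cm, k):
    """Instances predicted as class k whose actual class is different."""
    return sum(cm[i][k] for i in range(len(cm)) if i != k)

print(f"accuracy: {accuracy(matrix):.1%}")
print("actual class 2 instances:", actual_count(matrix, 1))
print("wrongly predicted as class 3:", wrongly_predicted_as(matrix, 2))
```

If the course's matrix uses the opposite orientation (rows = predicted), swap the row and column sums accordingly.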
ASSIGNMENT #1 (FEBRUARY 2019) 1
Problem #1
Consider the following dataset of a credit card promotion database. The credit card
company has authorized a new life insurance promotion similar to the existing one. We are
interested in building a classification data mining model for deciding whether to send the
customer promotional material.
1. Build a Naive Bayes classifier for this dataset by filling in the following tables with
counts and probabilities.
                          Life insurance promotion
                          Y          N
Magazine promotion    Y
                      N

                          Life insurance promotion
                          Y          N
Watch promotion       Y
                      N

                          Life insurance promotion
                          Y          N
Credit card insurance Y
                      N
                          Life insurance promotion
                          Y          N
Sex                   M
                      F
2. Use the Naive Bayes classifier obtained in question 1 to determine the value of Life
Insurance Promotion for the following instance:
Magazine Promotion = Y; Watch Promotion = Y; Credit Card Insurance = N; Sex = F;
Life Insurance Promotion = ?
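The scoring step in question 2 multiplies the class prior by one conditional probability per attribute and then compares the classes. A minimal sketch of that calculation follows, assuming plain relative-frequency estimates with no smoothing. The counts, the two-attribute subset, and the name `naive_bayes_posterior` are hypothetical and are not taken from the assignment's dataset.

```python
# Hypothetical counts for illustration only (NOT from the assignment's dataset).
# counts[attribute][value][cls] = training rows with that attribute value and class.
class_counts = {"Y": 6, "N": 4}
counts = {
    "magazine": {"Y": {"Y": 4, "N": 2}, "N": {"Y": 2, "N": 2}},
    "sex": {"M": {"Y": 2, "N": 3}, "F": {"Y": 4, "N": 1}},
}

def naive_bayes_posterior(instance, counts, class_counts):
    """Normalized P(class | instance) under the conditional independence assumption."""
    total = sum(class_counts.values())
    scores = {}
    for cls, n_cls in class_counts.items():
        score = n_cls / total  # prior P(class)
        for attr, value in instance.items():
            score *= counts[attr][value][cls] / n_cls  # P(value | class)
        scores[cls] = score
    norm = sum(scores.values())
    return {cls: s / norm for cls, s in scores.items()}

posterior = naive_bayes_posterior({"magazine": "Y", "sex": "F"}, counts, class_counts)
prediction = max(posterior, key=posterior.get)
print(posterior, prediction)
```

A zero count for any attribute value wipes out that class's score entirely, which is why a smoothing correction is often added in practice; this sketch leaves it out to match the count-and-probability tables above.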
Problem #2
Consider the set of training examples in the diagram below. A plus indicates a positive
example and a star indicates a negative example. Use the Euclidean distance to answer the
following questions:
1. How will the point (8, 1) be classified by the 1-nearest neighbor classifier?
2. How will the point (8, 8) be classified by the 3-nearest neighbors?
Problem #3
Lisa has lost the gender information for one of her customers and does not know whether to
make a skirt or trousers. She is planning to flip a coin. Can you help her make a better
decision using a KNN classifier (K = 3)? Use the Euclidean distance. The customer with the
missing gender information is marked with "?" in the table below:
Gender   Waist   Hip
?        28      34
Male     28      32
Male     33      35
Female   27      33
Female   31      36
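The KNN procedure for this problem can be sketched in a few lines: compute the Euclidean distance from the unknown customer to each known customer, keep the K closest, and take a majority vote. The sketch below uses the four labeled rows from the table above. The function name `knn_predict` is hypothetical, `math.dist` requires Python 3.8 or later, and no feature scaling is applied because waist and hip are assumed to share the same units.

```python
import math
from collections import Counter

# Labeled records from the table above: (gender, waist, hip).
training = [
    ("Male", 28, 32),
    ("Male", 33, 35),
    ("Female", 27, 33),
    ("Female", 31, 36),
]

def knn_predict(query, records, k):
    """Majority vote among the k records closest to query (Euclidean distance)."""
    by_distance = sorted(records, key=lambda r: math.dist(query, r[1:]))
    votes = Counter(label for label, *_ in by_distance[:k])
    return votes.most_common(1)[0][0]

# The customer with unknown gender: waist 28, hip 34.
print(knn_predict((28, 34), training, k=3))
```

An odd K with two classes avoids tied votes, which is one reason K = 3 is a convenient choice here.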
Problem #4 (Larose and Larose, 2015, p. 312)
The following table contains a small data set of 10 records excerpted from the ClassifyRisk
data set, with predictors age, marital status, and income, and target variable risk.
1. Using R, find the k nearest neighbors of Record #10, using k = 3.
2. Using the ClassifyRisk data set with predictors age, marital status, and income, and
target variable risk, find the k nearest neighbors of Record #1, using k = 2 and
Euclidean distance.
3. Using the ClassifyRisk data set with predictors age, marital status, and income, and
target variable risk, find the k nearest neighbors of Record #1, using k = 2 and
Minkowski distance.
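Questions 2 and 3 differ only in the distance function. The Minkowski distance of order p generalizes both the Manhattan distance (p = 1) and the Euclidean distance (p = 2). The sketch below is written in Python rather than R purely for illustration. The two points are hypothetical and are not ClassifyRisk records, and it assumes the predictors have already been made numeric and put on comparable scales.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length numeric vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

# Hypothetical points for illustration (NOT records from ClassifyRisk).
u = (0.0, 0.0)
v = (3.0, 4.0)

print(minkowski(u, v, p=1))  # order 1: Manhattan distance
print(minkowski(u, v, p=2))  # order 2: Euclidean distance
```

Because income and age are measured on very different scales, an unscaled distance would be dominated by income, so some normalization is usually applied before ranking neighbors.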