Data Mining Algorithms - Exam 22/23

Data Mining and Big Data Analytics 2022/23

CSI-6-DMA Semester 1

Question 1

Choose the best answer to each of the following questions (1.5 marks each):

1.1. Given a two-class classification problem, which of the following models is the
worst one?
(a) A model that has 100% false positive rate and 0% true positive rate.
(b) A model that has 100% false positive rate and 100% true positive rate.
(c) A model that has 0% false positive rate and 100% true positive rate.
(d) A model that has 0% false positive rate and 0% true positive rate.

1.2. For a given association rule, moving an item from the consequent of the rule to
the antecedent of the rule never changes the __________ of the association rule.
(a) support, confidence, and lift
(b) support and lift
(c) confidence and lift
(d) support

1.3. Given three itemsets 𝑋, 𝑌, and 𝑍, where 𝑋 ⊂ 𝑌 ⊂ 𝑍, if 𝑌 is frequent, then
__________.
(a) 𝑋 is either frequent or infrequent
(b) both 𝑋 and 𝑍 are infrequent
(c) 𝑍 is either frequent or infrequent
(d) both 𝑋 and 𝑍 are frequent

1.4. In a confusion matrix for a classifier, the sum of all the diagonal elements in the
matrix is the total number of the __________ samples that have been classified
__________ by the classifier.
(a) training, incorrectly
(b) testing, incorrectly
(c) training, correctly
(d) testing, correctly

1.5. The k-means clustering algorithm can be used for which of the following tasks?
(a) Anomaly detection.
(b) Outlier detection.
(c) Partition a sample space into several non-overlapping segments.
(d) Unsupervised classification.
(e) All of these.


1.6. Suppose a binary decision tree has 𝑚 nodes excluding all the leaf nodes, where
𝑚 is an integer number and 𝑚 ≥ 1. Then the decision tree forms __________
mutually exclusive partitions in the sample space.
(a) 𝑚 + 1
(b) m
(c) 𝑚/2
(d) 𝑚²

1.7. If a data mining algorithm continues to reduce the error on __________ set at a
cost of an increased error on __________, then model over-fitting happens.
(a) training, training
(b) training, testing
(c) testing, training
(d) testing, testing

1.8. Suppose that a transactional database has 𝑚 distinct items, where 𝑚 is an integer
number, then the total number of the itemsets that can be extracted from the
database is __________.
(a) 𝑚
(b) log₂ 𝑚
(c) 2^𝑚 − 1
(d) 𝑚²

1.9. In __________ modelling, a given data set is usually divided into __________.
(a) descriptive, validation and test subsets
(b) predictive, testing and validation subsets
(c) descriptive, training and testing subsets
(d) predictive, training and testing subsets

1.10. Which of the following statements is true in the context of data mining?
(a) A linear regression model can be represented in the form of a decision tree.
(b) An association rule doesn’t represent a causal relationship between items.
(c) The output of a logistic regression model indicates the likelihood
(probability) of a sample to be classified into a class.
(d) Clusters created by the k-means algorithm can be represented in the form of a
binary decision tree.
(e) All of these.

Total: 15 Marks
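
The rule measures named in question 1.2 can be checked numerically from their usual definitions. The sketch below is only an illustration: the transactions are invented, and the helper names `support` and `rule_measures` are hypothetical, not taken from the paper.

```python
from fractions import Fraction

# Hypothetical toy transactions, invented purely for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
    {"bread", "butter", "milk"},
]


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))


def rule_measures(antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    s = support(antecedent | consequent)
    confidence = s / support(antecedent)
    lift = confidence / support(consequent)
    return s, confidence, lift


# Same items overall, but "butter" moves from the consequent to the antecedent.
before = rule_measures({"bread"}, {"butter", "milk"})
after = rule_measures({"bread", "butter"}, {"milk"})
print("before:", before)
print("after: ", after)
```

Comparing the two printed tuples shows which of the three measures depends only on the union of the items in the rule and which depend on how the items are split between the two sides.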


Question 2

A fraud warning system has been developed by an insurance company to identify any
fraudulent insurance claims with a reasonably low false-alarm rate. Two models, M1
and M2, have been constructed for the system. Each model classifies a claim as either
True class or False class. The cost matrix used in the classifier design is shown
below, and the test results of the two models are given in the following confusion
matrices:

Cost matrix for the classifier design

                      Predicted Class
                      True     False
Actual Class  True     -1        1
              False     4        0

Confusion matrix for the two classification models

Model M1
                      Predicted Class
                      True     False
Actual Class  True     30        0
              False    10       10

Model M2
                      Predicted Class
                      True     False
Actual Class  True     15       15
              False     0       20

(a) What are the accuracy and cost of each of the two classifiers? You must show
clearly how you get your answer. You may leave your answers in the form of
fractions if you wish.
(12 marks)

(b) Which model has a lower false-alarm rate, i.e., the rate at which a true claim is
classified as False? You must show clearly how you get your answer. You may leave your
answers in the form of fractions if you wish.
(6 marks)

(c) Which model would you choose for the classification problem? Give reasons to
justify your choice.
(12 marks)

Total: 30 Marks
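
One way to set up the arithmetic for this question is sketched below. It assumes that the cost of a model is the sum, over the four cells, of the count in the confusion matrix multiplied by the matching entry of the cost matrix, and it reads "false-alarm rate" as the paper defines it in part (b). The function names are hypothetical.

```python
from fractions import Fraction

# Rows are the actual class (True, False); columns are the predicted class (True, False).
COST = [[-1, 1],
        [4, 0]]
CONFUSION = {
    "M1": [[30, 0], [10, 10]],
    "M2": [[15, 15], [0, 20]],
}


def accuracy(cm):
    """Correctly classified samples (the diagonal) over all samples."""
    correct = cm[0][0] + cm[1][1]
    total = sum(sum(row) for row in cm)
    return Fraction(correct, total)


def total_cost(cm, cost):
    """Assumed definition: sum of count * cost over every cell."""
    return sum(cm[i][j] * cost[i][j] for i in range(2) for j in range(2))


def rate_true_as_false(cm):
    """Share of actual-True claims predicted False (the paper's 'false-alarm rate')."""
    return Fraction(cm[0][1], cm[0][0] + cm[0][1])


for name, cm in CONFUSION.items():
    print(name, accuracy(cm), total_cost(cm, COST), rate_true_as_false(cm))
```

Using `Fraction` keeps the results in the fractional form that the question allows. If an average cost per claim is wanted instead of a total, divide `total_cost` by the number of samples.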


Question 3

(a) Consider the following data types:


a. Ordinal and binary.
b. Interval and discrete.
c. Ratio and discrete.
d. Ratio and continuous.

Give one variable as an example for each of these data types. Your answer should
include some possible values that each variable can take on.
(12 marks)

(b) Consider a dataset about road accidents in the London Borough of Southwark. The
variables of the dataset are shown below. Discuss what data pre-processing tasks may
need to be undertaken, and explain why, if the k-means clustering algorithm is to be
applied to group the accidents into meaningful segments (clusters).

Variable Name | Variable Description                 | Data Type | Value range if numeric variable, or distinct values if categorical variable
ACC_ID        | Accident ID                          | Nominal   | Sequential integer number
S_LEVEL       | Level of accident severity           | Ordinal   | Fatal, Serious, Light
J_CTRL        | Type of junction control             | Nominal   | Authorised person, Auto traffic signal, Stop sign
CASU          | Number of casualties in an incident  | Ratio     | 0 – 40
COST          | Cost of an accident (£)              | Ratio     | 100.00 – 5,000,000.00

(18 marks)

Total: 30 Marks
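
The kind of pre-processing that k-means typically needs can be sketched as follows. This is one possible approach, not a model answer: the three records are invented, the severity ranking and the log transform of COST are assumptions, and the helpers `min_max` and `preprocess` are hypothetical names. Only the variable names come from the table above.

```python
import math

# Hypothetical records in the shape of the table above; the values are invented.
records = [
    {"ACC_ID": 1, "S_LEVEL": "Light", "J_CTRL": "Stop sign",
     "CASU": 1, "COST": 1500.0},
    {"ACC_ID": 2, "S_LEVEL": "Fatal", "J_CTRL": "Auto traffic signal",
     "CASU": 12, "COST": 2500000.0},
    {"ACC_ID": 3, "S_LEVEL": "Serious", "J_CTRL": "Authorised person",
     "CASU": 4, "COST": 80000.0},
]

# Assumed numeric ranking for the ordinal variable.
SEVERITY_RANK = {"Light": 0, "Serious": 1, "Fatal": 2}
JUNCTION_TYPES = ["Authorised person", "Auto traffic signal", "Stop sign"]


def min_max(values):
    """Rescale values to [0, 1] so that no variable dominates the distance."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]


def preprocess(rows):
    """Build one numeric feature vector per accident; ACC_ID is dropped."""
    severity = min_max([SEVERITY_RANK[r["S_LEVEL"]] for r in rows])
    casualties = min_max([r["CASU"] for r in rows])
    # Log transform (an assumption) to compress the very wide COST range.
    cost = min_max([math.log10(r["COST"]) for r in rows])
    features = []
    for i, r in enumerate(rows):
        # One-hot encode the nominal junction-control variable.
        one_hot = [1.0 if r["J_CTRL"] == j else 0.0 for j in JUNCTION_TYPES]
        features.append([severity[i], casualties[i], cost[i]] + one_hot)
    return features


features = preprocess(records)
for row in features:
    print([round(x, 3) for x in row])
```

The identifier is left out because it carries no information about similarity, and every remaining variable ends up on a comparable numeric scale, which matters because k-means relies on Euclidean distance.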


Question 4

(a) Name any four methods of outlier detection. Choose one of the named methods to
explain how it works with an appropriate example. Your answer must state clearly
to which type of data each named method is applicable.
(15 marks)

(b) Write brief notes to discuss how to choose a proper minimum support threshold in
association rule analysis.
(10 marks)

Total: 25 Marks
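
As an illustration of one commonly taught outlier-detection method, the sketch below applies the interquartile-range (IQR) rule to a single numeric variable. The data values are invented, the multiplier 1.5 is the conventional choice rather than anything stated in the paper, and the function name `iqr_outliers` is hypothetical.

```python
import statistics

# Hypothetical casualty counts, invented for illustration.
casualties = [2, 3, 3, 4, 4, 5, 5, 6, 40]


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


outliers = iqr_outliers(casualties)
print(outliers)
```

The rule applies to univariate numeric (interval or ratio) data. It does not assume a normal distribution, which is one reason it is often preferred over a z-score threshold for skewed variables.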

END OF PAPER
