NATIONAL INSTITUTE OF BUSINESS MANAGEMENT
HIGHER NATIONAL DIPLOMA IN SOFTWARE ENGINEERING |
HIGHER NATIONAL DIPLOMA IN INFORMATION SYSTEMS
Data Warehousing and Business Intelligence – Data Preprocessing
Question 1
The information gathered on the creditworthiness of individuals in a town is given in
the table below. You are required to build a classification model based on this data.
id       age  sex     region      income  married  No of children  mortgage
ID12101  48   FEMALE  INNER_CITY  25500   NO       1               NO
ID12102  40   MALE    TOWN        43000   YES      3               YES
ID12103  51   FEMALE  INNER_CITY  22500   YES      0               NO
ID12104  23   FEMALE  TOWN        20000   YES      3               NO
ID12105  57   FEMALE  RURAL       50000   ?        0               NO
ID12106  45   FEMALE  TOWN        47000   YES      2               NO
ID12107  32   MALE    RURAL       21000   NO       0               NO
ID12108  58   MALE    ?           45000   YES      0               NO
ID12109  37   FEMALE  SUBURBAN    35000   YES      2               NO
ID12110  54   MALE    TOWN        24000   YES      2               NO
ID12111  ?    FEMALE  TOWN        59000   YES      0               NO
ID12112  52   FEMALE  INNER_CITY  56000   NO       0               YES
ID12113  44   FEMALE  TOWN        70000   YES      1               YES
?        66   FEMALE  TOWN        55000   YES      1               YES
ID12115  44   MALE    TOWN        45000   YES      2               NO
ID12116  62   FEMALE  TOWN        36000   YES      1               NO
ID12117  26   MALE    INNER_CITY  33000   YES      1               NO
ID12118  43   FEMALE  TOWN        50000   NO       2               YES
ID12119  42   FEMALE  TOWN        25000   YES      1               NO
Using the dataset given in the table, answer the following questions.
(i) Illustrate two methods to handle each of the missing values (indicated by “?”)
in the dataset (without removing the tuples with missing values).
(ii) Assume the missing value of the age column is 62,
(a) Partition the age into 3 equal frequency bins and smooth it by bin means.
(b) Partition the age into 3 equal width bins and smooth it by bin boundaries.
(iii) If the income is to be normalized between 0 and 1, write the equation for
min-max normalization and calculate the normalized values for the income values
corresponding to the records identified by ID12108 and ID12117.
(iv) Calculate the Z-score values for the IDs ID12115 and ID12119 based on the
age feature.
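The two rescaling techniques named in parts (iii) and (iv) can be sketched in Python as below. This is only an illustrative sketch: the function names `min_max` and `z_score` are hypothetical, and the z-score uses the population standard deviation (`statistics.pstdev`), which is an assumption; if your course notes define the z-score with the sample standard deviation, use `statistics.stdev` instead.

```python
from statistics import mean, pstdev


def min_max(value, lo, hi, new_lo=0.0, new_hi=1.0):
    """Rescale value from the range [lo, hi] to [new_lo, new_hi]."""
    return (value - lo) / (hi - lo) * (new_hi - new_lo) + new_lo


def z_score(value, values):
    """Standardise value using the mean and population std of values.

    Assumption: population standard deviation (divide by n), not sample.
    """
    return (value - mean(values)) / pstdev(values)


# Income column of the Question 1 table (this column has no missing values).
incomes = [25500, 43000, 22500, 20000, 50000, 47000, 21000, 45000, 35000,
           24000, 59000, 56000, 70000, 55000, 45000, 36000, 33000, 50000,
           25000]
lo, hi = min(incomes), max(incomes)

# Incomes of the records ID12108 and ID12117 respectively.
print(min_max(45000, lo, hi))
print(min_max(33000, lo, hi))
```

For the z-score, the age column must first have its missing value filled in (for example with the value assumed in part (ii)) before passing the list of ages to `z_score`.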
Question 2
empid  age  sex     emp_rank  income  married  Land_owner
E1001  25   MALE    2         65000   NO       YES
E1002  30   FEMALE  1         80000   YES      YES
E1003  50   MALE    2         90000   YES      YES
E1004  28   MALE    2         60000   YES      YES
E1005  45   MALE    3         30000   NO       YES
E1006  24   FEMALE  3         40000   NO       NO
E1007  31   FEMALE  1         70000   YES      YES
E1008  52   MALE    3         35000   YES      NO
E1009  33   FEMALE  1         105000  YES      YES
E1010  42   MALE    1         85000   YES      YES
E1011  26   FEMALE  2         65000   NO       YES
E1012  40   FEMALE  1         75000   NO       NO
a. Partition the income into 4 equal frequency bins and smooth it by bin
median.
b. Partition the age into 3 equal width bins and smooth it by bin boundaries.
c. If the income is to be normalized between 0 and 1, calculate the min-max
normalized values for the income values corresponding to the following
employee IDs.
i. E1003
ii. E1008
iii. E1009
iv. E1010
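The binning steps in parts a and b can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a prescribed solution: all function names are hypothetical, "bin boundaries" is taken to mean the smallest and largest value actually present in each bin, ties between the two boundaries go to the lower one, and the equal-width helper assumes the values are not all identical.

```python
from statistics import median


def equal_frequency_bins(values, k):
    """Sort values and split them into k bins of (nearly) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    # Bin i covers sorted positions [i*n//k, (i+1)*n//k).
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]


def smooth_by_median(bins):
    """Replace every value in a bin with that bin's median."""
    return [[median(b)] * len(b) for b in bins]


def equal_width_bins(values, k):
    """Split the range [min, max] into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k  # assumes hi > lo
    bins = [[] for _ in range(k)]
    for v in sorted(values):
        # The maximum value is placed in the last bin.
        idx = min(int((v - lo) / width), k - 1)
        bins[idx].append(v)
    return bins


def smooth_by_boundaries(bins):
    """Replace each value with the nearer of its bin's min and max.

    Assumption: a tie goes to the lower boundary.
    """
    smoothed = []
    for b in bins:
        if not b:
            smoothed.append([])
            continue
        lo, hi = b[0], b[-1]  # each bin is already sorted
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


# Columns of the Question 2 table, in row order E1001..E1012.
incomes = [65000, 80000, 90000, 60000, 30000, 40000,
           70000, 35000, 105000, 85000, 65000, 75000]
ages = [25, 30, 50, 28, 45, 24, 31, 52, 33, 42, 26, 40]

income_bins = equal_frequency_bins(incomes, 4)
age_bins = equal_width_bins(ages, 3)
print(smooth_by_median(income_bins))
print(smooth_by_boundaries(age_bins))
```

The helpers return the smoothed values in sorted order, grouped by bin; mapping them back to individual employee IDs is left out to keep the sketch short.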