Data Preprocessing - Unit 2 Chapter 3 Questions
Theory Questions
1. Differentiate between data cleaning, data integration, data reduction, and data transformation with suitable examples.
2. List and briefly explain the six key elements of data quality.
3. Explain the need for data preprocessing in real-world data mining applications.
4. Differentiate between dimensionality reduction and numerosity reduction.
5. List and describe the different methods for handling missing values during data cleaning.
6. Explain the concept of normalization. What are the commonly used normalization techniques?
7. Explain the steps involved in data integration. How does it help avoid redundancies and inconsistencies?
8. Describe the different strategies for data transformation with examples (e.g., smoothing, aggregation).
9. Explain the process of data discretization and concept hierarchy generation with examples.
10. Differentiate between supervised and unsupervised discretization, and between top-down and bottom-up approaches.
Problem-Based Questions
1. A dataset contains age values: 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35,
35, 36, 40, 45, 46, 52, 70.
(a) Smooth the data using smoothing by bin means with a bin size (depth) of 3, i.e., three values per bin.
(b) Comment on the effect of smoothing.
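One way to check the arithmetic for Problem 1 is the minimal Python sketch below. It assumes that "bin size 3" means equal-depth bins of three consecutive sorted values each; the function name smooth_by_bin_means is illustrative and not part of the question.

```python
# Age values from Problem 1 (27 values, already sorted).
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

def smooth_by_bin_means(values, depth):
    """Partition the sorted values into bins of `depth` items and
    replace every value with the mean of its bin."""
    data = sorted(values)
    smoothed = []
    for start in range(0, len(data), depth):
        bin_values = data[start:start + depth]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed

# Assumption: depth=3 gives nine bins of three values each.
smoothed_ages = smooth_by_bin_means(ages, depth=3)

for i in range(0, len(ages), 3):
    print(ages[i:i + 3], "->", round(smoothed_ages[i], 2))
```

Comparing each original bin with its mean shows how local fluctuations are flattened, and how the outlier 70 pulls the mean of the last bin upward.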
2. Normalize the values 200, 300, 400, 600, 1000 using:
(a) Min-max normalization with range [0,1]
(b) Z-score normalization
(c) Decimal scaling normalization.
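The three techniques in Problem 2 can be sketched in Python as follows. This is a minimal illustration, not a reference solution: the z-score uses the population standard deviation (statistics.pstdev), which is an assumption, since using the sample standard deviation instead would give different values. All variable names are illustrative.

```python
import statistics

values = [200, 300, 400, 600, 1000]

# (a) Min-max normalization to the range [new_min, new_max].
new_min, new_max = 0.0, 1.0
lo, hi = min(values), max(values)
min_max = [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
           for v in values]

# (b) Z-score normalization.
# Assumption: population standard deviation (divide by n, not n - 1).
mean = statistics.mean(values)
std = statistics.pstdev(values)
z_scores = [(v - mean) / std for v in values]

# (c) Decimal scaling: divide by 10**j, where j is the smallest
# integer such that the largest absolute scaled value is below 1.
max_abs = max(abs(v) for v in values)
j = 0
while max_abs / 10 ** j >= 1:
    j += 1
decimal_scaled = [v / 10 ** j for v in values]

print("min-max:        ", min_max)
print("z-score:        ", [round(z, 4) for z in z_scores])
print("decimal scaling:", decimal_scaled, "with j =", j)
```

Because the maximum value is exactly 1000, dividing by 1000 would give 1.0, which is not strictly below 1, so the loop moves on to j = 4.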
3. Use min-max normalization to transform the value 35 from a dataset where min = 13 and max = 70.
4. Given a dataset with the attributes age and body fat:
(a) Perform Z-score normalization on both attributes.
(b) Compute the correlation coefficient and state whether the two attributes are positively or negatively correlated.
5. A sales dataset has values: 5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215.
(a) Apply equal-width and equal-frequency binning.
(b) Comment on the advantages of each.
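The two binning schemes in Problem 5 can be compared with the minimal Python sketch below. The question does not state the number of bins, so k = 3 is an assumption made purely for illustration; the function names are also illustrative.

```python
sales = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]

def equal_width_bins(values, k):
    """Split the value range [min, max] into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in sorted(values):
        # The maximum value falls on the upper edge, so clamp it
        # into the last bin.
        idx = min(int((v - lo) / width), k - 1)
        bins[idx].append(v)
    return bins

def equal_frequency_bins(values, k):
    """Split the sorted values into k bins holding (nearly) equal counts."""
    data = sorted(values)
    n = len(data)
    bins, start = [], 0
    for i in range(k):
        # Spread any remainder over the first n % k bins.
        size = n // k + (1 if i < n % k else 0)
        bins.append(data[start:start + size])
        start += size
    return bins

k = 3  # Assumption: the bin count is not given in the question.
width_bins = equal_width_bins(sales, k)
freq_bins = equal_frequency_bins(sales, k)

print("equal-width:    ", width_bins)
print("equal-frequency:", freq_bins)
```

With k = 3 the equal-width intervals are each 70 units wide, so most values crowd into the first bin while the large values 204 and 215 sit alone in the last. Equal-frequency binning puts four values in every bin regardless of how the values are spread.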