28 Questions – Data Preprocessing & Normal Distribution (6-Mark Answers)
1. What are missing values, and how do they occur?
Missing values are data points that are not stored or recorded. They may occur due to
sensor errors, human mistakes, data corruption, or skipped survey questions. Missing
values can reduce the quality of analysis, cause biases, and must be handled carefully before
modeling.
2. Explain different types of missing data (MCAR, MAR, MNAR).
MCAR (Missing Completely at Random): Missingness is unrelated to any variable. MAR
(Missing At Random): Related to observed variables but not the missing value itself. MNAR
(Missing Not At Random): Related to the value that is missing. Type determines how to
handle missingness.
3. List and explain methods to handle missing values.
1. Remove rows or columns with too many missing values. 2. Imputation: Fill with mean,
median, mode. 3. Use forward/backward fill. 4. Use algorithms that support missing values
(like XGBoost). 5. Predictive imputation using regression or ML models.
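As a rough sketch of options 1–3 above in plain Python (the column of values below is made up for illustration, with `None` marking missing entries):

```python
import statistics

# Hypothetical column with missing values marked as None; 200 is an outlier.
values = [10, 12, None, 14, None, 200]

observed = [v for v in values if v is not None]

# 1. Deletion: simply drop the missing entries.
dropped = observed

# 2. Mean imputation (the mean is pulled upward by the outlier 200).
mean_filled = [statistics.mean(observed) if v is None else v for v in values]

# 3. Median imputation (robust to the outlier).
median_filled = [statistics.median(observed) if v is None else v for v in values]

# 4. Forward fill: carry the last observed value forward.
ffilled, last = [], None
for v in values:
    last = v if v is not None else last
    ffilled.append(last)
```

Note how the mean fill (59) is dragged far above the typical value by the single outlier, while the median fill (13) is not, which motivates the comparison in the next question.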
4. Compare mean, median, and mode imputation.
Mean: Good for normal data, affected by outliers. Median: Better for skewed data, not
affected by outliers. Mode: Used for categorical variables. Choice depends on data type and
distribution.
5. What are the disadvantages of deleting missing data?
Deleting rows with missing data can reduce the dataset size, leading to loss of valuable
information and potentially biased models. It is acceptable only when missing values are
few and truly random.
6. What is predictive imputation?
Predictive imputation uses models like regression, KNN, or decision trees to estimate
missing values using other features. It can provide accurate results but adds complexity and
may cause overfitting.
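A minimal sketch of regression-based imputation, using a toy fully-observed feature `x` to predict a missing value in `y` (both columns are hypothetical, and the least-squares fit is written out by hand to stay dependency-free):

```python
# Hypothetical data: 'y' has one missing entry (None); 'x' is fully observed.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, None, 8.1, 9.9]

# Fit simple linear regression y ≈ a*x + b on the complete pairs only.
pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
n = len(pairs)
mx = sum(xi for xi, _ in pairs) / n
my = sum(yi for _, yi in pairs) / n
a = sum((xi - mx) * (yi - my) for xi, yi in pairs) / sum(
    (xi - mx) ** 2 for xi, _ in pairs
)
b = my - a * mx

# Impute each missing y with the regression prediction at its x.
y_imputed = [a * xi + b if yi is None else yi for xi, yi in zip(x, y)]
```

The same idea generalizes to KNN or tree-based imputers, which predict the missing feature from all other observed features rather than a single column.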
7. How does missing data affect model performance?
Missing data can reduce training data size, distort patterns, and create biases in models. It
may lead to incorrect predictions or errors if not handled properly before model training.
8. What is data scaling, and why is it important?
Data scaling standardizes features to a common scale, especially important for algorithms
like KNN, SVM, and gradient descent. Without scaling, features with large ranges dominate
others.
9. Explain Min-Max Scaling with a formula and example.
Formula: X_scaled = (X - Xmin) / (Xmax - Xmin). Example: If X=40, Xmin=20, Xmax=60 →
Scaled = (40 - 20)/(60 - 20) = 0.5. It converts values to the range [0, 1].
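The formula translates directly into a one-line function; the call below reproduces the worked example:

```python
def min_max_scale(x, x_min, x_max):
    """Scale x into [0, 1] relative to the observed minimum and maximum."""
    return (x - x_min) / (x_max - x_min)

min_max_scale(40, 20, 60)  # → 0.5, matching the worked example above
```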
10. What is Z-score standardization? When is it used?
Z = (X – μ) / σ. It transforms data to have mean = 0 and SD = 1. Used when data is normally
distributed. It's helpful in outlier detection and algorithms requiring standardized input.
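A quick sketch with the standard library on a made-up list of scores, confirming that the standardized data has mean 0 and SD 1:

```python
import statistics

data = [50, 60, 70, 80, 90]  # hypothetical scores

mu = statistics.mean(data)       # 70
sigma = statistics.pstdev(data)  # population standard deviation

# Z = (X - mu) / sigma for each value.
z_scores = [(x - mu) / sigma for x in data]
# After standardization the mean is 0 and the standard deviation is 1.
```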
11. Compare Min-Max Scaling and Standardization.
Min-Max rescales values into [0, 1] and is highly sensitive to outliers, since a single
extreme value stretches the range. Z-score standardization centers data at mean 0 with
SD 1 and allows negative values; it is preferred when data is approximately normally
distributed, though its mean and SD are also affected by outliers (robust scaling handles
outliers best).
12. What is robust scaling? When is it preferred?
Robust scaling uses median and IQR: X_scaled = (X – Median) / IQR. Preferred when data
contains outliers, as it reduces their impact compared to Min-Max or Z-score.
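A minimal sketch using the standard library's `statistics.quantiles` (Python 3.8+) on a toy list with one extreme value; the median maps to 0 and the outlier has limited influence on the scale:

```python
import statistics

def robust_scale(values):
    """Scale by median and IQR so outliers have limited influence."""
    q1, median, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    return [(v - median) / iqr for v in values]

scaled = robust_scale([1, 2, 3, 4, 100])  # 100 is a hypothetical outlier
```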
13. Explain log and power transformations.
Log/sqrt/power transformations reduce skewness in data. Example: Applying log to prices
or population compresses large values, helps normalize data, and makes models more
effective.
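For example, applying a base-10 log to made-up right-skewed population figures turns huge multiplicative gaps into equal additive steps:

```python
import math

# Hypothetical right-skewed values (e.g. town/city populations).
populations = [1_000, 10_000, 100_000, 1_000_000]

# log10 compresses each tenfold jump into a step of exactly 1.
logged = [math.log10(p) for p in populations]
# [3.0, 4.0, 5.0, 6.0]
```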
14. What is a normal distribution? List its characteristics.
A bell-shaped, symmetric curve with mean = median = mode. Properties: the total area
under the curve is 1; the 68–95–99.7% rule applies; it is fully defined by its mean and
standard deviation; it is commonly observed in natural data.
15. Write the formula for a normal distribution curve.
f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²)), where μ is the mean and σ is the standard
deviation. It shows how probability density is distributed for normally distributed values.
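The formula can be sketched directly as a function; the peak of the standard normal curve (at x = μ) is 1/√(2π) ≈ 0.3989, and the curve is symmetric about the mean:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Probability density of N(mu, sigma^2) at x."""
    coeff = 1.0 / (sigma * math.sqrt(2 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
```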
16. Relationship between standard deviation and normal distribution.
Standard deviation (σ) controls the spread of a normal curve. A higher σ spreads the curve
wider; lower σ makes it narrower. It's key to defining the 68–95–99.7% intervals.
17. Explain the 68–95–99.7 rule with a diagram.
In a normal distribution: 68% of data lies within ±1σ, 95% within ±2σ, and 99.7% within
±3σ. This rule helps estimate the probability of an observation within a range.
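These percentages can be verified exactly (no simulation needed) because P(|X − μ| ≤ kσ) for a normal distribution equals erf(k/√2):

```python
import math

def within_k_sigma(k):
    """P(|X - mu| <= k*sigma) for any normal distribution, via the error function."""
    return math.erf(k / math.sqrt(2))

# within_k_sigma(1) ≈ 0.6827, within_k_sigma(2) ≈ 0.9545, within_k_sigma(3) ≈ 0.9973
```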
18. Why is standard deviation important in statistics?
It measures data spread from the mean. Low σ indicates consistency; high σ shows
variability. It helps detect outliers, compare distributions, and is vital for confidence
intervals.
19. What are outliers in a dataset?
Outliers are values significantly different from others. They may result from errors, rare
events, or natural variation. They distort analysis, especially mean and regression
outcomes.
20. How can outliers be identified?
Methods: Z-score (|Z| > 3), IQR method (outside Q1 – 1.5×IQR or Q3 + 1.5×IQR), boxplots,
scatter plots. Visualization helps detect outliers quickly.
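The Z-score method above can be sketched in a few lines of plain Python (the data in the test is made up: thirty identical readings plus one extreme value):

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return the values whose |z-score| exceeds the threshold."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [v for v in values if abs((v - mu) / sigma) > threshold]
```

One caveat worth noting: the outlier itself inflates the mean and SD used to score it, which is another reason the IQR method (next question) is often preferred on small or heavily contaminated samples.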
21. What is the IQR method? Explain with an example.
IQR = Q3 – Q1. Outlier if value < Q1 – 1.5×IQR or > Q3 + 1.5×IQR. Example: Q1=20, Q3=40 →
IQR=20 → Outlier < -10 or > 70.
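As a small sketch, the fence computation reproduces the worked example (Q1=20, Q3=40 gives fences at −10 and 70):

```python
def iqr_bounds(q1, q3):
    """Lower and upper outlier fences from the 1.5*IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

iqr_bounds(20, 40)  # → (-10.0, 70.0), matching the example above
```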
22. Should outliers always be removed? Why or why not?
Not always. Outliers should be removed only if they are errors or irrelevant. In cases like
fraud detection or medical diagnosis, outliers may carry valuable insights.
23. What are the effects of outliers on ML models?
Outliers can skew mean, affect regression lines, reduce accuracy, and cause overfitting.
Tree-based models are more robust, while linear models are more sensitive to outliers.
24. What does the area under a normal curve represent?
It represents the probability of occurrence of values in a range. The total area = 1 (100%). It
is used to compute cumulative probability in statistics.
25. Explain 1σ, 2σ, 3σ, and 6σ models with percentages.
±1σ → ~68%, ±2σ → ~95%, ±3σ → ~99.7%. The Six Sigma convention quotes 99.99966%
for 6σ (it allows for a 1.5σ process shift; the pure ±6σ interval covers ~99.9999998%).
These intervals show how much data lies near the mean and underpin Six Sigma quality
assurance.
26. How is Six Sigma used in quality control?
Six Sigma aims for processes that produce at most 3.4 defects per million opportunities
(DPMO). It uses the DMAIC cycle (Define, Measure, Analyze, Improve, Control) for
continuous quality improvement.
27. What percentage of data lies within ±1σ?
Approximately 68.27% of data lies within one standard deviation from the mean in a
normal distribution.
28. What does it mean if a value lies beyond ±3σ?
It is considered an outlier or rare event, lying in the extreme 0.3% of data. It may indicate an
error or something unusual worth investigating.