28 Questions – Data Preprocessing & Normal Distribution (6-Mark Answers)
1. What are missing values, and how do they occur?
Missing values are data points that are not stored or recorded. They may occur due to
sensor errors, human mistakes, data corruption, or skipped survey questions. Missing
values can reduce the quality of analysis, cause biases, and must be handled carefully before
modeling.
2. Explain different types of missing data (MCAR, MAR, MNAR).
MCAR (Missing Completely at Random): Missingness is unrelated to any variable. MAR
(Missing At Random): Related to observed variables but not the missing value itself. MNAR
(Missing Not At Random): Related to the value that is missing. Type determines how to
handle missingness.
3. List and explain methods to handle missing values.
1. Remove rows or columns with too many missing values. 2. Imputation: Fill with mean,
median, mode. 3. Use forward/backward fill. 4. Use algorithms that support missing values
(like XGBoost). 5. Predictive imputation using regression or ML models.
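As a rough sketch of options 1–3 above in plain Python (the column of values below is made up for illustration, with `None` marking missing entries):

```python
import statistics

# Hypothetical column with missing values marked as None; 200 is an outlier.
values = [10, 12, None, 14, None, 200]

observed = [v for v in values if v is not None]

# 1. Deletion: simply drop the missing entries.
dropped = observed

# 2. Mean imputation (the mean is pulled upward by the outlier 200).
mean_filled = [statistics.mean(observed) if v is None else v for v in values]

# 3. Median imputation (robust to the outlier).
median_filled = [statistics.median(observed) if v is None else v for v in values]

# 4. Forward fill: carry the last observed value forward.
ffilled, last = [], None
for v in values:
    last = v if v is not None else last
    ffilled.append(last)
```

Note how the mean fill (59) is dragged far above the typical value by the single outlier, while the median fill (13) is not, which motivates the comparison in the next question.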
4. Compare mean, median, and mode imputation.
Mean: Good for normal data, affected by outliers. Median: Better for skewed data, not
affected by outliers. Mode: Used for categorical variables. Choice depends on data type and
distribution.
5. What are the disadvantages of deleting missing data?
Deleting rows with missing data can reduce the dataset size, leading to loss of valuable
information and potentially biased models. It is acceptable only when missing values are
few and truly random.
6. What is predictive imputation?
Predictive imputation uses models like regression, KNN, or decision trees to estimate
missing values using other features. It can provide accurate results but adds complexity and
may cause overfitting.
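A minimal sketch of regression-based imputation, using a toy fully-observed feature `x` to predict a missing value in `y` (both columns are hypothetical, and the least-squares fit is written out by hand to stay dependency-free):

```python
# Hypothetical data: 'y' has one missing entry (None); 'x' is fully observed.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, None, 8.1, 9.9]

# Fit simple linear regression y ≈ a*x + b on the complete pairs only.
pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
n = len(pairs)
mx = sum(xi for xi, _ in pairs) / n
my = sum(yi for _, yi in pairs) / n
a = sum((xi - mx) * (yi - my) for xi, yi in pairs) / sum(
    (xi - mx) ** 2 for xi, _ in pairs
)
b = my - a * mx

# Impute each missing y with the regression prediction at its x.
y_imputed = [a * xi + b if yi is None else yi for xi, yi in zip(x, y)]
```

The same idea generalizes to KNN or tree-based imputers, which predict the missing feature from all other observed features rather than a single column.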
7. How does missing data affect model performance?
Missing data can reduce training data size, distort patterns, and create biases in models. It
may lead to incorrect predictions or errors if not handled properly before model training.
8. What is data scaling, and why is it important?
Data scaling standardizes features to a common scale, especially important for algorithms
like KNN, SVM, and gradient descent. Without scaling, features with large ranges dominate
others.
9. Explain Min-Max Scaling with a formula and example.
Formula: X_scaled = (X - Xmin) / (Xmax - Xmin). Example: If X=40, Xmin=20, Xmax=60 →
Scaled = (40 - 20)/(60 - 20) = 0.5. It converts values to the range [0, 1].
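The formula translates directly into a one-line function; the call below reproduces the worked example:

```python
def min_max_scale(x, x_min, x_max):
    """Scale x into [0, 1] relative to the observed minimum and maximum."""
    return (x - x_min) / (x_max - x_min)

min_max_scale(40, 20, 60)  # → 0.5, matching the worked example above
```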
10. What is Z-score standardization? When is it used?
Z = (X – μ) / σ. It transforms data to have mean = 0 and SD = 1. Used when data is normally
distributed. It's helpful in outlier detection and algorithms requiring standardized input.
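A quick sketch with the standard library on a made-up list of scores, confirming that the standardized data has mean 0 and SD 1:

```python
import statistics

data = [50, 60, 70, 80, 90]  # hypothetical scores

mu = statistics.mean(data)       # 70
sigma = statistics.pstdev(data)  # population standard deviation

# Z = (X - mu) / sigma for each value.
z_scores = [(x - mu) / sigma for x in data]
# After standardization the mean is 0 and the standard deviation is 1.
```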
11. Compare Min-Max Scaling and Standardization.
Min-Max rescales values into [0, 1] and is highly sensitive to outliers, since a single
extreme value stretches the range. Z-score standardization centers data at mean 0 with
SD 1 and allows negative values; it is preferred when data is approximately normally
distributed, though its mean and SD are also affected by outliers (robust scaling handles
outliers best).
12. What is robust scaling? When is it preferred?
Robust scaling uses median and IQR: X_scaled = (X – Median) / IQR. Preferred when data
contains outliers, as it reduces their impact compared to Min-Max or Z-score.
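A minimal sketch using the standard library's `statistics.quantiles` (Python 3.8+) on a toy list with one extreme value; the median maps to 0 and the outlier has limited influence on the scale:

```python
import statistics

def robust_scale(values):
    """Scale by median and IQR so outliers have limited influence."""
    q1, median, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    return [(v - median) / iqr for v in values]

scaled = robust_scale([1, 2, 3, 4, 100])  # 100 is a hypothetical outlier
```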
13. Explain log and power transformations.
Log/sqrt/power transformations reduce skewness in data. Example: Applying log to prices
or population compresses large values, helps normalize data, and makes models more
effective.
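For example, applying a base-10 log to made-up right-skewed population figures turns huge multiplicative gaps into equal additive steps:

```python
import math

# Hypothetical right-skewed values (e.g. town/city populations).
populations = [1_000, 10_000, 100_000, 1_000_000]

# log10 compresses each tenfold jump into a step of exactly 1.
logged = [math.log10(p) for p in populations]
# [3.0, 4.0, 5.0, 6.0]
```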
14. What is a normal distribution? List its characteristics.
A bell-shaped, symmetric curve with mean = median = mode. Properties: the total area
under the curve is 1; the 68–95–99.7% rule applies; it is fully defined by its mean and
standard deviation; it is commonly observed in natural data.
15. Write the formula for a normal distribution curve.
f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²)), where μ is the mean and σ is the standard
deviation. It shows how probability density is distributed for normally distributed values.
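The formula can be sketched directly as a function; the peak of the standard normal curve (at x = μ) is 1/√(2π) ≈ 0.3989, and the curve is symmetric about the mean:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Probability density of N(mu, sigma^2) at x."""
    coeff = 1.0 / (sigma * math.sqrt(2 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
```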
16. Relationship between standard deviation and normal distribution.
Standard deviation (σ) controls the spread of a normal curve. A higher σ spreads the curve
wider; lower σ makes it narrower. It's key to defining the 68–95–99.7% intervals.
17. Explain the 68–95–99.7 rule with a diagram.
In a normal distribution: 68% of data lies within ±1σ, 95% within ±2σ, and 99.7% within
±3σ. This rule helps estimate the probability of an observation within a range.
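These percentages can be verified exactly (no simulation needed) because P(|X − μ| ≤ kσ) for a normal distribution equals erf(k/√2):

```python
import math

def within_k_sigma(k):
    """P(|X - mu| <= k*sigma) for any normal distribution, via the error function."""
    return math.erf(k / math.sqrt(2))

# within_k_sigma(1) ≈ 0.6827, within_k_sigma(2) ≈ 0.9545, within_k_sigma(3) ≈ 0.9973
```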
18. Why is standard deviation important in statistics?
It measures data spread from the mean. Low σ indicates consistency; high σ shows
variability. It helps detect outliers, compare distributions, and is vital for confidence
intervals.
19. What are outliers in a dataset?
Outliers are values significantly different from others. They may result from errors, rare
events, or natural variation. They distort analysis, especially mean and regression
outcomes.
20. How can outliers be identified?
Methods: Z-score (|Z| > 3), IQR method (outside Q1 – 1.5×IQR or Q3 + 1.5×IQR), boxplots,
scatter plots. Visualization helps detect outliers quickly.
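The Z-score method above can be sketched in a few lines of plain Python (the data in the test is made up: thirty identical readings plus one extreme value):

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return the values whose |z-score| exceeds the threshold."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [v for v in values if abs((v - mu) / sigma) > threshold]
```

One caveat worth noting: the outlier itself inflates the mean and SD used to score it, which is another reason the IQR method (next question) is often preferred on small or heavily contaminated samples.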
21. What is the IQR method? Explain with an example.
IQR = Q3 – Q1. Outlier if value < Q1 – 1.5×IQR or > Q3 + 1.5×IQR. Example: Q1=20, Q3=40 →
IQR=20 → Outlier < -10 or > 70.
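As a small sketch, the fence computation reproduces the worked example (Q1=20, Q3=40 gives fences at −10 and 70):

```python
def iqr_bounds(q1, q3):
    """Lower and upper outlier fences from the 1.5*IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

iqr_bounds(20, 40)  # → (-10.0, 70.0), matching the example above
```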
22. Should outliers always be removed? Why or why not?
Not always. Outliers should be removed only if they are errors or irrelevant. In cases like
fraud detection or medical diagnosis, outliers may carry valuable insights.
23. What are the effects of outliers on ML models?
Outliers can skew mean, affect regression lines, reduce accuracy, and cause overfitting.
Tree-based models are more robust, while linear models are more sensitive to outliers.
24. What does the area under a normal curve represent?
It represents the probability of occurrence of values in a range. The total area = 1 (100%). It
is used to compute cumulative probability in statistics.
25. Explain 1σ, 2σ, 3σ, and 6σ models with percentages.
±1σ → ~68%, ±2σ → ~95%, ±3σ → ~99.7%. The Six Sigma convention quotes 99.99966%
for 6σ (it allows for a 1.5σ process shift; the pure ±6σ interval covers ~99.9999998%).
These intervals show how much data lies near the mean and underpin Six Sigma quality
assurance.
26. How is Six Sigma used in quality control?
Six Sigma aims for processes that produce at most 3.4 defects per million opportunities
(DPMO). It uses the DMAIC cycle (Define, Measure, Analyze, Improve, Control) for
continuous quality improvement.
27. What percentage of data lies within ±1σ?
Approximately 68.27% of data lies within one standard deviation from the mean in a
normal distribution.
28. What does it mean if a value lies beyond ±3σ?
It is considered an outlier or rare event, lying in the extreme 0.3% of data. It may indicate an
error or something unusual worth investigating.