FSP Notes

Module 1: Introduction to Statistics

1. What is Statistics?
Definition: Statistics is the art of learning from data.

It helps in collecting, organizing, analyzing, and interpreting data to draw conclusions.

The study of statistics can be broken down into four main stages:

1. Data Collection – Gathering information.

2. Data Description – Summarizing and organizing data.

3. Data Analysis – Identifying relationships between variables.

4. Drawing Inferences – Making conclusions based on analysis.

2. Data Collection
First step in statistics: You need data to analyze before making any conclusions.

Two ways to collect data:

1. Using Pre-existing Data:

Data that is already available from reliable sources.

Examples: Census data, government reports, industry studies.

2. Conducting Experiments:

Designing and performing experiments to collect new data.

Example: A teacher testing two different teaching methods.

3. Example: Experiment on Teaching Styles
Scenario: A teacher wants to determine which teaching style is more effective.

Approach:

Divide the class into two groups.

Use Teaching Style 1 for Group 1.

Use Teaching Style 2 for Group 2.

Measure their exam scores to compare effectiveness.

Potential Issue: Bias in group selection.

If Group 1 consists of novice learners and Group 2 has experienced students, then
differences in scores might not be due to teaching styles.

Solution: Groups should be selected randomly to ensure fairness.

4. Data Description
After collecting data, it must be organized and summarized.

This includes:

Raw Data: The actual recorded values.

Summary Measures:

Mean (Average)

Median (Middle value)

Mode (Most frequent value)

The branch of statistics that deals with data description is called Descriptive Statistics.

5. Data Analysis and Inference


Once data is collected and summarized, the next step is drawing conclusions.

This falls under Inferential Statistics.

The Role of Probability in Statistical Analysis
Could results occur just by chance?

Example:

What if Group 2 performed better just by coincidence?

Probability helps measure the likelihood of such occurrences.

Probability Models help determine how random an event is.

Key Takeaway: Statistical inference requires an understanding of probability.

6. Understanding Population and Samples


What is a Population?
Definition: The entire collection of individuals/items being studied.

Example: If we want to find the average age of all residents in a town, the entire town’s
residents make up the population.

What is a Sample?
Definition: A subset of the population that is examined to make conclusions about the
whole.

Example: Instead of surveying all residents in the town, we randomly select 100 people
and use their ages to estimate the town's average age.

Sampling Bias
A sample must be representative of the population.

Example of bad sampling:

If we collect data from only people entering a library, the sample may
overrepresent younger people.

Solution: Select participants randomly.

7. Example: Election Polling
Scenario: Predicting the winner between Party A and Party B.

Which sampling method is the best?

A) Poll all voters at a college basketball game. ❌ (Not representative—may include mostly young voters.)

B) Poll all voters at a fancy restaurant. ❌ (Not representative—may include mostly wealthy individuals.)

C) Obtain a voter registration list and randomly select 100 names. ✅ (Best method—random selection of eligible voters.)

D) Conduct a TV call-in poll. ❌ (Only opinionated viewers call, leading to bias.)

E) Choose names from a telephone directory. ❌ (Might exclude younger voters without landlines.)

Real-life Example: 1936 US Presidential Election


Polling Failure: A magazine named Literary Digest incorrectly predicted that Alfred
Landon would win instead of Franklin Roosevelt.

Reason for Failure:

They only polled automobile and telephone owners.

Problem: During the Great Depression, only wealthy people owned cars and
phones.

Result: The sample did not represent the whole population, leading to incorrect
predictions.

Lesson Learned:

A sample must include diverse and randomly chosen participants to be accurate.

8. Summary of Key Takeaways


Statistics helps in learning from data.

Four key stages:

1. Data Collection

2. Data Description

3. Data Analysis

4. Inference & Probability

Good sampling is critical:

A sample should be randomly selected to avoid bias.

Probability is important in understanding how random chance affects results.

Real-world case studies (like the 1936 US election polling failure) show the importance
of proper sampling methods.

9. What’s Next?
In the next session, we will dive deeper into Descriptive Statistics:

How to summarize data

Measures of central tendency (Mean, Median, Mode)

Graphical Representations (Histograms, Bar Graphs, Pie Charts, etc.)

Stay tuned for the next module! 🚀

Descriptive Statistics: Organizing &
Visualizing Data
1. Introduction to Descriptive Statistics
Descriptive statistics involves organizing and summarizing data.

This session focuses on organizing and visualizing data.

2. Organizing Data
Example Scenario: Streaming Platform Preferences
A survey was conducted with 60 students to determine their preferred streaming
platform.

The options included: Netflix, Prime, Disney Hotstar, YouTube, Z5, and Sony Liv.

The results showed different counts of votes per platform.

Frequency Table
Definition: A table displaying each category and the number of occurrences (frequency).

Example Structure:

| Streaming Platform | Frequency (Votes) |
|---|---|
| Netflix | 12 |
| Prime | 10 |
| Hotstar | 15 |
| YouTube | 8 |
| Z5 | 9 |
| Sony Liv | 6 |

3. Visualizing Data
Types of Graphs

(A) Line Graph

X-axis: Different categories (Streaming platforms).

Y-axis: Frequency (Votes).

Points are plotted and connected with lines.

(B) Bar Graph

Similar to a line graph but with thicker bars instead of lines.

Key Difference: The width of bars makes it visually distinct.

(C) Frequency Polygon

Created by connecting the peaks of a line graph.

Shows trends and patterns effectively.

4. Relative Frequency Table


Definition: A table where each frequency is converted into a proportion of the total.

Formula:

Relative Frequency = Frequency of a category / Total number of values (N)

Example Calculation:

Netflix: 12 / 60 = 0.20 (20%)

Hotstar: 15 / 60 = 0.25 (25%)
Visual Representation: Pie Chart

Each category is represented as a sector of a circle.

The angle of each sector is proportional to the relative frequency.
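If the survey responses are in a spreadsheet, the frequencies and relative frequencies behind the table and pie chart can be checked with built-in functions. This is a minimal sketch assuming the 60 platform names are listed in A2:A61 (a hypothetical range).

excel

=COUNTIF(A2:A61, "Netflix")                    # frequency of Netflix (12)
=COUNTIF(A2:A61, "Netflix") / COUNTA(A2:A61)   # relative frequency (12/60 = 0.20)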

5. Grouped Data & Histograms


Issue with Individual Values
If data contains a large range of values (e.g., student marks between 0–100), direct
plotting is inefficient.

Histograms (Grouped Data)


Solution: Data is grouped into equal class intervals (e.g., 0–10, 11–20, etc.).

Class Interval Considerations:

Too large: Important details are lost.

Too small: Data becomes too fragmented.

Class Boundaries
Example (Marks Distribution):

Interval: 35–40

Convention: Left End Inclusion → Includes 35 but not 40.

If using [ ], value is included.

If using ( ), value is excluded.
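When the raw marks live in a spreadsheet, the grouped counts for a histogram can be produced with FREQUENCY. A minimal sketch, assuming marks in A2:A101 and the upper class boundaries (10, 20, …, 100) in C2:C11 (both ranges hypothetical). Note that FREQUENCY counts values up to and including each boundary, a right-end-inclusion convention, so it differs slightly from the left-end-inclusion rule described above.

excel

=FREQUENCY(A2:A101, C2:C11)   # one count per class interval, filled down a column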

6. Relative Frequency Histogram


The same histogram, but Y-axis is converted to relative frequencies.

Total of all Y-values should sum to 1 (or 100%).

7. Cumulative Frequency & Ogives


Cumulative Frequency
Definition: Sum of all previous frequencies up to a certain class interval.

Example:

| Marks Range | Frequency | Cumulative Frequency |
|---|---|---|
| 0 - 10 | 5 | 5 |
| 11 - 20 | 8 | 13 (5+8) |
| 21 - 30 | 12 | 25 (5+8+12) |

Final Cumulative Frequency = Total number of data points.
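In a spreadsheet, the cumulative frequency column is just a running total of the frequency column. A minimal sketch assuming the frequencies sit in B2:B11 (hypothetical layout); the formula goes in C2 and is filled down.

excel

=SUM($B$2:B2)   # cumulative frequency; filling down expands the range row by row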

Cumulative Relative Frequency


Cumulative frequency values are converted to relative values.

Formula:

Relative Cumulative Frequency = Cumulative Frequency / Total Values

The final value should always be 1 (or 100%).

Ogives (Cumulative Frequency Graphs)


Ogive curves always increase.

X-axis: Upper Boundaries of Intervals.

Y-axis: Cumulative Frequencies.

8. Practical Example: Bulb Lifetime

Data: Lifetimes of 200 bulbs ranging from 500 to 1500 hours.

Steps:

Group into class intervals of 100 hours.

Create a cumulative frequency table.

Convert into a relative frequency histogram & cumulative frequency graph.

Use of CDF (Cumulative Distribution Function)


Helps in decision-making.

Example: Warranty Period Selection

If 51% of bulbs last 1000 hours or less, then 1000 hours is a good warranty period.

Reliability Calculation
Example:

At 800 hours, 9.5% of bulbs failed.

Reliability = 100% - 9.5% = 90.5%.
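This reliability figure is one formula on the cumulative table. A minimal sketch assuming the cumulative number of failures up to 800 hours is in cell C5 (a hypothetical cell) and there are 200 bulbs in total.

excel

=1 - C5/200   # reliability at 800 hours, e.g. 1 - 19/200 = 0.905 (90.5%)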

9. Summary of Key Points


Organizing Data:

Frequency Table and Relative Frequency Table.

Visualizing Data:

Line Graph, Bar Graph, Pie Chart, Frequency Polygon.

Grouped Data:

Histograms & Class Intervals.

Cumulative Frequency:

Ogives & Cumulative Distribution Functions (CDFs).

Real-World Applications:

Product warranty, reliability analysis.

Central Tendency

1. Introduction to Measures of Central Tendency


Definition: Statistical tools that summarize a dataset by identifying a central or typical
value.

Used to understand, compare, and make decisions based on data.

The three main measures:

1. Mean (Arithmetic Average)

2. Median (Middle Value)

3. Mode (Most Frequently Occurring Value)

2. Mean (Arithmetic Mean)


Definition:
The sum of all data points divided by the total number of data points.

Formula:

Mean (x̄) = Σxᵢ / N

Where:

xᵢ are the individual data points.

N is the total number of data points.

Example Calculation:

Data: 10, 15, 20

Mean:

(10 + 15 + 20) / 3 = 45 / 3 = 15

Real-Life Example: Average Mobile Data Usage


Survey: 20 college students were asked about their monthly mobile data usage (in GB).

Collected Data (GB):


5, 8, 10, 10, 12, 12, 12, 15, 15, 15, 18, 18, 20, 20, 22, 25, 30, 35, 40, 100 (outlier).

Finding the Mean:

512 / 20 = 25.6 GB

Conclusion:

The average data usage is 25.6 GB.

However, most students use less than this due to the 100 GB outlier.

Mean is sensitive to outliers.
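The outlier effect is easy to see in a spreadsheet by computing the mean with and without the extreme value. A minimal sketch assuming the 20 usage values are in A2:A21 with the 100 GB outlier in A21 (hypothetical layout).

excel

=AVERAGE(A2:A21)   # mean of all 20 values, pulled up by the 100 GB outlier
=AVERAGE(A2:A20)   # mean without the outlier — noticeably lower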

3. Median (Middle Value)


Definition:
The middle value when data is arranged in ascending order.

If N is odd: The middle value is at position (N + 1) / 2.

If N is even: The median is the average of the two middle values.

Steps to Find Median:


1. Sort the Data (ascending order).

2. Locate the Middle Position:

Odd N: Pick the middle value.

Even N: Take the average of the two middle values.

Example Calculation (Same Data as Before):


1. Sorted Data:

5, 8, 10, 10, 12, 12, 12, 15, 15, 15, 18, 18, 20, 20, 22, 25, 30, 35, 40, 100

2. Middle Position:

N = 20 (Even)

Middle two values: 10th and 11th values → 15 & 18.

Median:

(15 + 18) / 2 = 16.5 GB

3. Conclusion:

The median is 16.5 GB.

Better representative of central value than mean.

Not affected by outliers like 100 GB.

4. Mode (Most Frequent Value)


Definition:
The most frequently occurring value in a dataset.

There can be:

One mode → Unimodal (e.g., 5, 5, 6, 7 → Mode: 5)

Two modes → Bimodal (e.g., 5, 5, 6, 6, 7 → Modes: 5, 6)

More than two modes → Multimodal.

Example Calculation (Same Data as Before):


Mode Calculation:

12 occurs 3 times.

15 occurs 3 times.

Modes: 12 GB & 15 GB (Bimodal Dataset).

Conclusion:
Mode shows the most common data usage trends.

More useful than mean when identifying popular values.
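Both measures have direct spreadsheet equivalents. A minimal sketch, again assuming the usage data is in A2:A21 (hypothetical range). Note that MODE returns only one value, so for a bimodal dataset like this one the second mode has to be checked separately, for example with COUNTIF.

excel

=MEDIAN(A2:A21)       # 16.5 GB for this dataset
=MODE(A2:A21)         # returns a single mode only
=COUNTIF(A2:A21, 15)  # confirm how often the other candidate mode occurs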

5. Key Observations & Comparison

| Measure | Value | Key Property | Best Used When? |
|---|---|---|---|
| Mean | 25.6 GB | Affected by outliers | Understanding overall data usage |
| Median | 16.5 GB | Resistant to outliers | Understanding typical usage |
| Mode | 12 & 15 GB | Shows most frequent values | Identifying popular trends |

Mean is misleading when outliers exist.

Median gives a better central value in skewed distributions.

Mode helps find the most common values.

6. Practical Applications of Mean, Median, and Mode


When to Use Mean?
Overall Trends:

Telecom Company estimating total network demand.

City Government analyzing average household income.

Budgeting & Planning:

Universities estimating total data required for all students.

When to Use Median?


Skewed Data Situations:

Salaries in a Company (Bill Gates example – outliers affect mean).

Housing Prices (Few expensive houses skew mean).

Real-Life Example:

10 Microsoft employees:

9 employees earn $100,000 each.

Bill Gates earns $5 million.

Mean Salary: (9 × $100,000 + $5,000,000) / 10 = $590,000 (skewed).

Median Salary: $100,000 (better central value).

When to Use Mode?


Finding Popular Choices:

Marketing: Identifying most sold product size.

Data Plans: Telecom company choosing the best-selling data pack.

Decision Making Example:

A telecom provider should offer 12GB & 15GB plans, as those are the most used.

7. When Can Mean Be Misleading?


Example 1: Outlier Effect

Data: 5, 10, 15, 20, 100.


Mean: (5 + 10 + 15 + 20 + 100) / 5 = 30 (skewed).

Median: 15 (better representative).

Example 2: Decision Making

A telecom company sees Mean = 25.6 GB and assumes 30GB plans are best.

Issue? Most students use 12GB or 15GB → 30GB plan isn't necessary.

8. Recap & Final Thoughts
Mean: Best for overall trends, but affected by outliers.

Median: Best for skewed data, gives better central representation.

Mode: Best for common trends (popular choices in a dataset).

Choosing the right measure is crucial:

Use Mean for total estimates.

Use Median for skewed data.

Use Mode for popular choices.

9. What’s Next?
Upcoming Topic: Sample Mean & Variance Calculations

We will discuss how variability affects data interpretation.

Stay tuned! 🚀

Mean,
Variance, and Standard Deviation

1. Recap: Measures of Central Tendency


Previously discussed:

Mean: Sum of values divided by the number of values.

Median: Middle value in a sorted dataset.

Mode: Most frequently occurring value.

This session focuses on:

Mean visualization in frequency distributions.

Understanding dispersion (variance and standard deviation).

2. Mean in Frequency Distribution


Definition
Mean μ (Mu) is calculated as:

μ = Σxᵢ / N

Where:

xᵢ = individual values.

N = total number of values.

Example: Students' Marks


Given marks of 10 students:

30, 50, 50, 60, 60, 60, 70, 80, 90, 90

Mean Calculation:

Sum of marks: 30 + 50 + 50 + 60 + 60 + 60 + 70 + 80 + 90 + 90 = 630.

Mean: 630 / 10 = 63

Notations:

Population mean: μ.

Sample mean: x̄.

Mean from Frequency Table


Given a frequency distribution:

| Marks | Frequency (No. of Students) |
|---|---|
| 30 | 1 |
| 40 | 0 |
| 50 | 2 |
| 60 | 3 |
| 70 | 1 |
| 80 | 1 |
| 90 | 2 |

Formula using frequency:

μ = Σ(xᵢ · fᵢ) / Σfᵢ

xᵢ = Data values (marks).

fᵢ = Frequency (no. of students with those marks).

Σfᵢ = Total students = 10.

Mean Calculation:

(30 × 1) + (50 × 2) + (60 × 3) + (70 × 1) + (80 × 1) + (90 × 2)

= 30 + 100 + 180 + 70 + 80 + 180 = 630.

Mean = 630 / 10 = 63.

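The frequency-table version of the mean maps directly onto SUMPRODUCT. A minimal sketch assuming the mark values are in A2:A8 and their frequencies in B2:B8 (hypothetical layout).

excel

=SUMPRODUCT(A2:A8, B2:B8) / SUM(B2:B8)   # Σ(xᵢ · fᵢ) / Σfᵢ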
3. Visualizing Mean with Histograms


Histogram Representation:

X-axis: Marks Range (e.g., 30–40, 40–50).

Y-axis: Frequency (Number of students).

Mean Location: Mean appears around the center of the histogram.

Effect of Outliers on Mean:

If high values (e.g., 90s) are removed, mean shifts left.

If low values are removed, mean shifts right.

4. Understanding Dispersion (Spread of Data)


Definition
Dispersion measures how spread out the data points are.

While mean gives a central value, dispersion helps understand variability.

Common Measures of Dispersion:

Range: Difference between max and min values.

Variance: Measures squared deviation from the mean.

Standard Deviation: Square root of variance.

Example
Consider two datasets:

Dataset A: 50, 55, 60, 65, 70 → Mean = 60.

Dataset B: 30, 45, 60, 75, 90 → Mean = 60.

Both have the same mean, but B is more spread out.

5. Variance: Measuring Data Spread
Definition
Variance (σ²) tells us how far values are from the mean.

Formula for population variance:

σ² = Σ(xᵢ − μ)² / N

Where:

xᵢ = Individual data points.

μ = Mean.

N = Total number of values.

Variance for Sample Data:

s² = Σ(xᵢ − x̄)² / (N − 1)

Uses N − 1 instead of N for better estimation of the population variance.

Example Calculation

Given Data (Population):

Marks: 30, 50, 50, 60, 60, 60, 70, 80, 90, 90.

Mean (μ): 63.

Step 1: Calculate Deviations from Mean

| Value (xᵢ) | xᵢ − μ | (xᵢ − μ)² |
|---|---|---|
| 30 | −33 | 1089 |
| 50 | −13 | 169 |
| 50 | −13 | 169 |
| 60 | −3 | 9 |
| 60 | −3 | 9 |
| 60 | −3 | 9 |
| 70 | 7 | 49 |
| 80 | 17 | 289 |
| 90 | 27 | 729 |
| 90 | 27 | 729 |
| Total | 0 | 2180 |

Step 2: Compute Population Variance

σ² = 2180 / 10 = 218

Population variance = 218.

Step 3: Compute Sample Variance

Sample size N = 10.

Formula for sample variance:

s² = 2180 / (10 − 1) = 2180 / 9 = 242.2

Sample variance = 242.2 (larger than population variance).

6. Standard Deviation (σ)

Definition
Square root of variance.

Formula:

σ = √σ²

For population:

σ = √218 ≈ 14.76

For sample:

s = √242.2 ≈ 15.56
Interpretation:

Smaller standard deviation → Data points close to mean.

Larger standard deviation → Data is spread out.
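Google Sheets has separate functions for the population and sample versions of these formulas. A minimal sketch assuming the 10 marks are in A2:A11 (hypothetical range).

excel

=VARP(A2:A11)     # population variance (divides by N)
=VAR(A2:A11)      # sample variance (divides by N − 1)
=STDEVP(A2:A11)   # population standard deviation
=STDEV(A2:A11)    # sample standard deviation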

7. Population vs Sample Variance

| Feature | Population Variance (σ²) | Sample Variance (s²) |
|---|---|---|
| Formula | Σ(xᵢ − μ)² / N | Σ(xᵢ − x̄)² / (N − 1) |
| Use Case | When entire population is available | When using a subset (sample) |
| Bias | Unbiased | Corrects for bias (larger) |
| Example | Cities' full census data | Survey of a few citizens |

8. Why Use N − 1 in Sample Variance?


Using N − 1 instead of N corrects bias.
A sample’s variance underestimates true population variance.

Dividing by N − 1 makes sample variance closer to true population variance.

9. Summary & Next Steps


Mean helps find central value, but does not show spread.

Variance & Standard Deviation measure how dispersed data is.

Key formulas:

σ² (Population Variance).

s² (Sample Variance).

σ (Population Standard Deviation).

s (Sample Standard Deviation).
Next Session: Advanced statistics concepts (skewness, kurtosis).

Stay tuned! 🚀

Extra Detailed Notes on Normal
Distribution & Central Limit Theorem

1. Introduction
This session covers:

Normal Distribution: What it is and its properties.

Skewness in Data: Left and right-skewed distributions.

Central Limit Theorem (CLT): How larger sample sizes lead to normal distributions.

Chebyshev’s Inequality: How it applies to normal and general distributions.

2. What is Normal Distribution?


Definition
A probability distribution where data is symmetrically clustered around a central
value.

Forms a bell-shaped curve (Gaussian Curve).

Key Feature: Most values are near the mean, with fewer occurring further away.

Graphical Representation
X-axis: Data values.

Y-axis: Probability or frequency.

The peak represents the mean, median, and mode (which are equal in a perfect normal
distribution).

3. Properties of a Normal Distribution
1. Symmetric Around the Mean:

The left and right sides of the curve are mirror images.

Mean = Median = Mode.

2. Predictable Data Spread (Empirical Rule - 68-95-99.7 Rule):

68% of values fall within 1 standard deviation (σ).

95% of values fall within 2 standard deviations (2σ).

99.7% of values fall within 3 standard deviations (3σ).

3. Total Area Under the Curve = 1 (100%):

Higher probability near the mean, lower at extremes.

4. Tails Extend Infinitely:

Though data points at extremes are rare, they never reach zero.

Example: Student Marks


Mean = 70 marks.

Standard Deviation (σ) = 5 marks.

1σ range (68% of students): 65 to 75 marks.

2σ range (95% of students): 60 to 80 marks.

3σ range (99.7% of students): 55 to 85 marks.
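The 68-95-99.7 percentages can be checked with the cumulative normal distribution function, using the class example above (mean 70, σ = 5). NORMDIST(x, mean, sd, TRUE) returns P(X ≤ x), so subtracting two values gives the probability of an interval.

excel

=NORMDIST(75, 70, 5, TRUE) - NORMDIST(65, 70, 5, TRUE)   # ≈ 0.68, within 1σ
=NORMDIST(80, 70, 5, TRUE) - NORMDIST(60, 70, 5, TRUE)   # ≈ 0.95, within 2σ
=NORMDIST(85, 70, 5, TRUE) - NORMDIST(55, 70, 5, TRUE)   # ≈ 0.997, within 3σ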

4. Normal vs Skewed Distributions


A. Symmetric (Normal) Distribution
Mean, Median, and Mode are equal.

Example: Heights of people in a population.

B. Left-Skewed Distribution (Negative Skew)


Tail on the left (few extreme low values).

Mean < Median < Mode.

Example: Income distribution in a poor community.

C. Right-Skewed Distribution (Positive Skew)


Tail on the right (few extreme high values).

Mode < Median < Mean.

Example: CEO salaries in a company.

5. Bimodal Distributions
Definition: A distribution with two peaks.

Occurs when two different groups exist in a dataset.

Example:

Exam scores of two groups (one prepared, one unprepared).

A mixture of two populations with different heights.

6. Real-Life Examples of Normal Distribution


Biological:

Human heights, blood pressure levels.

IQ scores in a population.

Social & Economic:

Employee performance ratings.

Exam scores (when no extreme factors exist).

Income levels in a balanced economy.

Industrial & Scientific:

Manufacturing defects in a factory.

Lifespan of light bulbs.

7. Central Limit Theorem (CLT)


Definition
No matter the original shape of the population distribution, the sampling
distribution of the mean will be normal if the sample size is large enough.

Sample size needed: Typically ≥30 for good normal approximation.

Key Implication
Even if data is skewed, taking many random samples and calculating the means will
eventually create a normal distribution.

Why CLT is Important?


Allows statistical inference (like hypothesis testing).

Helps estimate population parameters using sample statistics.

Example: Examining 100 student scores instead of the entire country’s scores.

8. Comparison: Normal Distribution vs Chebyshev’s


Inequality
Chebyshev’s Inequality
Applies to ANY distribution (not just normal).

Gives an upper bound for probabilities.

Formula:

P(|X − μ| ≥ kσ) ≤ 1 / k²

At k = 2 (2σ range), at most 25% of values are outside.

At k = 3 (3σ range), at most 11.1% of values are outside.

Comparison with Normal Distribution


Normal Distribution is more precise:

Within 2σ → 95% of values (versus Chebyshev's guarantee of at least 75%).

Within 3σ → 99.7% of values (versus Chebyshev's guarantee of at least 88.9%).

9. Summary

| Concept | Description |
|---|---|
| Normal Distribution | Bell-shaped curve where data clusters around the mean. |
| Empirical Rule | 68% within 1σ, 95% within 2σ, 99.7% within 3σ. |
| Skewness | Data can be left-skewed (negative) or right-skewed (positive). |
| Bimodal Distribution | Two peaks indicate mixed groups in data. |
| Real-World Applications | Human traits, IQ scores, test scores, production quality. |
| Central Limit Theorem | The distribution of sample means approaches normality as the sample size grows. |
| Chebyshev's Inequality | Works for any distribution but gives upper bounds only. |

10. Next Steps


Upcoming topic: Probability and Normal Distribution.

Key Focus:

Understanding Probability Density Function (PDF).

Using Z-scores to compare values.

Applying normal distribution to real-world problems.

🔹 Key Takeaway: Many real-world data follow normal distribution, and the Central Limit
Theorem explains why. This is fundamental for statistical analysis! 🚀

Extra Detailed Notes on Chebyshev’s
Inequality & Probability Applications

1. Introduction
Recap of previous topics:

Measures of Central Tendency: Mean, Median, Mode.

Measures of Dispersion: Variance and Standard Deviation.

Normal Distribution: Bell-curve behavior of data.

Central Limit Theorem: Large sample sizes approach normality.

New topic: Chebyshev’s Inequality

Helps analyze data with unknown distributions.

Provides an upper bound on how much data deviates from the mean.

Applies to all datasets, unlike normal distribution rules.

2. Understanding Variance & Standard Deviation Recap


Variance (σ²)
Measures spread of data points from the mean.

Formula:

σ² = Σ(xᵢ − μ)² / N
Unit of variance: Square of the data’s unit.

Standard Deviation (σ)

Square root of variance:

σ = √σ²

Same unit as data, easier to interpret.

Helps understand how much values typically deviate from the mean.

3. Probability & Frequency in Data


Probability = Likelihood of an event occurring.

Relative Frequency Approach:

Frequency of a value divided by total occurrences.

When dataset is large, the frequency function turns into a probability density
function (PDF).

Example: Exam Scores

100,000 students' scores plotted as a histogram.

Frequencies converted into relative probabilities.

Leads to a smooth probability distribution curve.

4. Chebyshev’s Inequality: Definition & Formula


Key Idea:

Determines the probability that a value is far from the mean.

True for any dataset, no matter its shape.

Formula

P(|X − μ| ≥ kσ) ≤ 1 / k²

Where:

X = Random variable.

μ = Mean.

σ = Standard deviation.

k = Number of standard deviations away from the mean.

Meaning
The probability of being more than k standard deviations away from the mean is at most 1/k².

Gives an upper bound, meaning actual probability could be lower.

5. Chebyshev’s Inequality in Action


Example 1: Fraud Detection in AI Systems
Scenario:

An AI system tracks daily transaction values on an e-commerce platform.

Mean transaction value: $200.

Standard deviation: $50.

The company defines fraudulent transactions as those more than three standard deviations away from the mean.

Step 1: Identify Thresholds

Fraudulent range:

200 − (3 × 50) = 50

200 + (3 × 50) = 350

Transactions below $50 or above $350 are suspicious.

Step 2: Compute Probability

Using Chebyshev's Formula:

P(|X − 200| ≥ 3 × 50) ≤ 1/3² = 1/9 ≈ 11.1%

Interpretation:

At most 11% of transactions could be fraudulent.

AI flags these transactions for further review.
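The bound can also be compared against the observed data. A minimal sketch assuming the daily transaction values are in A2:A1001, with the mean placed in D1 and the standard deviation in D2 (all hypothetical cells); Chebyshev guarantees the last result is at most 1/3² ≈ 11.1%.

excel

=AVERAGE(A2:A1001)   # mean, placed in D1 (about 200 in this scenario)
=STDEVP(A2:A1001)    # standard deviation, placed in D2 (about 50)
=(COUNTIF(A2:A1001, ">" & (D1 + 3*D2)) + COUNTIF(A2:A1001, "<" & (D1 - 3*D2))) / COUNT(A2:A1001)   # observed fraction beyond 3σ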

Example 2: Delivery Time Guarantees


Scenario:

AI tracks food delivery times.

Mean time: 30 minutes.

Standard deviation: 5 minutes.

Company guarantees deliveries between 20-40 minutes.

Step 1: Define Deviations

Range from Mean:

Lower bound: 30 − 2(5) = 20

Upper bound: 30 + 2(5) = 40

Step 2: Compute Probability

Using Chebyshev's Formula:

P(|X − 30| ≥ 2 × 5) ≤ 1/2² = 1/4 = 25%

Interpretation:

At most 25% of deliveries will be late or too early.

At least 75% will meet the guarantee.

Example 3: Late Delivery Identification


Company defines "very late" as more than 45 minutes.

Threshold:

30 + (3 × 5) = 45

Using Chebyshev's Formula:

P(|X − 30| ≥ 3 × 5) ≤ 1/3² = 1/9 ≈ 11.1%

Conclusion:

At most 11% of deliveries will be very late.

89% will be on time.

6. Why is Chebyshev’s Inequality Useful?

| Feature | Benefit |
|---|---|
| Works for Any Distribution | Can be applied to non-normal data. |
| Provides a Worst-Case Bound | Ensures extreme values are rare. |
| Useful for Risk Assessment | AI systems, fraud detection, delivery guarantees. |

7. Comparison: Chebyshev vs Normal Distribution

| Concept | Chebyshev's Inequality | Normal Distribution (68-95-99.7 Rule) |
|---|---|---|
| Applies to | Any dataset | Only normal (bell curve) datasets |
| 1σ Range | No guarantee | 68% of values |
| 2σ Range | At most 25% outside | 95% inside |
| 3σ Range | At most 11% outside | 99.7% inside |
| Use Case | Unknown distributions | Well-formed normal data |

Key Takeaway: Chebyshev’s rule sets a safe boundary for extreme values, even if the
dataset isn’t normal.

8. Summary & Key Takeaways

Chebyshev’s Inequality:

Defines probability bounds for data deviation from the mean.

Works for any dataset, unlike normal distribution rules.

Gives a worst-case probability limit.

Real-World Applications:

Fraud detection: Detecting outlier transactions.

AI-powered logistics: Ensuring on-time delivery rates.

Risk assessment: Estimating chances of extreme values.

Next Session:

Paired datasets & normal distribution applications.

Using Z-scores for statistical comparisons.

🔹 Key Takeaway: Chebyshev’s inequality helps analyze extreme values in any dataset,
making it useful in AI, fraud detection, and logistics. 🚀

Extra Detailed Notes on Types of Data
in Statistics

1. Introduction
Why understanding data types is important?

Helps in selecting the right statistical methods.

Determines how data should be collected, analyzed, and interpreted.

Classification of Data:

Quantitative Data (Numerical)

Qualitative Data (Categorical)

2. Main Types of Data


A. Quantitative Data (Numerical)
Definition: Data that consists of numbers and can be measured or counted.

Subtypes:

1. Discrete Data

2. Continuous Data

1. Discrete Data

Definition: Data that can take only specific, countable values (no in-between values).

Examples:

Number of students in a class (10, 20, 25 but not 22.5).

Number of books in a library (50, 100, 150 but not 125.4).

Steps taken per day (1000, 1500, 2000 but not 1750.8).

2. Continuous Data

Definition: Data that can take any value within a range, including decimals and
fractions.

Examples:

Height of a person (165.5 cm, 170.3 cm).

Weight (55.2 kg, 60.8 kg).

Temperature (25.3°C, 30.7°C).

Time taken to complete a task (12.75 seconds).

B. Qualitative Data (Categorical)


Definition: Data that represents categories, groups, or labels rather than numbers.

Subtypes:

1. Nominal Data

2. Ordinal Data

1. Nominal Data (No Order)

Definition: Categorical data with no inherent order.

Examples:

Eye color (Black, Brown, Blue, Green).

Car brands (Toyota, Honda, Ford, Tesla).

Blood type (A, B, AB, O).

Gender (Male, Female, Non-binary).

2. Ordinal Data (Ordered)

Definition: Categorical data that has a meaningful order but unequal spacing between
categories.

Examples:

Customer satisfaction ratings:

Poor → Average → Good → Excellent.

Education levels:

High School → Bachelor’s → Master’s → PhD.

Ranking in a competition:

1st place → 2nd place → 3rd place.

3. Summary Table of Data Types

| Type | Definition | Examples |
|---|---|---|
| Quantitative | Numerical data, measurable | Age, height, weight, income |
| Discrete (Quantitative) | Countable numbers, no fractions | Number of students, books, steps |
| Continuous (Quantitative) | Any value within a range, includes decimals | Height, weight, temperature |
| Qualitative | Categorical data, labels | Gender, nationality, car brands |
| Nominal (Qualitative) | Categories with no specific order | Eye color, blood type, city names |
| Ordinal (Qualitative) | Ordered categories, no equal spacing | Satisfaction ratings, education levels |

4. Real-World Examples of Data Types

| Scenario | Data Type | Explanation |
|---|---|---|
| Number of cars sold | Discrete | Countable whole numbers (e.g., 15, 22) |
| Time spent on a website | Continuous | Any value within a range (e.g., 3.45 seconds) |
| Customer reviews (1 to 5 stars) | Ordinal | Ordered, but difference between 4★ and 5★ isn't fixed |
| Favorite ice cream flavors | Nominal | No inherent order (Vanilla, Chocolate, Strawberry) |
| Students' heights in a class | Continuous | Any value within a range (e.g., 165.3 cm) |

5. Key Differences Between Data Types

| Feature | Discrete | Continuous | Nominal | Ordinal |
|---|---|---|---|---|
| Numerical? | ✅ Yes | ✅ Yes | ❌ No | ❌ No |
| Can have decimals? | ❌ No | ✅ Yes | ❌ No | ❌ No |
| Can be ranked? | ❌ No | ❌ No | ❌ No | ✅ Yes |
| Example | Number of students | Height | Eye color | Satisfaction rating |

6. Importance of Understanding Data Types


Choosing the right analysis method:

Mean & Standard Deviation → Used for continuous data.

Frequencies & Percentages → Used for nominal data.

Ranking Methods → Used for ordinal data.

Visualization Techniques:

Bar Charts → Best for categorical (nominal & ordinal) data.

Histograms → Best for continuous data.

Pie Charts → Best for nominal data with percentages.

7. Quiz Questions
Identify the Data Type
1. The number of students in a class: __________?

2. The brand of smartphone a person uses: __________?

3. A patient’s blood pressure readings: __________?

4. Rankings in a singing competition: __________?

5. The weight of newborn babies: __________?

Answers:

1. Discrete (Countable whole numbers)

2. Nominal (Categories with no order)

3. Continuous (Can take any value in a range)

4. Ordinal (Ranked but unequal spacing)

5. Continuous (Measured with decimals)

8. What’s Next?
Upcoming Topic: Percentiles & Quartiles.

Preparation:

Learn Linear Interpolation (used to estimate percentiles).

Understand how percentiles divide data into meaningful sections.

🔹 Key Takeaway: Understanding data types is essential for choosing the right statistical
tools and interpreting results correctly! 🚀

Extra Detailed Notes on Paired Data,
Scatter Plots & Correlation

1. Introduction to Paired Data


Definition
A paired dataset consists of two related variables collected for each individual or
observation.

Example:

Single-variable dataset: Recording only caffeine consumption of 1000 people.

Paired dataset: Recording caffeine consumption & heart rate of the same 1000
people.

Why Paired Data is Important?


Helps understand relationships between two variables.

Allows for better decision-making in data-driven applications.

Can be visualized using scatter plots to detect patterns.

2. Real-World Examples of Paired Data

| Scenario | Variable 1 | Variable 2 |
|---|---|---|
| Caffeine Study | Caffeine Consumption (mg) | Heart Rate (bpm) |
| Social Media & Sleep | Time spent on Social Media (hours) | Sleep Duration (hours) |
| Studying & Exam Performance | Study Time (hours) | Exam Score (%) |
| Weather & Beverage Sales | Temperature (°C) | Hot Beverage Sales (₹) |
| Physical Activity | Steps Taken (count) | Calories Burned (kcal) |

3. Visualizing Paired Data: Scatter Plots
Definition
A scatter plot is a graph where:

X-axis: Represents one variable.

Y-axis: Represents the second variable.

Each point represents one observation.

Example: Social Media vs Sleep Duration


X-axis: Time spent on Social Media (hours).

Y-axis: Sleep Duration (hours).

Each dot represents one person’s data.

Steps to Plot a Scatter Plot


1. Select the dataset (e.g., social media usage & sleep duration).

2. Choose one variable for X-axis (e.g., social media hours).

3. Choose the second variable for Y-axis (e.g., sleep hours).

4. Plot points for each individual’s data.

Interpreting a Scatter Plot


If points form a downward trend → Negative correlation.

If points form an upward trend → Positive correlation.

If points are random & scattered → No correlation.

4. Types of Relationships in Scatter Plots


A. Negative Relationship (Negative Correlation)
Definition: As one variable increases, the other decreases.

Example:

More time on social media → Less sleep duration.

Higher temperature → Lower hot beverage sales.

Visual Representation: Downward sloping scatter plot.

B. Positive Relationship (Positive Correlation)


Definition: As one variable increases, the other also increases.

Example:

More time spent studying → Higher exam scores.

More steps taken → More calories burned.

Visual Representation: Upward sloping scatter plot.

C. No Relationship (No Correlation)


Definition: No clear pattern between two variables.

Example:

Shoe size vs Monthly Grocery Expenditure.

Visual Representation: Randomly scattered points.

5. Understanding Strength of Correlation


What is Correlation?
Correlation measures the strength and direction of a relationship between two
variables.

Quantified using the Correlation Coefficient (r).

Interpreting Correlation Coefficient (r)

| Value of r | Strength of Correlation | Type |
|---|---|---|
| +1.0 | Perfect Correlation | Positive |
| +0.8 to +1.0 | Strong Correlation | Positive |
| +0.5 to +0.8 | Moderate Correlation | Positive |
| 0 to +0.5 | Weak Correlation | Positive |
| 0 | No Correlation | None |
| −0.5 to 0 | Weak Correlation | Negative |
| −0.8 to −0.5 | Moderate Correlation | Negative |
| −1.0 to −0.8 | Strong Correlation | Negative |
| −1.0 | Perfect Correlation | Negative |

Examples
r = +0.93 → Strong positive correlation (Study Time vs Exam Scores).
r = −0.71 → Moderate negative correlation (Social Media vs Sleep).
r = 0.01 → No correlation (Shoe Size vs Grocery Spending).

6. Correlation ≠ Causation
Key Concept
Just because two variables are correlated doesn’t mean one causes the other.

Example 1: Ice Cream Sales & Drowning Cases


Data shows ice cream sales & drowning cases increase in summer.

Does eating ice cream cause drowning? → No!

Hidden Factor: Hot weather increases both.

Example 2: Study Hours & Exam Scores


More study time = Higher exam scores.

But does studying longer always guarantee better marks? → No!

Other factors: Study quality, test anxiety, health, etc.

7. How to Calculate Correlation in Google Sheets
Using the CORREL Function
1. Select a blank cell.

2. Type:

excel

=CORREL(A2:A1000, B2:B1000)

3. Press Enter.

4. The result is the correlation coefficient (r).

Example Calculations

| Dataset | Expected Correlation (r) |
|---|---|
| Study Time vs Exam Scores | +0.93 (Strong Positive) |
| Social Media vs Sleep | −0.71 (Moderate Negative) |
| Shoe Size vs Grocery Expense | 0.01 (No Correlation) |

8. Summary & Key Takeaways

| Concept | Definition |
|---|---|
| Paired Data | Dataset containing two related variables. |
| Scatter Plot | Graph to visualize relationships between two variables. |
| Negative Correlation | One variable increases, the other decreases. |
| Positive Correlation | Both variables increase together. |
| No Correlation | No pattern or relationship. |
| Correlation Coefficient (r) | Measures strength and direction of correlation. |
| Correlation ≠ Causation | Correlation does not imply one variable causes the other. |

9. What’s Next?
Upcoming Topics:

Regression Analysis: Predicting values using paired data.

How to interpret the slope of scatter plots quantitatively.

🔹 Key Takeaway: Paired data, scatter plots, and correlation are powerful tools to analyze
relationships between variables. However, correlation alone is not proof of causation! 🚀

Linear
Interpolation

1. Introduction
Why is interpolation important?

Used to estimate unknown values between two known data points.

Assumes linear change between values.

Commonly used in statistics, engineering, machine learning, and physics.

Real-world applications:

Estimating temperature between two recorded hours.

Predicting stock market prices between two trading points.

Filling missing data in AI models.

2. Concept of Triangles & Ratios


Recall from geometry: If two lines are parallel, their segments divide proportionally.

Example: Dividing a Line into Equal Segments


1. Given a straight line from 0 to 4:

0 → 1 → 2 → 3 → 4.

The point at 3 divides the total length in a 3:1 ratio.

This ratio is consistent across parallel lines.

2. If a vertical line represents height:

Suppose at 0 → height is B.

Suppose at 4 → height is A (matching the notation of the formula below).

Height at 3 can be found using:

B + (3/4) × (A − B)

3. Linear Interpolation Formula


Definition: If only two values A and B are known, linear interpolation estimates a
missing value at any intermediate point.

Formula:

Y = B + (x / total distance) × (A − B)
Where:

Y = Interpolated value.
A = Value at one endpoint.
B = Value at the other endpoint.
x = Distance from B to the required point.
total distance = Distance between A and B.

4. Example Calculations
Example 1: Temperature Estimation

Given Data:

Temperature at 10 AM: 20°C.

Temperature at 2 PM: 30°C.

Find the estimated temperature at 12 PM.

Solution:

Ratio of time:

Total gap: 2 PM - 10 AM = 4 hours.

Distance from 10 AM to 12 PM = 2 hours.

Ratio: 2/4 = 1/2.

Using Interpolation Formula:

Y = 20 + (2/4) × (30 − 20)

= 20 + (2/4) × 10

= 20 + 5 = 25°C

Estimated temperature at 12 PM = 25°C.
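Spreadsheets can reproduce this with FORECAST, which fits a straight line through known points; with exactly two points that is the same as linear interpolation. A minimal sketch assuming the known times (10 and 14, in 24-hour form) are in A2:A3 and the temperatures (20, 30) in B2:B3 (hypothetical cells).

excel

=FORECAST(12, B2:B3, A2:A3)   # estimated temperature at 12 PM → 25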

Example 2: Predicting a Salary Increase

Given Data:

Salary in 2020: ₹50,000.

Salary in 2025: ₹80,000.

Estimate salary in 2023.

Solution:

Total gap: 5 years (2020 → 2025).

Distance from 2020 to 2023: 3 years.

Ratio: 3/5.

Using Interpolation Formula:

Y = 50,000 + (3/5) × (80,000 − 50,000)

= 50,000 + (3/5) × 30,000

= 50,000 + 18,000 = ₹68,000

Estimated salary in 2023 = ₹68,000.

5. Practical Uses of Linear Interpolation

| Field | Use Case |
|---|---|
| Weather Prediction | Estimate temperature between recorded data points. |
| Finance | Estimate stock prices between two known values. |
| Engineering | Predict voltage, pressure, or force between measurements. |
| Machine Learning | Fill missing data points in datasets. |
| Gaming & Graphics | Smooth animations & scaling. |

6. Key Takeaways

| Concept | Definition |
|---|---|
| Linear Interpolation | Estimating a value between two known points. |
| Formula | Y = B + (x / total distance) × (A − B) |
| Key Assumption | Values change linearly between points. |
| Common Applications | Weather forecasting, finance, AI, engineering. |

7. What’s Next?
Next Topic: Percentiles & Quartiles.

Prerequisite: Understand interpolation, because percentiles require estimating ranks within datasets.

🔹 Key Takeaway: Linear interpolation is a powerful tool for estimating unknown values
between two known points, widely used in science, finance, and AI! 🚀

Extra Detailed Notes on Percentiles,
Quartiles, and Linear Interpolation

1. Introduction to Percentiles
Definition
A percentile is a measure that indicates the relative standing of a value within a
dataset.

It tells us the percentage of data points below a given value.

Key Concept
Percentile ≠ Percentage Score

If a student scores in the 80th percentile, it means they performed better than 80%
of students—not that they scored 80% marks.

Example: If a baby’s weight is in the 90th percentile, it means the baby is heavier
than 90% of other babies of the same age.

2. Understanding Quartiles (Q1, Q2, Q3)


Quartiles divide data into four equal parts.

Key Quartiles:

Q1 (First Quartile / 25th Percentile): 25% of data lies below this value.

Q2 (Second Quartile / 50th Percentile / Median): 50% of data lies below this value.

Q3 (Third Quartile / 75th Percentile): 75% of data lies below this value.

| Quartile | Percentile Equivalent | Meaning |
|---|---|---|
| Q1 | 25th percentile | 25% of data falls below Q1 |
| Q2 (Median) | 50th percentile | 50% of data falls below Q2 |
| Q3 | 75th percentile | 75% of data falls below Q3 |

Relation to Data Spread


The wider the gap between quartiles, the more spread out the data.

Used in Box Plots for visualizing data distribution.

3. Calculating Percentiles
Step 1: Sort the Data in Ascending Order
Percentiles work with ranked (ordered) data.

Step 2: Compute Rank Index


Formula:

I = (P / 100) × (N + 1)

Where:

P = Desired percentile.
N = Total number of values in the dataset.
I = Rank index (can be an integer or decimal).

4. Interpreting the Rank Index


Case 1: Integer Rank (Exact Value)
If I is an integer, the percentile value is the data point at that position.

Case 2: Decimal Rank (Interpolation Needed)
If I is not an integer, the percentile value is estimated using linear interpolation.

Linear Interpolation Formula

Y = V_lower + (I − I_lower) × (V_upper − V_lower)

Where:

V_lower = Value at lower rank.

V_upper = Value at upper rank.

I_lower = Lower rank.

I = Computed rank index.

5. Example Calculations
Example 1: Finding Q1 (25th Percentile)

Given Sorted Data:

15, 18, 21, 34, 38, 40, 45, 50, 50, 55, 60, 65, 70, 75, 80, 95

Total values: N = 16.


Finding Q1 (25th percentile):

I = (25 / 100) × (16 + 1) = 4.25

Rank 4 = 34.

Rank 5 = 38.

Interpolation:

Y = 34 + (0.25 × (38 − 34))

= 34 + (0.25 × 4) = 34 + 1 = 35

Result: Q1 = 35.

Example 2: Finding Q2 (50th Percentile / Median)
Finding Q2 (50th percentile):

I = (50 / 100) × (16 + 1) = 8.5

Rank 8 = 50.

Rank 9 = 50.

Since both values are equal, Q2 = 50.

Example 3: Finding Q3 (75th Percentile)


Finding Q3 (75th percentile):

I = (75 / 100) × (16 + 1) = 12.75

Rank 12 = 65.

Rank 13 = 70.

Interpolation:

Y = 65 + (0.75 × (70 − 65))

= 65 + (0.75 × 5) = 65 + 3.75 = 68.75

Result: Q3 = 68.75.
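These hand calculations follow the rank formula I = (P/100) × (N + 1), which is the same convention PERCENTILE.EXC uses, so the function can serve as a check. A minimal sketch assuming the 16 sorted values are in A2:A17 (hypothetical range).

excel

=PERCENTILE.EXC(A2:A17, 0.25)   # Q1 = 35
=PERCENTILE.EXC(A2:A17, 0.50)   # Q2 = 50
=PERCENTILE.EXC(A2:A17, 0.75)   # Q3 = 68.75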

6. Key Takeaways

| Concept | Definition |
|---|---|
| Percentile | Percentage of data values below a given value. |
| Quartiles (Q1, Q2, Q3) | Divide data into four equal parts. |
| Rank Index Formula | I = (P / 100) × (N + 1) |
| Linear Interpolation | Used when I is not an integer. |
| Q2 = Median | 50th percentile is the median. |

7. Next Steps
Next Topic: Box Plots & Outlier Detection.

Preparation:

Review quartiles & percentiles.

Understand how data spread affects interpretation.

🔹 Key Takeaway: Percentiles and quartiles help in ranking and understanding data
spread. When rank falls between two values, linear interpolation is used! 🚀

Extra Detailed Notes on Percentiles,
Quartiles & Google Sheets Functions

1. Introduction to Percentiles and Quartiles


Definition
Percentile: A value below which a given percentage of data falls.

Quartile: Special percentiles dividing data into four equal parts.

Key Quartiles
Q1 (First Quartile / 25th Percentile) → 25% of values below.

Q2 (Median / 50th Percentile) → 50% of values below.

Q3 (Third Quartile / 75th Percentile) → 75% of values below.

Example:

If Q1 = 29, it means 25% of data is below 29.

If Q2 = 46.5, it means 50% of data is below 46.5.

If Q3 = 67.25, it means 75% of data is below 67.25.

2. Sorting Data for Percentile Calculations


Raw Data is Jumbled → First step is to sort it in ascending order.

Steps in Google Sheets:

1. Select the column.

2. Click Data → Sort Range.

3. Check "Data has a header row" (if applicable).

4. Select Ascending Order.

3. Finding Quartiles (Q1, Q2, Q3) Using Rank Index


Formula
Rank Index Formula:

I = (P / 100) × (N + 1)

Where:

P = Percentile (e.g., 25 for Q1).

N = Number of values.

I = Rank index.

Example Calculation for Q1 (25th Percentile)

Step 1: Compute Rank Index

I = (25 / 100) × (20 + 1) = 5.25

The 5th and 6th values in the sorted data are used.

Step 2: Identify Values at Ranks

5th value = 28

6th value = 32

Step 3: Linear Interpolation

Q1 = 28 + (0.25 × (32 − 28))

= 28 + (0.25 × 4) = 28 + 1 = 29

Result:
Q1 = 29 (25% of values are below 29).

Example Calculation for Q2 (50th Percentile / Median)

Step 1: Compute Rank Index

I = (50 / 100) × (20 + 1) = 10.5

The 10th and 11th values in the sorted data are used.

Step 2: Identify Values at Ranks

10th value = 45

11th value = 48

Step 3: Linear Interpolation

Q2 = 45 + (0.5 × (48 − 45))

= 45 + (0.5 × 3) = 45 + 1.5 = 46.5

Result:
Q2 = 46.5 (50% of values are below 46.5).

Example Calculation for Q3 (75th Percentile)

Step 1: Compute Rank Index

I = (75 / 100) × (20 + 1) = 15.75

The 15th and 16th values in the sorted data are used.

Step 2: Identify Values at Ranks

15th value = 65

16th value = 68

Step 3: Linear Interpolation

Q3 = 65 + (0.75 × (68 − 65))

= 65 + (0.75 × 3) = 65 + 2.25 = 67.25

Result:
Q3 = 67.25 (75% of values are below 67.25).

4. Minimum & Maximum Values

| Statistic | Value |
|---|---|
| Minimum (0th Percentile) | 12 |
| Q1 (25th Percentile) | 29 |
| Q2 (50th Percentile / Median) | 46.5 |
| Q3 (75th Percentile) | 67.25 |
| Maximum (100th Percentile) | 90 |

5. Different Quartile Calculation Methods

| Method | Formula Used |
|---|---|
| Standard Method (Used in R, NumPy, SPSS) | I = (P / 100) × (N + 1) |
| Excel & Google Sheets Method | I = (P / 100) × (N − 1) + 1 |

Effect of Different Methods


The Excel/Google Sheets method sometimes produces slightly different quartiles.

Example (Q1 Calculation in Google Sheets):

Using (N − 1) formula: Q1 = 31 (instead of 29).


Q3 is also slightly higher.

6. Google Sheets Functions for Quartiles

Function: PERCENTILE.EXC (Exclusive Percentiles)
Uses N + 1 formula.
Formula:

excel

=PERCENTILE.EXC(A2:A21, 0.25) # Q1
=PERCENTILE.EXC(A2:A21, 0.50) # Q2 (Median)
=PERCENTILE.EXC(A2:A21, 0.75) # Q3

Excludes 0th and 100th percentile.

Function: PERCENTILE.INC (Inclusive Percentiles)


Uses N − 1 formula.
Formula:

excel

=PERCENTILE.INC(A2:A21, 0.25) # Q1
=PERCENTILE.INC(A2:A21, 0.50) # Q2 (Median)
=PERCENTILE.INC(A2:A21, 0.75) # Q3

Includes 0th and 100th percentile.

Function: MEDIAN (For Q2)


Directly calculates Q2 (Median).

Formula:

excel

=MEDIAN(A2:A21)

Function: MIN and MAX


Finds the lowest and highest values.

Formula:

excel

=MIN(A2:A21) # Minimum
=MAX(A2:A21) # Maximum

7. Summary & Key Takeaways

| Concept | Key Insight |
|---|---|
| Percentile | Tells how much data falls below a given value. |
| Quartile | Special percentiles: Q1 (25th), Q2 (50th), Q3 (75th). |
| Rank Index Formula | I = (P / 100) × (N + 1). |
| Linear Interpolation | Used when rank is not an integer. |
| Google Sheets Functions | PERCENTILE.EXC, PERCENTILE.INC, MEDIAN, MIN, MAX. |
| Different Methods | Excel/Google Sheets use the N − 1 method, while R & NumPy use N + 1. |

8. What’s Next?
Upcoming Topic: Box Plots & Outlier Detection.

Preparation:

Understand interquartile range (IQR).

Learn about identifying extreme outliers using quartiles.

🔹 Key Takeaway: Percentiles & Quartiles help in ranking data, and Google Sheets
functions provide different ways to compute them. Always be mindful of which formula is
used! 🚀

Box plots
Grouped Data

1. Introduction to Box Plots


What is a Box Plot?
A box plot (also called a box-and-whisker plot) is a graphical representation of a
dataset's spread.

It shows median, quartiles, interquartile range (IQR), outliers, and skewness in the
data.

Used to compare distributions between multiple groups in a dataset.

Key Components of a Box Plot

| Component | Meaning |
|---|---|
| Box | Represents IQR (middle 50%) of data |
| Whiskers | Extend to 1.5 times the IQR from Q1 and Q3 |
| Median (Q2 Line) | Middle value of the dataset |
| Outliers | Data points outside whiskers, considered unusual |
| Whisker Limits | Extend up to minimum and maximum non-outlier values |

2. Understanding the Interquartile Range (IQR)


Definition
IQR = Q3 - Q1

Represents the middle 50% of the data.

A large IQR indicates greater variability, while a small IQR suggests closely packed
data.

Whiskers & Outliers


Whiskers extend to 1.5 × IQR beyond Q1 and Q3.

Any values beyond whiskers are outliers.

3. Box Plots for Grouped Data


What is Grouped Data?
Instead of individual data points, data is grouped into categories or bins.

Example: Grouping age into 20s, 30s, 40s, etc..

Why Use Box Plots for Grouped Data?


Helps in comparing distributions across different categories.

Identifies trends, variations, and outliers within each group.

Example: Salary Distribution by Job Sector


X-Axis: Different job sectors (Education, Finance, Healthcare, Retail, Tech).

Y-Axis: Salary ranges.

Each job sector has its own box plot.

4. Insights from a Box Plot


A. Median Position
Median (Q2) tells the typical value of a dataset.

Comparing medians across categories helps in understanding which group has higher/lower typical values.
B. Interquartile Range (IQR)
A wider IQR → More variation in salaries.

A narrower IQR → Salaries closely packed together.

C. Skewness of Data

| Skewness Type | Median Position | Interpretation |
|---|---|---|
| Symmetric | Median at center of box | Data is evenly spread |
| Right-Skewed (Positive Skew) | Median near bottom of box | More high-value outliers |
| Left-Skewed (Negative Skew) | Median near top of box | More low-value outliers |

D. Identifying Outliers
Outliers appear as dots beyond whiskers.

Example:

Tech sector might have outlier salaries (e.g., startup founders).

Education sector might have low-end outliers (e.g., part-time teachers).

5. Comparing Salary Distributions Across Sectors


A. Education Sector
Narrow box → Salaries are consistent.

Low median → Salaries relatively lower than other sectors.

Few outliers → Some very low and very high salaries.

B. Finance Sector
Higher median → Salaries are higher on average.

Moderate IQR → Some variation in salaries.

A few high outliers → Likely highly paid executives.

C. Healthcare Sector

Median slightly above center → Slight right skew.

Moderate IQR → Less variation in salaries than finance.

Few outliers → Not many extreme cases.

D. Retail Sector
Low median → One of the lowest paying sectors.

Small IQR → Most employees earn similar salaries.

Few outliers → Not much variation.

E. Tech Sector
One of the highest medians → High-paying industry.

Smaller IQR than Finance → Less variation.

Outliers at high end → Some extremely high salaries.

6. Box Plots & Data Interpretation Beyond Salaries


A. Movie Review Scores
X-Axis: Movies

Y-Axis: Review scores (out of 100).

Box Plot Insights:

Low IQR → Most people agree on the rating.

High IQR → Polarizing opinions on the movie.

Outliers → Some extreme good/bad ratings.

B. Student Test Scores


X-Axis: Different subjects

Y-Axis: Scores (out of 100).

Insights:

Low median for math → Math is hardest subject.

Wide IQR in English → High variation in performance.

Few high outliers in science → Some students excel.

7. Key Takeaways

| Concept | Insight |
|---|---|
| Box Plot | Visualizes spread & skewness in data. |
| IQR | Measures middle 50% range. |
| Whiskers | Extend 1.5 × IQR beyond Q1 & Q3. |
| Outliers | Values beyond whiskers. |
| Grouped Data | Allows comparison across categories. |
| Skewness | Right-skewed → More high values, Left-skewed → More low values. |

8. Next Steps
Next Topic: Advanced statistical analysis with box plots.

Key Focus:

Detecting Outliers Mathematically.

Using Box Plots for Decision Making.

🔹 Key Takeaway: Box plots are powerful tools for comparing distributions, detecting
outliers, and understanding variability in grouped data! 🚀

Box Plots &
Outlier Detection

1. Introduction to Box Plots


Definition
A box plot (box-and-whisker plot) is a graphical tool used to summarize and visualize
the distribution of continuous data.

It provides a quick snapshot of:

Central tendency (Median, Q2)

Spread (IQR – Interquartile Range)

Variability

Presence of outliers

Alternative names:

Candlestick plots (in Google Sheets)

Box-and-whisker plots (in statistics)

Why Use a Box Plot?


Useful for comparing multiple distributions (e.g., salaries in different job sectors).

Detects skewness (whether data is symmetric or not).

Identifies potential outliers (values that lie outside the expected range).

2. Components of a Box Plot

| Component | Definition |
|---|---|
| Box (Middle 50% of Data) | Represents the Interquartile Range (IQR), which contains the middle 50% of values. |
| Q1 (First Quartile – 25th Percentile) | 25% of data falls below this value. |
| Q2 (Median – 50th Percentile) | The middle value of the dataset. |
| Q3 (Third Quartile – 75th Percentile) | 75% of data falls below this value. |
| IQR (Interquartile Range) | IQR = Q3 - Q1. Measures the spread of the middle 50% of data. |
| Whiskers | Extend from Q1 and Q3 to 1.5 × IQR beyond the quartiles. |
| Outliers | Data points beyond whiskers are considered outliers. |

3. Understanding Interquartile Range (IQR)


Formula:

IQR = Q3 − Q1

The IQR represents the range where the middle 50% of data lies.

Smaller IQR → Data is closely packed.

Larger IQR → Data is more spread out.

How Whiskers are Defined


Upper whisker limit:

Q3 + 1.5 × IQR
Lower whisker limit:

Q1 − 1.5 × IQR
Data points beyond whiskers are outliers.

4. Example: Video Game Review Scores
Dataset (Review Scores from 15 Players)

73, 78, 79, 80, 81, 83, 84, 85, 87, 88, 90, 92, 95, 97, 99

Step 1: Compute the Five-Number Summary

| Statistic | Value |
|---|---|
| Minimum | 73 |
| Q1 (25th percentile) | 78 |
| Q2 (Median, 50th percentile) | 85 |
| Q3 (75th percentile) | 92 |
| Maximum | 99 |

Step 2: Calculate IQR

IQR = Q3 − Q1 = 92 − 78 = 14

Step 3: Determine Whisker Limits


Upper whisker limit:

Q3 + 1.5 × IQR = 92 + (1.5 × 14) = 92 + 21 = 113


Lower whisker limit:

Q1 − 1.5 × IQR = 78 − (1.5 × 14) = 78 − 21 = 57

Step 4: Check for Outliers


Maximum value = 99 (within limit 113 ✅).
Minimum value = 73 (within limit 57 ✅).

Conclusion: No outliers in this dataset.
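The five-number summary and whisker limits can be computed directly from the scores. A minimal sketch assuming the 15 review scores are in A2:A16 (hypothetical range). Note that QUARTILE uses the inclusive method, so its Q1 and Q3 may differ slightly from the values worked out by hand above; the fence logic is the same either way.

excel

=QUARTILE(A2:A16, 1)                                                       # Q1
=QUARTILE(A2:A16, 3)                                                       # Q3
=QUARTILE(A2:A16, 3) - QUARTILE(A2:A16, 1)                                 # IQR
=QUARTILE(A2:A16, 3) + 1.5 * (QUARTILE(A2:A16, 3) - QUARTILE(A2:A16, 1))   # upper fence
=QUARTILE(A2:A16, 1) - 1.5 * (QUARTILE(A2:A16, 3) - QUARTILE(A2:A16, 1))   # lower fence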

5. Visualizing the Box Plot


X-Axis: Players.

Y-Axis: Review scores (0-100 scale).

Plot Elements:

Box from Q1 (78) to Q3 (92).

Line inside the box at Q2 (85, the median).

Whiskers extending from 73 to 99.

No outliers in this case.

6. Effect of an Outlier
New Dataset with an Outlier

40, 73, 78, 79, 80, 81, 83, 84, 85, 87, 88, 90, 92, 95, 97, 99

Added outlier = 40.

Updated Calculations

| Statistic | Value |
|---|---|
| Minimum | 40 (Outlier) |
| Q1 (25th percentile) | 76.5 |
| Q2 (Median, 50th percentile) | 84.5 |
| Q3 (75th percentile) | 91.5 |
| Maximum | 99 |

Updated IQR Calculation

IQR = Q3 − Q1 = 91.5 − 76.5 = 15

Updated Whisker Limits


Upper whisker limit:

Q3 + 1.5 × IQR = 91.5 + (1.5 × 15) = 91.5 + 22.5 = 114


Lower whisker limit:

Q1 − 1.5 × IQR = 76.5 − (1.5 × 15) = 76.5 − 22.5 = 54

Outlier Identification
New minimum = 40 (which is below 54, making it an outlier).

Whiskers now extend only to 73 (smallest non-outlier value).

7. Interpretation of Outlier Impact

| Observation | Insight |
|---|---|
| Median shift from 85 to 84.5 | Slightly affected by the outlier. |
| IQR increased from 14 to 15 | Greater spread due to outlier. |
| Whisker now extends only to 73 | Whiskers adjust to non-outlier values. |
| New outlier detected at 40 | One player gave an extreme score. |

Key Takeaways
Outliers do not significantly change the central tendency (median).

Outliers increase variability (IQR, spread of data).

Box plots effectively highlight unusual data points.

8. Summary & Key Learnings

| Concept | Definition |
|---|---|
| Box Plot | Summarizes distribution of continuous data. |
| IQR (Interquartile Range) | Middle 50% of values (Q3 - Q1). |
| Whiskers | Extend 1.5 × IQR beyond Q1 & Q3. |
| Outliers | Data points beyond whiskers. |
| Effect of Outliers | Median barely changes, IQR increases. |

9. What’s Next?
Upcoming Topic: Comparing Multiple Box Plots in Grouped Data.

Key Focus:

Comparing distributions across categories.

Identifying trends and outlier effects.

🔹 Key Takeaway: Box plots are essential tools for detecting data spread, skewness, and
outliers. They provide a clear view of the middle 50% of data and help identify extreme
values! 🚀

Box Plots for
Grouped Data & Outlier Detection

1. Introduction to Box Plots for Grouped Data


Definition
A box plot (box-and-whisker plot) is a statistical tool used to visualize the distribution,
variability, and outliers in a dataset.

In grouped data, multiple box plots are created to compare distributions across
different categories.

Example: Comparing blood pressure levels across different age groups.

Why Use Box Plots for Grouped Data?


Easier Comparison: Helps in understanding differences in distributions across
categories.

Identifies Variability: Some groups may have higher variability than others.

Detects Outliers: Highlights extreme values within each category.

Summarizes Large Data: Converts large datasets into an easy-to-interpret visual.

2. Five-Number Summary for Box Plot Construction


To construct a box plot, we need five key statistical values:

| Component | Definition |
|---|---|
| Minimum | The smallest value in the dataset. |
| Q1 (First Quartile, 25th Percentile) | 25% of the data falls below this value. |
| Q2 (Median, 50th Percentile) | Middle value of the dataset. |
| Q3 (Third Quartile, 75th Percentile) | 75% of the data falls below this value. |
| Maximum | The largest value in the dataset. |

The box plot is based on quartiles (Q1, Q2, Q3) and the interquartile range (IQR).

3. Interquartile Range (IQR) & Whisker Limits


What is the IQR?

IQR = Q3 − Q1

The IQR represents the middle 50% of the data.

Higher IQR → Data is more spread out.

Lower IQR → Data is more clustered together.

Whisker Limits & Outlier Detection


Upper Fence (Maximum Expected Value):

Q3 + 1.5 × IQR
Lower Fence (Minimum Expected Value):

Q1 − 1.5 × IQR
Any values beyond these fences are considered outliers.

4. Box Plot Example: Blood Pressure Across Age Groups


Dataset Description
We have a dataset of 500 people with their age and blood pressure.

Age groups are categorized as:

20s (20-29 years)

30s (30-39 years)

40s (40-49 years)

50s (50-59 years)

60s (60-69 years)

70+ (70 years and above)

Step 1: Grouping Data


The dataset has raw age values.

We categorize them into age groups:

If age ≤ 29 → "20s"

If 30 ≤ age ≤ 39 → "30s"

If 40 ≤ age ≤ 49 → "40s"

If 50 ≤ age ≤ 59 → "50s"

If 60 ≤ age ≤ 69 → "60s"

If age ≥ 70 → "70+"

Step 2: Filtering Data


We use a filter function in Google Sheets:

excel

=FILTER(BloodPressure, AgeGroup="20s")

This extracts blood pressure values only for 20s group.

We repeat this for each age group.
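The filtered values can be passed straight into the summary functions, so each group's five-number summary comes from one formula per statistic. A minimal sketch assuming blood pressure is in B2:B501 and the age-group labels are in C2:C501 (hypothetical columns).

excel

=MIN(FILTER($B$2:$B$501, $C$2:$C$501="20s"))           # minimum for the 20s group
=QUARTILE(FILTER($B$2:$B$501, $C$2:$C$501="20s"), 1)   # Q1 for the 20s group
=MEDIAN(FILTER($B$2:$B$501, $C$2:$C$501="20s"))        # median (Q2)
=QUARTILE(FILTER($B$2:$B$501, $C$2:$C$501="20s"), 3)   # Q3
=MAX(FILTER($B$2:$B$501, $C$2:$C$501="20s"))           # maximum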

Step 3: Computing Five-Number Summary for Each Group

| Age Group | Min | Q1 | Median (Q2) | Q3 | Max |
|---|---|---|---|---|---|
| 20s | 85 | 110 | 120 | 130 | 145 |
| 30s | 94 | 112 | 120 | 127 | 150 |
| 40s | 98 | 115 | 125 | 135 | 160 |
| 50s | 100 | 118 | 130 | 140 | 165 |
| 60s | 105 | 120 | 135 | 150 | 175 |
| 70+ | 110 | 125 | 140 | 160 | 185 |

Step 4: Calculating Whiskers & Outliers


IQR for 30s:

IQR = Q3 − Q1 = 127 − 112 = 15


Upper Whisker Limit:

127 + (1.5 × 15) = 149.5


Lower Whisker Limit:

112 − (1.5 × 15) = 89.5


Outliers:

Any value above 149.5 or below 89.5 is an outlier.

The maximum of 150 is above 149.5, so it is an outlier in the 30s group (the group's minimum of 94 is within the lower limit).

5. Box Plot Interpretation for Age Groups


A. Age Group 20s
Median = 120 (centered in normal range).

No significant outliers.

Wide IQR → Indicates variability in blood pressure.

B. Age Group 30s


Median = 120.

Smallest IQR among all groups → Least variation in blood pressure.

Few outliers at the high end (150).

C. Age Group 40s


Median = 125.

Higher IQR than 30s → More variability.

Slightly skewed distribution.

D. Age Group 50s


Median = 130.

Large IQR → Higher variability in blood pressure.

Outliers start appearing frequently.

E. Age Group 60s & 70+


Largest IQRs → Most variability in blood pressure.

Median shifts to 135+.

Several high outliers beyond 160+.

6. Visualizing the Box Plot in Google Sheets


Steps to Create a Box Plot
1. Select the five-number summary data.

2. Go to Insert → Chart.

3. Choose "Candlestick Chart" (used for Box Plots in Google Sheets).

4. Modify chart settings:

Set Age Groups as X-axis.

Set Blood Pressure Ranges as Y-axis.

Check "Swap rows/columns" if needed.

7. Key Insights from Box Plots

| Observation | Interpretation |
|---|---|
| 20s & 30s have the smallest IQRs | Blood pressure is more stable in younger individuals. |
| 40s & 50s show increased IQR | More variation in blood pressure, likely due to lifestyle and health factors. |
| 60s & 70+ have the largest IQRs | Older individuals have widely varying blood pressure levels. |
| Outliers in 30s & 70+ groups | Unusual cases, possibly due to high blood pressure disorders. |

8. Summary & Key Takeaways

| Concept | Key Insight |
|---|---|
| Box Plot | A powerful tool for visualizing data distribution. |
| IQR | Middle 50% of values, used to determine spread. |
| Whiskers | Extend 1.5 × IQR beyond Q1 & Q3. |
| Outliers | Values beyond whiskers, indicating unusual observations. |
| Grouped Box Plots | Useful for comparing distributions across categories. |

9. Next Steps
Upcoming Topic: Comparing Multiple Box Plots for Statistical Analysis.

Focus Areas:

Detecting Skewness & Variability

Using Box Plots for Medical & Financial Analysis

🔹 Key Takeaway: Box plots provide a clear way to analyze data distribution, detect
outliers, and compare groups. They are widely used in data analysis, healthcare, and
finance! 🚀

