FSP Notes
1. What is Statistics?
Definition: Statistics is the art of learning from data.
The study of statistics can be broken down into four main stages:
2. Data Collection
First step in statistics: You need data to analyze before making any conclusions.
2. Conducting Experiments:
3. Example: Experiment on Teaching Styles
Scenario: A teacher wants to determine which teaching style is more effective.
Approach:
If Group 1 consists of novice learners and Group 2 has experienced students, then
differences in scores might not be due to teaching styles.
4. Data Description
After collecting data, it must be organized and summarized.
This includes:
Summary Measures:
Mean (Average)
The branch of statistics that deals with data description is called Descriptive Statistics.
The Role of Probability in Statistical Analysis
Could results occur just by chance?
Example:
Example: If we want to find the average age of all residents in a town, the entire town’s
residents make up the population.
What is a Sample?
Definition: A subset of the population that is examined to make conclusions about the
whole.
Example: Instead of surveying all residents in the town, we randomly select 100 people
and use their ages to estimate the town's average age.
Sampling Bias
A sample must be representative of the population.
If we collect data from only people entering a library, the sample may
overrepresent younger people.
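The idea of estimating a population average from a random sample can be sketched in Python. This is a minimal illustration, not part of the original notes: the town's ages are simulated (an assumption), and the names `population_ages` and `sample_ages` are hypothetical.

```python
import random
import statistics

rng = random.Random(7)   # fixed seed so the run is repeatable

# Hypothetical town: 5,000 residents with ages drawn uniformly from 0-90.
population_ages = [rng.randint(0, 90) for _ in range(5000)]

# Simple random sample of 100 residents, drawn without replacement.
sample_ages = rng.sample(population_ages, 100)

print("population mean age:", round(statistics.fmean(population_ages), 1))
print("sample estimate:    ", round(statistics.fmean(sample_ages), 1))
```

Because every resident has the same chance of being picked, the sample mean tends to land close to the population mean. A sample drawn only from one location (such as the library entrance) would not have this property.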
7. Example: Election Polling
Scenario: Predicting the winner between Party A and Party B.
C) Obtain a voter registration list and randomly select 100 names. ✅ (Best method
—random selection of eligible voters.)
Problem: During the Great Depression, only wealthy people owned cars and
phones.
Result: The sample did not represent the whole population, leading to incorrect
predictions.
Lesson Learned:
1. Data Collection
2. Data Description
3. Data Analysis
Real-world case studies (like the 1936 US election polling failure) show the importance
of proper sampling methods.
9. What’s Next?
In the next session, we will dive deeper into Descriptive Statistics:
Descriptive Statistics: Organizing & Visualizing Data
1. Introduction to Descriptive Statistics
Descriptive statistics involves organizing and summarizing data.
2. Organizing Data
Example Scenario: Streaming Platform Preferences
A survey was conducted with 60 students to determine their preferred streaming
platform.
The options included: Netflix, Prime, Disney Hotstar, YouTube, Z5, and Sony Liv.
Frequency Table
Definition: A table displaying each category and the number of occurrences (frequency).
Example Structure:
Streaming Platform | Frequency (Votes)
Netflix | 12
Prime | 10
Hotstar | 15
YouTube | 8
Z5 | 9
Sony Liv | 6
3. Visualizing Data
Types of Graphs
Formula:
$\text{Relative Frequency} = \dfrac{\text{Frequency of a category}}{\text{Total number of values } (N)}$
Example Calculation:
Netflix: $\dfrac{12}{60} = 0.20$ (20%)
Hotstar: $\dfrac{15}{60} = 0.25$ (25%)
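The relative frequency calculation can be sketched in Python as follows. The vote counts are the ones from the streaming survey in these notes; the function name `relative_frequencies` is a hypothetical helper chosen for illustration.

```python
# Vote counts from the streaming platform survey (60 students in total).
votes = {"Netflix": 12, "Prime": 10, "Hotstar": 15, "YouTube": 8, "Z5": 9, "Sony Liv": 6}

def relative_frequencies(freq):
    """Divide each category's frequency by the total number of values (N)."""
    total = sum(freq.values())
    return {category: count / total for category, count in freq.items()}

rel = relative_frequencies(votes)
for platform, share in rel.items():
    print(f"{platform}: {share:.2f} ({share:.0%})")
```

The relative frequencies add up to 1, which is why they map directly onto the slices of a pie chart.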
Visual Representation: Pie Chart
Class Boundaries
Example (Marks Distribution):
Interval: 35–40
Total of all Y-values should sum to 1 (or 100%).
Example:
Class Interval | Frequency | Cumulative Frequency
0 - 10 | 5 | 5
11 - 20 | 8 | 13 (5+8)
21 - 30 | 12 | 25 (5+8+12)
Formula:
$\text{Relative Cumulative Frequency} = \dfrac{\text{Cumulative Frequency}}{\text{Total Values}}$
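As a small sketch of the running-total idea, the Python below builds the cumulative and relative cumulative frequencies for the three class intervals in the example. The variable names are illustrative only.

```python
from itertools import accumulate

intervals = ["0 - 10", "11 - 20", "21 - 30"]
frequencies = [5, 8, 12]

cumulative = list(accumulate(frequencies))            # running totals: 5, 13, 25
total = cumulative[-1]                                # last running total = total values
relative_cumulative = [c / total for c in cumulative]

for interval, c, rc in zip(intervals, cumulative, relative_cumulative):
    print(f"{interval}: cumulative = {c}, relative cumulative = {rc:.2f}")
```

The last relative cumulative frequency is always 1 (100%), since by then every value has been counted.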
Data: Lifetimes of 200 bulbs ranging from 500 to 1500 hours.
Steps:
If 51% of bulbs last 1000 hours or less, then 1000 hours is a good warranty period.
Reliability Calculation
Example:
Visualizing Data:
Grouped Data:
Cumulative Frequency:
Real-World Applications:
Central Tendency
Formula:
$\text{Mean } (\bar{x}) = \dfrac{\sum x_i}{N}$
Where:
Example Calculation:
Data: 10, 15, 20
Mean: $\dfrac{10 + 15 + 20}{3} = \dfrac{45}{3} = 15$
$\dfrac{512}{20} = 25.6$ GB
Conclusion:
However, most students use less than this due to the 100 GB outlier.
If N is odd: The median is the value at position $\dfrac{N + 1}{2}$.
If N is even: The median is the average of the two middle values.
Even N: Take the average of the two middle values.
5, 8, 10, 10, 12, 12, 12, 15, 15, 15, 18, 18, 20, 20, 22, 25, 30, 35, 40, 100
2. Middle Position:
N = 20 (Even)
Middle two values: 10th and 11th values → 15 & 18.
Median: $\dfrac{15 + 18}{2} = 16.5$ GB
3. Conclusion:
12 occurs 3 times.
15 occurs 3 times.
Modes: 12 GB & 15 GB (Bimodal Dataset).
Conclusion:
Mode shows the most common data usage trends.
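The median and mode results for the data usage example can be checked with Python's standard `statistics` module. This is only a sketch; the list `usage_gb` repeats the 20 values from the median example.

```python
import statistics

# Monthly data usage in GB, as listed in the median example.
usage_gb = [5, 8, 10, 10, 12, 12, 12, 15, 15, 15,
            18, 18, 20, 20, 22, 25, 30, 35, 40, 100]

median = statistics.median(usage_gb)     # N is even: average of the 10th and 11th values
modes = statistics.multimode(usage_gb)   # every value tied for the highest count

print("median:", median)   # 16.5
print("modes:", modes)     # [12, 15]
```

`multimode` is used instead of `mode` because the dataset is bimodal, and `mode` would report only one of the two values.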
Salaries in a Company (Bill Gates example – outliers affect mean).
Real-Life Example:
10 Microsoft employees:
A telecom provider should offer 12GB & 15GB plans, as those are the most used.
A telecom company sees Mean = 25.6 GB and assumes 30GB plans are best.
Issue? Most students use 12GB or 15GB → 30GB plan isn't necessary.
8. Recap & Final Thoughts
Mean: Best for overall trends, but affected by outliers.
9. What’s Next?
Upcoming Topic: Sample Mean & Variance Calculations
Stay tuned! 🚀
Mean, Variance, and Standard Deviation
$\mu = \dfrac{\sum x_i}{N}$
Where:
xi = individual values.
30, 50, 50, 60, 60, 60, 70, 80, 90, 90
Mean Calculation:
$\dfrac{30 + 50 + 50 + 60 + 60 + 60 + 70 + 80 + 90 + 90}{10} = \dfrac{640}{10} = 64$
Notations:
Population mean: μ.
Sample mean: $\bar{x}$.
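Marks ($x_i$) | Frequency ($f_i$)
30 | 1
40 | 0
50 | 2
60 | 3
70 | 1
80 | 1
90 | 2
$\mu = \dfrac{\sum (x_i \cdot f_i)}{\sum f_i}$
Mean Calculation:
$\sum (x_i \cdot f_i) = 30 + 0 + 100 + 180 + 70 + 80 + 180 = 640$ and $\sum f_i = 10$, so Mean $= \dfrac{640}{10} = 64$.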
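The frequency-table formula can be sketched in Python. The dictionary maps each mark to its frequency as given in the table; the function name `mean_from_frequencies` is a hypothetical helper, not something from the original notes.

```python
# Marks and their frequencies, as in the frequency table.
frequency_table = {30: 1, 40: 0, 50: 2, 60: 3, 70: 1, 80: 1, 90: 2}

def mean_from_frequencies(table):
    """Mean = sum(x_i * f_i) / sum(f_i)."""
    total_count = sum(table.values())
    if total_count == 0:
        raise ValueError("frequencies must not all be zero")
    weighted_sum = sum(value * freq for value, freq in table.items())
    return weighted_sum / total_count

print(mean_from_frequencies(frequency_table))   # 640 / 10 = 64.0
```

Working from the table gives the same result as adding the ten raw marks one by one, because each value is simply counted as many times as it occurs.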
Example
Consider two datasets:
5. Variance: Measuring Data Spread
Definition
Variance ($\sigma^2$) tells us how far values are from the mean.
$\sigma^2 = \dfrac{\sum (x_i - \mu)^2}{N}$
Where:
μ = Mean.
N = Total number of values.
Variance for Sample Data:
$s^2 = \dfrac{\sum (x_i - \bar{x})^2}{N - 1}$
Example Calculation
Marks: 30, 50, 50, 60, 60, 60, 70, 80, 90, 90.
With $\mu = \dfrac{640}{10} = 64$:
Value ($x_i$) | $x_i - \mu$ | $(x_i - \mu)^2$
30 | -34 | 1156
50 | -14 | 196
50 | -14 | 196
60 | -4 | 16
60 | -4 | 16
60 | -4 | 16
70 | 6 | 36
80 | 16 | 256
90 | 26 | 676
90 | 26 | 676
Total | 0 | 3240
$\sigma^2 = \dfrac{3240}{10} = 324$
$s^2 = \dfrac{3240}{10 - 1} = \dfrac{3240}{9} = 360$
6. Standard Deviation (σ)
Definition
Square root of variance.
Formula:
$\sigma = \sqrt{\sigma^2}$
For population:
$\sigma = \sqrt{324} = 18$
For sample:
$s = \sqrt{360} \approx 18.97$
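As a cross-check on the hand calculation, the sketch below computes the population and sample versions of variance and standard deviation for the ten marks with Python's standard `statistics` module. The variable names are illustrative.

```python
import statistics

marks = [30, 50, 50, 60, 60, 60, 70, 80, 90, 90]

pop_var = statistics.pvariance(marks)     # divides by N
sample_var = statistics.variance(marks)   # divides by N - 1
pop_sd = statistics.pstdev(marks)         # square root of pop_var
sample_sd = statistics.stdev(marks)       # square root of sample_var

print("mean:", statistics.fmean(marks))                      # 64.0
print("variance (population, sample):", pop_var, sample_var)  # 324 and 360
print("std dev (population, sample):", round(pop_sd, 2), round(sample_sd, 2))  # 18.0 and 18.97
```

The `p`-prefixed functions treat the data as the whole population, while the unprefixed ones treat it as a sample, which is why the sample figures come out slightly larger.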
Interpretation:
Feature | Population | Sample
Denominator | N | N − 1
Use Case | When entire population is available | When using a subset (sample)
Key formulas:
$\sigma^2$ (Population Variance).
$s^2$ (Sample Variance).
$\sigma$ (Population Standard Deviation).
$s$ (Sample Standard Deviation).
Next Session: Advanced statistics concepts (skewness, kurtosis).
Stay tuned! 🚀
Extra Detailed Notes on Normal Distribution & Central Limit Theorem
1. Introduction
This session covers:
Central Limit Theorem (CLT): How the means of larger samples tend toward a normal distribution.
Key Feature: Most values are near the mean, with fewer occurring further away.
Graphical Representation
X-axis: Data values.
The peak represents the mean, median, and mode (which are equal in a perfect normal
distribution).
3. Properties of a Normal Distribution
1. Symmetric Around the Mean:
The left and right sides of the curve are mirror images.
Though data points at the extremes are rare, the tails of the curve approach zero without ever reaching it.
Mean < Median < Mode.
5. Bimodal Distributions
Definition: A distribution with two peaks.
Example:
IQ scores in a population.
Lifespan of light bulbs.
Key Implication
Even if data is skewed, taking many random samples and calculating their means produces a
distribution of sample means that approaches a normal distribution as the sample size grows.
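A small simulation can make this implication concrete. The sketch below is an illustration under stated assumptions: the skewed data come from an exponential distribution with mean 1, and the sample size (50) and number of samples (2000) are arbitrary choices, none of which appear in the original notes.

```python
import random
import statistics

rng = random.Random(42)          # fixed seed so the run is repeatable

SAMPLE_SIZE = 50                 # values per sample (illustrative choice)
NUM_SAMPLES = 2000               # number of samples drawn (illustrative choice)

def sample_mean(n):
    """Mean of n draws from a right-skewed exponential distribution (mean 1)."""
    return statistics.fmean(rng.expovariate(1.0) for _ in range(n))

means = [sample_mean(SAMPLE_SIZE) for _ in range(NUM_SAMPLES)]

print("mean of sample means:", round(statistics.fmean(means), 3))
print("std dev of sample means:", round(statistics.stdev(means), 3))
print("CLT prediction for std dev:", round(1 / SAMPLE_SIZE ** 0.5, 3))
```

The sample means cluster around the population mean of 1, with a spread close to σ/√n, even though the individual values are strongly skewed. Plotting a histogram of `means` should show the familiar bell shape.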
Example: Examining 100 student scores instead of the entire country’s scores.
Formula:
$P(|X - \mu| \ge k\sigma) \le \dfrac{1}{k^2}$
At k = 3 (3σ range), at most 11.1% of values are outside.
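The bound can be compared against observed data with a short Python sketch. The dataset below is hypothetical and deliberately skewed; the helper names `chebyshev_bound` and `fraction_outside` are assumptions made for illustration.

```python
import statistics

def chebyshev_bound(k):
    """Upper bound on the fraction of values at least k standard deviations from the mean."""
    if k <= 0:
        raise ValueError("k must be positive")
    return min(1.0, 1 / k ** 2)

def fraction_outside(data, k):
    """Observed fraction of values with |x - mean| >= k * (population std dev)."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    return sum(abs(x - mu) >= k * sigma for x in data) / len(data)

# Hypothetical right-skewed data, chosen only for illustration.
data = [1, 1, 2, 2, 2, 3, 3, 4, 5, 40]

for k in (2, 3):
    print(f"k={k}: observed {fraction_outside(data, k):.3f} <= bound {chebyshev_bound(k):.3f}")
```

The observed fraction never exceeds the bound, whatever the shape of the data. For data that really is normal, the empirical rule gives much tighter figures than Chebyshev's worst-case limit.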
9. Summary
Concept Description
Empirical Rule 68% within 1σ, 95% within 2σ, 99.7% within 3σ.
Chebyshev’s Inequality Works for any distribution but gives upper bounds only.
Key Focus:
🔹 Key Takeaway: Many real-world data follow normal distribution, and the Central Limit
Theorem explains why. This is fundamental for statistical analysis! 🚀
Extra Detailed Notes on Chebyshev’s Inequality & Probability Applications
1. Introduction
Recap of previous topics:
Provides an upper bound on how much data deviates from the mean.
Formula:
$\sigma^2 = \dfrac{\sum (x_i - \mu)^2}{N}$
Unit of variance: Square of the data’s unit.
Standard Deviation (σ)
Square root of variance:
$\sigma = \sqrt{\sigma^2}$
Helps understand how much values typically deviate from the mean.
When dataset is large, the frequency function turns into a probability density
function (PDF).
Formula
$P(|X - \mu| \ge k\sigma) \le \dfrac{1}{k^2}$
Where:
X = Random variable.
μ = Mean.
σ = Standard deviation.
k = Number of standard deviations away from the mean.
Meaning
Probability of being beyond k standard deviations from the mean is at most $\dfrac{1}{k^2}$.
Fraudulent range:
200 − (3 × 50) = 50
200 + (3 × 50) = 350
Transactions below $50 or above $350 are suspicious.
$P(|X - 200| \ge 3(50)) \le \dfrac{1}{3^2} = \dfrac{1}{9} \approx 11.1\%$
Interpretation:
AI flags these transactions for further review.
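A minimal sketch of this flagging rule in Python is shown below. The mean ($200), standard deviation ($50), and k = 3 come from the example; the transaction amounts and the function names are made up for illustration and are not any real fraud system's API.

```python
def chebyshev_limits(mean, std_dev, k):
    """Return (lower, upper) limits k standard deviations around the mean."""
    return mean - k * std_dev, mean + k * std_dev

def flag_suspicious(amounts, mean, std_dev, k=3):
    """Keep the amounts that fall outside the k-sigma limits."""
    lower, upper = chebyshev_limits(mean, std_dev, k)
    return [a for a in amounts if a < lower or a > upper]

# Hypothetical transaction amounts in dollars.
transactions = [180, 220, 45, 300, 360, 199]

print(chebyshev_limits(200, 50, 3))              # (50, 350)
print(flag_suspicious(transactions, 200, 50))    # [45, 360]
```

Chebyshev's inequality guarantees that at most about 11.1% of transactions can fall outside these limits, so the flagged set stays manageable regardless of how the amounts are distributed.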
$P(|X - 30| \ge 2(5)) \le \dfrac{1}{2^2} = \dfrac{1}{4} = 25\%$
Interpretation:
Threshold:
30 + (3 × 5) = 45
Using Chebyshev’s Formula:
$P(|X - 30| \ge 3(5)) \le \dfrac{1}{3^2} = \dfrac{1}{9} \approx 11.1\%$
Conclusion:
Feature Benefit
Key Takeaway: Chebyshev’s rule sets a safe boundary for extreme values, even if the
dataset isn’t normal.
Chebyshev’s Inequality:
Real-World Applications:
Next Session:
🔹 Key Takeaway: Chebyshev’s inequality helps analyze extreme values in any dataset,
making it useful in AI, fraud detection, and logistics. 🚀
Extra Detailed Notes on Types of Data in Statistics
1. Introduction
Why understanding data types is important?
Classification of Data:
Subtypes:
1. Discrete Data
2. Continuous Data
1. Discrete Data
Definition: Data that can take only specific, countable values (no in-between values).
Examples:
Steps taken per day (1000, 1500, 2000 but not 1750.8).
2. Continuous Data
Definition: Data that can take any value within a range, including decimals and
fractions.
Examples:
Subtypes:
1. Nominal Data
2. Ordinal Data
Examples:
Definition: Categorical data that has a meaningful order but unequal spacing between
categories.
Examples:
Customer satisfaction ratings:
Education levels:
Ranking in a competition:
Nominal (Qualitative) Categories with no specific order Eye color, blood type, city names
Scenario | Data Type | Explanation
Number of cars sold | Discrete | Countable whole numbers (e.g., 15, 22)
Time spent on a website | Continuous | Any value within a range (e.g., 3.45 seconds)
Favorite ice cream flavors | Nominal | No inherent order (Vanilla, Chocolate, Strawberry)
Students’ heights in a class | Continuous | Any value within a range (e.g., 165.3 cm)
Visualization Techniques:
7. Quiz Questions
Identify the Data Type
1. The number of students in a class: __________?
2. The brand of smartphone a person uses: __________?
Answers:
8. What’s Next?
Upcoming Topic: Percentiles & Quartiles.
Preparation:
🔹 Key Takeaway: Understanding data types is essential for choosing the right statistical
tools and interpreting results correctly! 🚀
Extra Detailed Notes on Paired Data, Scatter Plots & Correlation
Example:
Paired dataset: Recording caffeine consumption & heart rate of the same 1000
people.
Social Media & Sleep | Time spent on Social Media (hours) | Sleep Duration (hours)
Studying & Exam Performance | Study Time (hours) | Exam Score (%)
Weather & Beverage Sales | Temperature (°C) | Hot Beverage Sales (₹)
3. Visualizing Paired Data: Scatter Plots
Definition
A scatter plot is a graph where:
Example:
Example:
Example:
Value of r | Strength of Correlation | Type
0 | No Correlation | None
Examples
r = +0.93 → Strong positive correlation (Study Time vs Exam Scores).
r = −0.71 → Moderate negative correlation (Social Media vs Sleep).
r = 0.01 → No correlation (Shoe Size vs Grocery Spending).
6. Correlation ≠ Causation
Key Concept
Just because two variables are correlated doesn’t mean one causes the other.
7. How to Calculate Correlation in Google Sheets
Using the CORREL Function
1. Select a blank cell.
2. Type:
excel
=CORREL(A2:A1000, B2:B1000)
3. Press Enter.
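For readers working outside a spreadsheet, the same Pearson correlation coefficient can be sketched in plain Python. The function `pearson_r` is a hypothetical helper written for illustration, and the paired data are made up; it should give the same value as CORREL on the same two columns.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical paired data: study time (hours) and exam score (%).
study_hours = [1, 2, 3, 4, 5, 6]
exam_scores = [52, 55, 61, 70, 74, 83]

print(round(pearson_r(study_hours, exam_scores), 3))
```

A result close to +1 indicates a strong positive relationship, close to −1 a strong negative one, and close to 0 no linear relationship.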
Example Calculations
Concept Definition
Correlation ≠ Causation Correlation does not imply one variable causes the other.
9. What’s Next?
Upcoming Topics:
🔹 Key Takeaway: Paired data, scatter plots, and correlation are powerful tools to analyze
relationships between variables. However, correlation alone is not proof of causation! 🚀
Linear Interpolation
1. Introduction
Why is interpolation important?
Real-world applications:
0 → 1 → 2 → 3 → 4.
Suppose at 0 → height is A.
Suppose at 4 → height is B.
$B + \dfrac{3}{4}(A - B)$
Formula:
$Y = B + \dfrac{x}{\text{total distance}} (A - B)$
Where:
Y = Interpolated value.
A = Value at one endpoint.
B = Value at the other endpoint.
x = Distance from B to the required point.
total distance = Distance between A and B.
4. Example Calculations
Example 1: Temperature Estimation
Given Data:
Solution:
Ratio of time:
Ratio: $\dfrac{2}{4} = \dfrac{1}{2}$.
$Y = 20 + \dfrac{2}{4}(30 - 20) = 20 + \dfrac{2}{4} \times 10 = 20 + 5 = 25$°C
Given Data:
Solution:
Ratio: $\dfrac{3}{5}$.
$Y = 50{,}000 + \dfrac{3}{5}(80{,}000 - 50{,}000) = 50{,}000 + \dfrac{3}{5} \times 30{,}000 = 50{,}000 + 18{,}000 = 68{,}000$
Estimated salary in 2023 = ₹68,000.
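Both worked examples apply the same formula, which can be sketched as a small Python function. The name `interpolate` and its parameter names are illustrative assumptions; `b` is the known value at the starting endpoint, `a` the value at the other endpoint, and `x` the distance from the starting endpoint.

```python
def interpolate(b, a, x, total_distance):
    """Y = B + (x / total distance) * (A - B), with x measured from B."""
    if total_distance == 0:
        raise ValueError("total_distance must be non-zero")
    return b + x * (a - b) / total_distance

# Temperature example: 20 at the start, 30 at the end, 2 of 4 units along.
print(interpolate(20, 30, 2, 4))            # 25.0

# Salary example: 50,000 at the start, 80,000 at the end, 3 of 5 years along.
print(interpolate(50_000, 80_000, 3, 5))    # 68000.0
```

The function assumes the quantity changes at a constant rate between the two known points, which is exactly the assumption that makes the interpolation linear.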
6. Key Takeaways
Concept Definition
7. What’s Next?
Next Topic: Percentiles & Quartiles.
🔹 Key Takeaway: Linear interpolation is a powerful tool for estimating unknown values
between two known points, widely used in science, finance, and AI! 🚀
Extra Detailed Notes on Percentiles, Quartiles, and Linear Interpolation
1. Introduction to Percentiles
Definition
A percentile is a measure that indicates the relative standing of a value within a
dataset.
Key Concept
Percentile ≠ Percentage Score
If a student scores in the 80th percentile, it means they performed better than 80%
of students—not that they scored 80% marks.
Example: If a baby’s weight is in the 90th percentile, it means the baby is heavier
than 90% of other babies of the same age.
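The difference between a percentile and a percentage score can be illustrated with a short Python sketch. The marks are hypothetical, and the helper `percentile_rank` uses the "percentage of values strictly below" convention as an assumption (some definitions count values at or below instead).

```python
def percentile_rank(values, score):
    """Percentage of values strictly below the given score."""
    if not values:
        raise ValueError("values must not be empty")
    below = sum(v < score for v in values)
    return 100 * below / len(values)

# Hypothetical class of 10 exam marks (out of 100).
marks = [35, 42, 48, 51, 55, 58, 60, 62, 64, 70]

print(percentile_rank(marks, 64))   # 80.0
```

Here a student who scored only 64% of the marks still sits at the 80th percentile, because 8 of the 10 students scored lower.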
Key Quartiles:
Q1 (First Quartile / 25th Percentile): 25% of data lies below this value.
Q2 (Second Quartile / 50th Percentile / Median): 50% of data lies below this value.
Q3 (Third Quartile / 75th Percentile): 75% of data lies below this value.
Quartile Percentile Equivalent Meaning
3. Calculating Percentiles
Step 1: Sort the Data in Ascending Order
Percentiles work with ranked (ordered) data.
$I = \dfrac{P}{100} \times (N + 1)$
Where:
P = Desired percentile.
N = Total number of values in the dataset.
I = Rank index (can be an integer or decimal).
Case 2: Decimal Rank (Interpolation Needed)
If I is not an integer, the percentile value is estimated using linear interpolation.
Where:
5. Example Calculations
Example 1: Finding Q1 (25th Percentile)
15, 18, 21, 34, 38, 40, 45, 50, 50, 55, 60, 65, 70, 75, 80, 95
$I = \dfrac{25}{100} \times (16 + 1) = 4.25$
Rank 4 = 34.
Rank 5 = 38.
Interpolation:
= 34 + (0.25 × 4) = 34 + 1 = 35
Result: Q1 = 35.
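The rank-index steps can be sketched as one Python function. This is an illustration rather than a definitive implementation: the name `percentile` is hypothetical, and clamping the rank to the first or last value when it falls outside 1..N is an assumption the notes do not address.

```python
def percentile(sorted_data, p):
    """Percentile p (0-100) using the rank index I = p/100 * (N + 1)."""
    n = len(sorted_data)
    if n == 0:
        raise ValueError("data must not be empty")
    rank = p / 100 * (n + 1)            # 1-based rank index I
    rank = min(max(rank, 1), n)         # assumption: clamp to the first/last value
    lower = int(rank)                   # whole part of I
    fraction = rank - lower             # decimal part of I
    if fraction == 0:
        return sorted_data[lower - 1]   # integer rank: take that value directly
    low_value = sorted_data[lower - 1]
    high_value = sorted_data[lower]
    # decimal rank: interpolate between the two neighbouring values
    return low_value + fraction * (high_value - low_value)

data = sorted([15, 18, 21, 34, 38, 40, 45, 50, 50, 55, 60, 65, 70, 75, 80, 95])

print(percentile(data, 25))   # 35.0
print(percentile(data, 50))   # 50.0
```

The data must be sorted first, since the rank index refers to positions in the ordered list.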
Example 2: Finding Q2 (50th Percentile / Median)
Finding Q2 (50th percentile):
$I = \dfrac{50}{100} \times (16 + 1) = 8.5$
Rank 8 = 50.
Rank 9 = 50.
$I = \dfrac{75}{100} \times (16 + 1) = 12.75$
Rank 12 = 65.
Rank 13 = 70.
Interpolation:
6. Key Takeaways
Concept Definition
Quartiles (Q1, Q2, Q3) Divide data into four equal parts.
Rank Index Formula $I = \dfrac{P}{100} \times (N + 1)$
Linear Interpolation Used when I is not an integer.
7. Next Steps
Next Topic: Box Plots & Outlier Detection.
Preparation:
🔹 Key Takeaway: Percentiles and quartiles help in ranking and understanding data
spread. When rank falls between two values, linear interpolation is used! 🚀
Extra Detailed Notes on Percentiles, Quartiles & Google Sheets Functions
Key Quartiles
Q1 (First Quartile / 25th Percentile) → 25% of values below.
Example:
4. Select Ascending Order.
Where:
$I = \dfrac{25}{100} \times (20 + 1) = 5.25$
The 5th and 6th values in the sorted data are used.
5th value = 28
6th value = 32
= 28 + (0.25 × 4) = 28 + 1 = 29
Result:
Q1 = 29 (25% of values are below 29).
Example Calculation for Q2 (50th Percentile / Median)
$I = \dfrac{50}{100} \times (20 + 1) = 10.5$
The 10th and 11th values in the sorted data are used.
10th value = 45
11th value = 48
Result:
Q2 = 46.5 (50% of values are below 46.5).
$I = \dfrac{75}{100} \times (20 + 1) = 15.75$
The 15th and 16th values in the sorted data are used.
15th value = 65
16th value = 68
Result:
Q3 = 67.25 (75% of values are below 67.25).
Statistic Value
Q1 (25th Percentile) 29
Function: PERCENTILE.EXC (Exclusive Percentiles)
Uses N + 1 formula.
Formula:
excel
=PERCENTILE.EXC(A2:A21, 0.25) # Q1
=PERCENTILE.EXC(A2:A21, 0.50) # Q2 (Median)
=PERCENTILE.EXC(A2:A21, 0.75) # Q3
excel
=PERCENTILE.INC(A2:A21, 0.25) # Q1
=PERCENTILE.INC(A2:A21, 0.50) # Q2 (Median)
=PERCENTILE.INC(A2:A21, 0.75) # Q3
Formula:
excel
=MEDIAN(A2:A21)
Formula:
excel
=MIN(A2:A21) # Minimum
=MAX(A2:A21) # Maximum
Different Methods PERCENTILE.INC uses the N − 1 method (also the default in R and NumPy), while PERCENTILE.EXC uses the N + 1 method.
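The effect of the two conventions can be seen side by side with Python's standard `statistics.quantiles`, whose `method` argument accepts `"exclusive"` (positions based on N + 1) and `"inclusive"` (positions based on N − 1). The 16-value dataset below is illustrative only; these two methods should correspond to PERCENTILE.EXC and PERCENTILE.INC respectively, but that mapping is stated here as an assumption worth verifying on your own data.

```python
import statistics

# Illustrative 16-value dataset (an assumption; substitute the values from column A).
data = [15, 18, 21, 34, 38, 40, 45, 50, 50, 55, 60, 65, 70, 75, 80, 95]

q_exclusive = statistics.quantiles(data, n=4, method="exclusive")   # N + 1 method
q_inclusive = statistics.quantiles(data, n=4, method="inclusive")   # N - 1 method

print("exclusive (Q1, Q2, Q3):", q_exclusive)   # [35.0, 50.0, 68.75]
print("inclusive (Q1, Q2, Q3):", q_inclusive)   # [37.0, 50.0, 66.25]
```

The median agrees under both methods here, while Q1 and Q3 differ, which is why it matters to state which formula was used when reporting quartiles.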
8. What’s Next?
Upcoming Topic: Box Plots & Outlier Detection.
Preparation:
🔹 Key Takeaway: Percentiles & Quartiles help in ranking data, and Google Sheets
functions provide different ways to compute them. Always be mindful of which formula is
used! 🚀
Box plots
Grouped Data
It shows median, quartiles, interquartile range (IQR), outliers, and skewness in the
data.
Component Meaning
A large IQR indicates greater variability, while a small IQR suggests closely packed
data.
B. Interquartile Range (IQR)
A wider IQR → More variation in salaries.
C. Skewness of Data
Right-Skewed (Positive Skew) | Median near bottom of box | More high-value outliers
Left-Skewed (Negative Skew) | Median near top of box | More low-value outliers
D. Identifying Outliers
Outliers appear as dots beyond whiskers.
Example:
B. Finance Sector
Higher median → Salaries are higher on average.
C. Healthcare Sector
Median slightly above center → Slight right skew.
D. Retail Sector
Low median → One of the lowest paying sectors.
E. Tech Sector
One of the highest medians → High-paying industry.
Insights:
Wide IQR in English → High variation in performance.
7. Key Takeaways
Concept Insight
8. Next Steps
Next Topic: Advanced statistical analysis with box plots.
Key Focus:
🔹 Key Takeaway: Box plots are powerful tools for comparing distributions, detecting
outliers, and understanding variability in grouped data! 🚀
Box Plots & Outlier Detection
Variability
Presence of outliers
Alternative names:
Identifies potential outliers (values that lie outside the expected range).
Component | Definition
Box (Middle 50% of Data) | Represents Interquartile Range (IQR), which contains the middle 50% of values.
IQR (Interquartile Range) | IQR = Q3 - Q1. Measures the spread of the middle 50% of data.
IQR = Q3 − Q1
The IQR represents the range where the middle 50% of data lies.
Upper whisker limit:
Q3 + 1.5 × IQR
Lower whisker limit:
Q1 − 1.5 × IQR
Data points beyond whiskers are outliers.
4. Example: Video Game Review Scores
Dataset (Review Scores from 15 Players)
73, 78, 79, 80, 81, 83, 84, 85, 87, 88, 90, 92, 95, 97, 99
Statistic Value
Minimum 73
Q1 (25th percentile) 78
Q3 (75th percentile) 92
Maximum 99
IQR = Q3 − Q1 = 92 − 78 = 14
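The 1.5 × IQR rule can be sketched in Python using the quartile values stated in the table for this dataset. The function names `iqr_fences` and `find_outliers` are hypothetical helpers for illustration; the quartiles are passed in as given rather than recomputed.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower, upper) limits: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(values, q1, q3, k=1.5):
    """Values lying beyond either limit."""
    lower, upper = iqr_fences(q1, q3, k)
    return [v for v in values if v < lower or v > upper]

scores = [73, 78, 79, 80, 81, 83, 84, 85, 87, 88, 90, 92, 95, 97, 99]
q1, q3 = 78, 92                          # quartiles as stated in the table

print(iqr_fences(q1, q3))                # (57.0, 113.0)
print(find_outliers(scores, q1, q3))     # []
```

With these limits none of the 15 review scores is an outlier, so the whiskers extend to the minimum (73) and maximum (99).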
Y-Axis: Review scores (0-100 scale).
Plot Elements:
6. Effect of an Outlier
New Dataset with an Outlier
40, 73, 78, 79, 80, 81, 83, 84, 85, 87, 88, 90, 92, 95, 97, 99
Updated Calculations
Statistic Value
Minimum 40 (Outlier)
Maximum 99
Outlier Identification
New minimum = 40 (which is below 54, making it an outlier).
Observation Insight
Key Takeaways
Outliers do not significantly change the central tendency (median).
Concept Definition
9. What’s Next?
Upcoming Topic: Comparing Multiple Box Plots in Grouped Data.
Key Focus:
🔹 Key Takeaway: Box plots are essential tools for detecting data spread, skewness, and
outliers. They provide a clear view of the middle 50% of data and help identify extreme
values! 🚀
Box Plots for Grouped Data & Outlier Detection
In grouped data, multiple box plots are created to compare distributions across
different categories.
Identifies Variability: Some groups may have higher variability than others.
Component Definition
Q1 (First Quartile, 25th Percentile) 25% of the data falls below this value.
Q3 (Third Quartile, 75th Percentile) 75% of the data falls below this value.
The box plot is based on quartiles (Q1, Q2, Q3) and the interquartile range (IQR).
IQR = Q3 − Q1
Upper Fence (Maximum Expected Value):
Q3 + 1.5 × IQR
Lower Fence (Minimum Expected Value):
Q1 − 1.5 × IQR
Any values beyond these fences are considered outliers.
30s (30-39 years)
If age ≤ 29 → "20s"
If 30 ≤ age ≤ 39 → "30s"
If 40 ≤ age ≤ 49 → "40s"
If 50 ≤ age ≤ 59 → "50s"
If 60 ≤ age ≤ 69 → "60s"
If age ≥ 70 → "70+"
excel
=FILTER(BloodPressure, AgeGroup="20s")
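The grouping rules and the filtering step can also be sketched in Python. This is a minimal illustration assuming whole-number ages; the records and the function names `age_group` and `filter_by_group` are hypothetical and not part of the original spreadsheet.

```python
def age_group(age):
    """Map an age in whole years to the decade label used in the grouping rules."""
    if age <= 29:
        return "20s"
    if age >= 70:
        return "70+"
    return f"{age // 10 * 10}s"      # 30-39 -> "30s", ..., 60-69 -> "60s"

def filter_by_group(rows, label):
    """Blood pressure readings whose age falls in the given group."""
    return [bp for age, bp in rows if age_group(age) == label]

# Hypothetical (age, blood pressure) records for illustration.
records = [(24, 118), (35, 121), (47, 130), (52, 128), (66, 141), (73, 150), (28, 115)]

print(filter_by_group(records, "20s"))   # [118, 115]
```

Each group's filtered readings can then be summarized (Min, Q1, Median, Q3, Max) to draw one box plot per age group.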
Age Group Min Q1 Median (Q2) Q3 Max
No significant outliers.
Higher IQR than 30s → More variability.
2. Go to Insert → Chart.
Observation | Interpretation
20s & 30s have the smallest IQRs | Blood pressure is more stable in younger individuals.
40s & 50s show increased IQR | More variation in blood pressure, likely due to lifestyle and health factors.
60s & 70+ have the largest IQRs | Older individuals have widely varying blood pressure levels.
Outliers in 30s & 70+ groups | Unusual cases, possibly due to high blood pressure disorders.
9. Next Steps
Upcoming Topic: Comparing Multiple Box Plots for Statistical Analysis.
Focus Areas:
🔹 Key Takeaway: Box plots provide a clear way to analyze data distribution, detect
outliers, and compare groups. They are widely used in data analysis, healthcare, and
finance! 🚀