Statistics for Data Science: Complete
Study Guide
Essential Statistical Concepts for Data Analysis
Chapter 1: Descriptive Statistics
1.1 Measures of Central Tendency
Mean: Sum of all values divided by count. Sensitive to outliers. Formula: x̄ = Σx/n (sample), µ = Σx/N (population)
Median: Middle value when the data is sorted. Robust to outliers. For even n: average of the two middle values
Mode: Most frequently occurring value(s). Can be multimodal. Useful for categorical data.
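As a quick illustration, these measures can be computed with Python's standard `statistics` module (the sample data here is made up):

```python
import statistics

data = [2, 3, 3, 5, 7, 10, 48]  # made-up sample with one outlier (48)

mean = statistics.mean(data)      # pulled upward by the outlier
median = statistics.median(data)  # middle of the sorted values; robust
mode = statistics.mode(data)      # most frequent value

print(mean, median, mode)  # the outlier inflates the mean well above the median
```

Note how a single extreme value drags the mean above the median, which is exactly the outlier sensitivity described above.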
1.2 Measures of Dispersion
Variance (σ²): Average squared deviation from the mean
• Population: σ² = Σ(x-µ)²/N
• Sample: s² = Σ(x-x̄)²/(n-1) [Bessel's correction]
Standard Deviation (σ): Square root of variance. Same units as the data. 68-95-99.7 rule for normal distributions
Interquartile Range (IQR): Q3 - Q1. Robust measure of spread. Used for outlier detection: outliers lie beyond Q1-1.5×IQR or Q3+1.5×IQR
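A short sketch of these dispersion measures with the standard library (sample values are invented; note that `statistics.quantiles` uses the "exclusive" method by default, so quartiles may differ slightly from other conventions):

```python
import statistics

data = [3, 4, 5, 6, 7, 8, 9, 30]  # invented sample; 30 is suspiciously large

s2 = statistics.variance(data)       # sample variance, divides by n-1 (Bessel)
sigma2 = statistics.pvariance(data)  # population variance, divides by N

# IQR-based outlier fences
q1, _, q3 = statistics.quantiles(data, n=4)  # quartile cut points
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < lower or x > upper]
print(outliers)  # the 1.5×IQR rule flags the extreme value
```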
1.3 Shape Measures
Skewness: Measure of asymmetry
• Positive skew: tail extends right, mean > median
• Negative skew: tail extends left, mean < median
• Zero skew: symmetric distribution
Kurtosis: Measure of tail heaviness
• Leptokurtic: heavy tails, high peak (kurtosis > 3)
• Platykurtic: light tails, flat peak (kurtosis < 3)
• Mesokurtic: normal distribution (kurtosis = 3)
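Skewness can be computed directly from its definition; the helper below is a population-moment sketch on invented data:

```python
import statistics

def skewness(xs):
    """Population skewness: third central moment divided by sigma cubed."""
    mu = statistics.fmean(xs)
    sigma = statistics.pstdev(xs)
    n = len(xs)
    return sum((x - mu) ** 3 for x in xs) / (n * sigma ** 3)

right_skewed = [1, 2, 2, 3, 3, 4, 20]  # long right tail
print(skewness(right_skewed))  # positive, and mean (5.0) > median (3)
```

The positive result matches the rule of thumb above: the long right tail pulls the mean above the median.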
Chapter 2: Probability Theory
2.1 Basic Probability
Probability Axioms:
1. 0 ≤ P(A) ≤ 1 for any event A
2. P(S) = 1, where S is the sample space
3. For mutually exclusive events: P(A∪B) = P(A) + P(B)
Conditional Probability: P(A|B) = P(A∩B)/P(B)
Bayes' Theorem: P(A|B) = P(B|A)×P(A)/P(B)
Applications: spam filtering, medical diagnosis, A/B testing
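The medical-diagnosis application of Bayes' theorem can be sketched numerically (all rates here are hypothetical):

```python
# Hypothetical diagnostic test
p_disease = 0.01       # prior: 1% prevalence
sensitivity = 0.95     # P(positive | disease)
false_positive = 0.05  # P(positive | no disease)

# Law of total probability: overall chance of a positive result
p_positive = sensitivity * p_disease + false_positive * (1 - p_disease)

# Bayes' theorem: P(disease | positive)
posterior = sensitivity * p_disease / p_positive
print(posterior)  # ≈ 0.16: at low prevalence, most positives are false positives
```

Even with a 95%-sensitive test, the posterior is only about 16%, which is why base rates matter in diagnosis.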
2.2 Probability Distributions
Distribution   Type         Parameters          Use Cases
Bernoulli      Discrete     p (success prob)    Binary outcomes
Binomial       Discrete     n, p                Number of successes in n trials
Poisson        Discrete     λ (rate)            Count of events in an interval
Normal         Continuous   µ, σ                Natural phenomena, CLT
Exponential    Continuous   λ (rate)            Time between events
Chi-Square     Continuous   df                  Goodness-of-fit tests
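Several of these distributions can be simulated from first principles with the standard `random` module (a sketch, not a production sampler):

```python
import random

random.seed(42)  # reproducibility

def bernoulli(p):
    """Single binary trial: 1 with probability p, else 0."""
    return 1 if random.random() < p else 0

def binomial(n, p):
    """Number of successes in n independent Bernoulli(p) trials."""
    return sum(bernoulli(p) for _ in range(n))

gap = random.expovariate(2.0)  # Exponential(λ=2): time between events
x = random.gauss(0.0, 1.0)     # Normal(µ=0, σ=1)
k = binomial(10, 0.5)          # Binomial(n=10, p=0.5)
```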
Chapter 3: Statistical Inference
3.1 Hypothesis Testing
Steps:
1. State null (H₀) and alternative (H₁) hypotheses
2. Choose significance level (α, typically 0.05)
3. Select an appropriate test statistic
4. Calculate the p-value or critical value
5. Make a decision: reject H₀ if p-value < α
Types of Errors:
• Type I: Reject a true H₀ (probability = α)
• Type II: Fail to reject a false H₀ (probability = β)
• Power = 1 - β (probability of detecting a true effect)
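Following the steps above for a one-sample t-test, the test statistic can be computed by hand (sample values invented; the critical value comes from a standard t-table):

```python
import math
import statistics

def one_sample_t(xs, mu0):
    """t = (x̄ - µ0) / (s / √n), with s the sample standard deviation."""
    n = len(xs)
    xbar = statistics.fmean(xs)
    s = statistics.stdev(xs)
    return (xbar - mu0) / (s / math.sqrt(n))

sample = [5.1, 4.9, 5.3, 5.2, 4.8, 5.4, 5.0, 5.2]  # invented measurements
t = one_sample_t(sample, mu0=5.0)
# Compare |t| against t_(α/2, n-1) = t_(0.025, 7) ≈ 2.365 for α = 0.05
print(t)  # here |t| < 2.365, so we fail to reject H0
```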
3.2 Common Statistical Tests
Parametric Tests (assume normal distribution):
• t-test: Compare means (one-sample, two-sample, paired)
• ANOVA: Compare means across multiple groups
• Pearson correlation: Linear relationship strength
Non-parametric Tests (distribution-free):
• Mann-Whitney U: Alternative to the two-sample t-test
• Kruskal-Wallis: Alternative to ANOVA
• Spearman correlation: Monotonic relationship strength
• Chi-square test: Independence or goodness of fit
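Spearman correlation is just Pearson correlation on ranks; with no ties it reduces to the classic formula 1 - 6Σd²/(n(n²-1)), sketched below:

```python
def ranks(xs):
    """1-based rank of each value; assumes no ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    """Spearman rho via 1 - 6Σd²/(n(n²-1)); valid only without ties."""
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    n = len(x)
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Perfectly monotonic but non-linear relationship: rho = 1
print(spearman([1, 2, 3, 4, 5], [1, 4, 9, 16, 25]))  # 1.0
```

This is why Spearman suits monotonic relationships: y = x² is non-linear, yet the rank correlation is exactly 1.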
3.3 Confidence Intervals
Interpretation: If we repeated the sampling many times, 100(1-α)% of the constructed intervals would contain the true parameter.
For a mean (known σ): x̄ ± z_(α/2) × σ/√n
For a mean (unknown σ): x̄ ± t_(α/2,df) × s/√n
For a proportion: p̂ ± z_(α/2) × √(p̂(1-p̂)/n)
Width depends on: sample size (n), variability (σ), confidence level (1-α)
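The unknown-σ interval can be computed directly (data invented; the critical value 2.365 for df = 7 at 95% confidence is taken from a standard t-table):

```python
import math
import statistics

data = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 12.0]  # invented sample
n = len(data)
xbar = statistics.fmean(data)
s = statistics.stdev(data)

t_crit = 2.365  # t_(0.025, df=7) from a t-table
half_width = t_crit * s / math.sqrt(n)
ci = (xbar - half_width, xbar + half_width)
print(ci)  # 95% confidence interval for the mean
```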
Chapter 4: Regression Analysis
4.1 Linear Regression
Simple Linear Regression: y = β₀ + β₁x + ε
• Least squares estimation minimizes Σ(y - ŷ)²
• Assumptions: linearity, independence, normality, homoscedasticity
Multiple Linear Regression: y = β₀ + β₁x₁ + ... + βₚxₚ + ε
• Adjusted R² penalizes additional predictors
• Multicollinearity: high correlation between predictors
• Variable selection: forward, backward, stepwise
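The least-squares estimates for simple linear regression have a closed form, β̂₁ = Σ(x-x̄)(y-ȳ)/Σ(x-x̄)² and β̂₀ = ȳ - β̂₁x̄, sketched here on made-up data:

```python
import statistics

x = [1, 2, 3, 4, 5]
y = [2.1, 4.0, 6.2, 7.9, 10.1]  # made-up, roughly y ≈ 2x

xbar, ybar = statistics.fmean(x), statistics.fmean(y)
sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))  # Σ(x-x̄)(y-ȳ)
sxx = sum((a - xbar) ** 2 for a in x)                     # Σ(x-x̄)²

beta1 = sxy / sxx            # slope estimate
beta0 = ybar - beta1 * xbar  # intercept estimate
print(beta0, beta1)          # close to the generating relationship y ≈ 2x
```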
4.2 Model Evaluation
Metrics:
• R²: Proportion of variance explained (0 to 1)
• RMSE: √(Σ(y-ŷ)²/n) - same units as y
• MAE: Σ|y-ŷ|/n - more robust to outliers
• AIC/BIC: Balance fit and complexity
Diagnostics:
• Residual plots: check assumptions
• Q-Q plots: assess normality
• Cook's distance: identify influential points
• VIF: detect multicollinearity (VIF > 10 is problematic)
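These metrics can be computed directly from their formulas (predictions here are fabricated for illustration):

```python
import math
import statistics

y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.8, 5.3, 2.9, 6.8]  # fabricated model output

errors = [t - p for t, p in zip(y_true, y_pred)]
n = len(y_true)

rmse = math.sqrt(sum(e ** 2 for e in errors) / n)  # penalizes large errors
mae = sum(abs(e) for e in errors) / n              # robust to outliers

ybar = statistics.fmean(y_true)
ss_res = sum(e ** 2 for e in errors)               # residual sum of squares
ss_tot = sum((t - ybar) ** 2 for t in y_true)      # total sum of squares
r2 = 1 - ss_res / ss_tot                           # variance explained
```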
Chapter 5: Data Science Applications
5.1 A/B Testing
Design Considerations:
• Sample size calculation: depends on effect size, power, and significance level
• Randomization: ensure comparable groups
• Multiple testing correction: Bonferroni, FDR
• Early stopping: sequential testing methods
Common Pitfalls:
• Peeking at results (inflates Type I error)
• Simpson's paradox in segmented analysis
• Novelty effects and seasonality
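A two-proportion z-test is a common way to analyze such an experiment; the conversion counts below are invented:

```python
import math

# Invented A/B counts
conv_a, n_a = 120, 2400   # control: 120/2400 = 5.0% conversion
conv_b, n_b = 156, 2400   # variant: 156/2400 = 6.5% conversion

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled proportion under H0

se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se
# Two-sided test at α = 0.05: reject H0 if |z| > 1.96
print(z)
```

Note that this test is only valid if the analysis time was fixed in advance; checking z repeatedly as data arrives is exactly the "peeking" pitfall above.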
5.2 Time Series Analysis
Components:
• Trend: long-term direction
• Seasonality: regular periodic patterns
• Cyclic: irregular long-term fluctuations
• Noise: random variation
Methods:
• Moving averages: smoothing
• ARIMA: autoregressive integrated moving average
• Exponential smoothing: weighted averages
• Decomposition: separate the components
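A trailing moving average is the simplest of these smoothers; a minimal sketch on made-up observations:

```python
def moving_average(series, window):
    """Trailing moving average: mean of the last `window` observations."""
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]

series = [10, 12, 11, 13, 15, 14, 16]  # made-up daily observations
print(moving_average(series, 3))  # [11.0, 12.0, 13.0, 14.0, 15.0]
```

The smoothed series is shorter than the input by window - 1 points, since the first full window only forms at the third observation.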
Quick Reference: Statistical Formulas
Concept                  Formula                     When to Use
Standard Error (mean)    SE = σ/√n                   Sampling distribution of the mean
Z-score                  z = (x-µ)/σ                 Standardization, outlier detection
Correlation              r = Cov(X,Y)/(σ_X σ_Y)      Linear relationship strength
Effect Size (Cohen's d)  d = (µ₁-µ₂)/σ_pooled        Practical significance
Chi-square statistic     χ² = Σ(O-E)²/E              Categorical data analysis
F-statistic              F = MS_between/MS_within    ANOVA, variance comparison
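As a worked example of the chi-square formula, a goodness-of-fit test for a fair die (observed counts invented):

```python
# 120 invented die rolls; a fair die expects 20 of each face
observed = [18, 22, 16, 14, 19, 31]
expected = [20] * 6

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi2)  # 9.1; compare to the χ² critical value with df = 5 (≈ 11.07 at α = 0.05)
```

Here χ² = 9.1 falls below the critical value, so the data do not provide evidence against the fair-die hypothesis.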
Remember: Statistics is about understanding uncertainty and making informed decisions from data!